Finding, reusing, and citing data
May 18th, 2011
Challenges to data reuse
| Discovery | Many datasets are not findable through Google. Search journals for references to datasets, or ask a librarian to search for relevant datasets. |
| Acquisition | Even when authors say they offer dataset access, they may not respond to requests. Be persistent. |
| Cost | If you need a for-pay dataset, contact your liaison librarian to see whether the library can purchase it. |
| Licensing | Some dataset owners impose stringent reuse requirements. Read any data license carefully to be sure you will be allowed the use you want. |
| Security | Datasets with licensing or other confidentiality restrictions on access need strong digital security. If you need to safeguard a dataset, talk to the Office of Campus Information Security. |
Finding data for reuse
Valuable digital data live in many places:
| The open web | Government websites often include useful datasets, and some university researchers are also choosing to open their data for reuse. |
| Discipline-based data repositories | If you're not sure whether a repository exists in your research area, ask a librarian. |
| Campus-based data repositories | Often these are findable on the open web. |
| With researchers or research groups | These are the hardest to locate and to gain access to. The best route may be to contact the authors of relevant publications. |
Reusing data
Be prepared for:
| Data cleanup | Many datasets are poorly organized, or only available in difficult-to-reuse forms. Google Refine or the Data Science Toolkit may be useful cleanup tools. |
| Data interpretation difficulties | Many datasets lack data dictionaries and other necessary documentation. You may have to work with the data creator to understand the dataset at all. |
| Data disappearance | Like any other website, online datasets can and do disappear without warning. Data formally deposited in data repositories tend to persist longest. |
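Routine data cleanup of the kind described above can often be scripted. The following is a minimal sketch in Python, not a substitute for a dedicated cleanup tool. The sample data, the `clean_rows` function, and the list of missing-value markers are all hypothetical and chosen only for illustration; a real dataset will need its own rules.

```python
import csv
import io

# Hypothetical markers that a dataset might use for missing values.
MISSING_MARKERS = {"", "na", "n/a", "null", "-"}


def clean_rows(text):
    """Parse CSV text and return a list of cleaned row dictionaries."""
    reader = csv.reader(io.StringIO(text))
    raw_header = next(reader)
    # Normalize column names: trim, lowercase, replace spaces with underscores.
    header = [name.strip().lower().replace(" ", "_") for name in raw_header]

    cleaned = []
    seen = set()
    for raw in reader:
        values = [value.strip() for value in raw]
        # Skip rows that are entirely blank.
        if not any(values):
            continue
        # Replace missing-value markers with None.
        values = [None if v.lower() in MISSING_MARKERS else v for v in values]
        key = tuple(values)
        # Skip exact duplicates of rows already kept.
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(dict(zip(header, values)))
    return cleaned


# Made-up sample with messy headers, stray spaces, a duplicate, and a blank row.
sample = """ Site Name , Year,Count
Lake A, 2009 , 12
Lake A,2009,12
Lake B,2010,N/A
,,
"""

rows = clean_rows(sample)
for row in rows:
    print(row)
```

Keep the original file untouched and write cleaned output to a new file, so that every change you make can be traced and, if necessary, undone.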
Citing data
Many disciplines do not yet have data citation standards, though the DataCite initiative is working on them. Current workarounds include:
- Citing a “data paper,” where available.
- Citing a journal article describing the dataset.
- Citing the dataset as a website, where possible.
Cite datasets for the same reasons you cite books and journal articles: so that dataset creators receive appropriate credit for their work, and to make clear the antecedents of your research.
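When you cite a dataset as a website, it helps to record the same elements every time. The sketch below shows one way to assemble them in Python. The element order (creator, year, title, publisher, location, access date) is an assumption made for illustration, not a published standard, and the `format_dataset_citation` function, names, and URL are hypothetical. Follow your discipline's style guide where one exists.

```python
from datetime import date


def format_dataset_citation(creator, year, title, publisher, location, accessed=None):
    """Build a one-line citation string for a dataset.

    The element order here is an assumption for illustration, not a
    published standard; adapt it to your discipline's style guide.
    """
    parts = [f"{creator} ({year}).", f"{title} [dataset].", f"{publisher}.", location]
    citation = " ".join(parts)
    if accessed is not None:
        # Record the access date because online datasets can disappear.
        citation += f" (accessed {accessed.isoformat()})"
    return citation


# Hypothetical dataset details, used only to show the output shape.
example = format_dataset_citation(
    creator="Doe, J.",
    year=2010,
    title="Example lake survey counts",
    publisher="Example University Data Repository",
    location="https://example.org/datasets/lake-survey",
    accessed=date(2011, 5, 18),
)
print(example)
```

Including the access date is a deliberate choice: if the dataset later moves or vanishes, readers can still tell which version you are likely to have used.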