Finding, reusing, and citing data

May 18th, 2011

Challenges to data reuse

Discovery: Many datasets are not findable through Google. Search journals for references to datasets, or ask a librarian to search for relevant datasets.
Acquisition: Even when authors say they offer dataset access, they may not respond to requests. Be persistent.
Cost: If you need a for-pay dataset, contact your liaison librarian to see whether the library can purchase it.
Licensing: Some dataset owners impose stringent reuse requirements. Read any data license carefully to be sure you will be allowed the use you want.
Security: Datasets with licensing or other confidentiality restrictions on access need strong digital security. If you need to safeguard a dataset, talk to the Office of Campus Information Security.

Finding data for reuse

Valuable digital data live in many places:

The open web: Government websites often include useful datasets, and some university researchers are also choosing to open their data for reuse.
Discipline-based data repositories: If you're not sure whether a repository exists in your research area, ask a librarian.
Campus-based data repositories: Often these are findable on the open web.
With researchers or research groups: These are hardest to locate and gain access to. The best route may be through contacting authors of relevant publications.

Reusing data

Be prepared for:

Data cleanup: Many datasets are poorly organized, or only available in difficult-to-reuse forms. Google Refine or the Data Science Toolkit may be useful cleanup tools.
Data interpretation difficulties: Many datasets lack data dictionaries and other necessary documentation. You may have to work with the data creator to understand the dataset at all.
Data disappearance: Like any other website, online datasets can and do disappear without warning. Data formally deposited in data repositories tend to persist longest.
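
For small tabular datasets, some cleanup can also be scripted. The following is a minimal sketch in Python, not a substitute for the tools named above. It assumes the dataset is comma-separated text; the function name clean_rows and the set of strings treated as missing values are illustrative assumptions, not part of any standard.

```python
import csv
import io

# Assumption: strings that this sketch treats as "no value recorded".
MISSING = {"", "NA", "N/A", "-"}


def clean_rows(raw_text):
    """Return a list of dicts with tidy column names and trimmed values."""
    rows = list(csv.reader(io.StringIO(raw_text)))
    if not rows:
        return []
    # Normalize headers: trim, lowercase, and replace spaces with underscores.
    headers = [h.strip().lower().replace(" ", "_") for h in rows[0]]
    cleaned = []
    for row in rows[1:]:
        values = [v.strip() for v in row]
        # Skip rows that contain no data at all.
        if not any(values):
            continue
        # Replace placeholder strings with None so gaps are explicit.
        values = [None if v in MISSING else v for v in values]
        # Note: zip() silently drops cells beyond the header count.
        cleaned.append(dict(zip(headers, values)))
    return cleaned
```

Calling clean_rows on the text of a downloaded file returns one dictionary per data row, keyed by the normalized column names. Keep an untouched copy of the original file as well, so that every cleanup step can be repeated or checked later.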

Citing data

Data citation standards do not exist in many disciplines, though the DataCite initiative is working on them. Current workarounds include:

  • Citing a “data paper,” where available.
  • Citing a journal article describing the dataset.
  • Citing the dataset as a website, where possible.
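
As an illustration of the last workaround, the sketch below assembles a website-style citation for a dataset. The element order (creator, year, title, publisher, URL, access date) is an assumption chosen for the example, not an established standard; follow the citation style your discipline or publisher requires. The function name is hypothetical.

```python
from datetime import date


def cite_dataset_as_website(creator, year, title, publisher, url, accessed):
    """Build a website-style citation string for a dataset.

    The layout is illustrative only; adapt it to your citation style.
    """
    return (
        f"{creator} ({year}). {title}. {publisher}. "
        f"{url} (accessed {accessed.isoformat()})."
    )
```

Recording the access date matters for datasets in particular, because an online dataset may change or disappear after you retrieve it.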

Cite datasets for the same reasons you cite books and journal articles: for dataset creators to receive appropriate credit for their work, and to make clear the antecedents to your research.