Are you looking for datasets to help you in your research or to use in a class assignment? We have compiled the following resource to help you find, reuse, and cite datasets across a variety of disciplines. We’ve also included a few best practices you’ll want to keep in mind in order to reuse datasets responsibly.
Datasets that have been made openly available in response to the COVID-19 pandemic can be found here.
Tips for reusing data
Discovery: While some datasets may be findable via Google, it is best to search disciplinary and generalist data repositories. Search journals for references to datasets, search the library databases using our RDS Guide to Finding Datasets, or ask a subject librarian to help you search for relevant datasets.
Acquisition: How you acquire data depends on how it is shared. Some data are openly available for immediate use via repositories, while other data may be available only by request from restricted repositories. Some journals now require data to be made available at the time of publication, so it is worth checking an article's supplementary information for links to the data. If data are available only from an author, you can reach out and request the dataset. Even authors who say they offer dataset access may not respond to requests, so you may need to be persistent.
Cost: Some proprietary or restricted datasets may have costs associated with them. Please contact your department’s liaison librarian or Research Data Services (email@example.com) for assistance.
Licensing: Some datasets and resources may come with reuse requirements. Read any data license carefully to be sure you will be allowed the use you want. For more information on data licenses see Cornell’s Introduction to intellectual property rights in data management.
Security: Datasets with confidentiality restrictions, health information, or licensing requirements on access need strong digital security. If you need to safeguard a dataset, talk to the Office of Cybersecurity. Datasets with protected health information need to be in compliance with UW-Madison HIPAA Policies. Please see the Office of Compliance for help with HIPAA restricted datasets, the HIPAA Security Program page for information on security requirements, and the HIPAA – Training page for information on training and requirements.
Finding data for reuse
Valuable digital data live in many places:
Data from government/non-profit organizations. Government websites often include useful datasets, and some university researchers are also choosing to open their data for reuse. Below are links to sites that host open data, including government websites. Some also share helpful resources for data management, analysis, and instruction.
- Data.gov hosts over 235,000 open, machine-readable datasets from the U.S. government on a wide range of topics, including agriculture, manufacturing, health, public safety, and energy.
- The NIH’s National Library of Medicine maintains a list of domain-specific and general repositories for sharing and accessing datasets for reuse.
- The NEA maintains a list of publicly available data sources for datasets related to arts and culture including some of those from the National Archive of Data on Arts and Culture.
- The US Geological Survey is a primary source of geographic information system (GIS) data.
- The National Historic Geographic Information System provides access to summary tables and time series of population, housing, agriculture, and economic data, along with GIS-compatible boundary files, for years from 1790 through the present and for all levels of U.S. census geography.
- ISIDORE is an online platform that provides access to open digital data for the humanities and social sciences and is the largest open digital library in English, Spanish, and French.
- MLA CORE is a repository for digital humanities materials including course materials, white papers, conference papers, data, code, and digital projects.
- You can also use the Open Access Directory of Data Repositories to find open access datasets, organized by discipline.
Discipline-based data repositories. If you’re not sure whether a repository exists in your research area, you can search re3data, a global registry of research data repositories, or ask a librarian, who can suggest resources for you.
- DataOne is a community driven project that provides access to data from multiple member repositories, focusing on Earth and environmental data. It also hosts several helpful data management learning modules and webinars.
- ICPSR (the Inter-university Consortium for Political and Social Research), which consists of over 750 academic institutions, provides access to datasets for the social and behavioral sciences research community. It also offers training in data curation and methods of analysis, as well as resources for instructors and students on teaching and learning with data.
Search engine tools. Dataset search tools are becoming more common. Some index only open data, while others include a mix of open datasets and datasets behind paywalls. We recommend that you begin with more comprehensive search tools and strategies before using more general search engines.
- DataSearch by Elsevier is a research data search engine that indexes mostly open data repositories.
- Google Dataset Search allows users to search for datasets available online through a simple keyword search.
- Google Public Data Explorer provides public data and forecasts from a range of international organizations and academic institutions alongside data visualization tools.
Institutional repositories. Many academic institutions have repositories for data and research outputs such as papers, presentations, or theses that are typically findable on the open web.
- MINDS@UW is an open access repository designed to gather, distribute, and preserve digital materials related to the University of Wisconsin’s research and instructional mission. If you are interested in adding your dataset to MINDS@UW, contact us.
With researchers or research groups. These datasets are the hardest to locate and gain access to. The best route may be to contact the authors of relevant publications.
Datasets at UW-Madison. Below is a list of some of the data portals, repositories, and discipline specific resources you can find at UW-Madison. Remember to be mindful of data licenses. Please contact us to have your UW-Madison repository listed.
- The Environmental Data Initiative is an NSF-funded project at UW-Madison that accelerates the curation and archiving of environmental data, emphasizing data from projects funded by the NSF DEB. In addition to a data portal for ecological and environmental data packages, it offers links to resources for data publishing and data management, and a YouTube channel with presentations.
- The Neotoma Paleoecology Database is a collaborative community repository for paleoecological data developed by a consortium of PIs (including UW Madison’s Jack Williams) and institutions around the world. It also includes educational activities for students and resources such as a code repository, R packages, and links to third party apps and visualizations.
- The Biological Magnetic Resonance Data Bank is a repository of data from NMR Spectroscopy on proteins, peptides, nucleic acids, and other biomolecules.
- The IceCube Neutrino Observatory, led by the University of Wisconsin Madison and spanning a collaboration across 52 institutions and 12 countries, gathers data from a detector at the South Pole, with the goal of exploring the nature of dark matter and the neutrino and observing cosmic rays. Their datasets are publicly accessible and can be downloaded directly from the site.
- Silvis Lab at the University of Wisconsin Madison has a variety of maps and data used in spatial analysis for conservation and sustainability.
When reusing data for research you may need to be prepared for:
Data cleanup. Some datasets are poorly organized, or only available in difficult-to-reuse forms.
Data interpretation difficulties. Some datasets lack data dictionaries and other necessary documentation. You may have to work with the data creator to understand the dataset. To learn how to make your own data more accessible, check out our page on data documentation.
Data disappearance. Like any other website, online datasets can and do disappear without warning. Data formally deposited in data repositories tend to live longest. Always try to cite data by persistent identifiers like DOIs or handles.
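For readers who manage citations in a script, here is a minimal sketch of how a DOI or handle can be turned into a stable link using the public resolvers at doi.org and hdl.handle.net. The function name and the sample identifiers are hypothetical placeholders, and the rule that an identifier starting with "10." is a DOI is a simplifying assumption for illustration.

```python
DOI_RESOLVER = "https://doi.org/"
HANDLE_RESOLVER = "https://hdl.handle.net/"


def resolver_url(identifier: str) -> str:
    """Return a resolver link for a DOI or handle (hypothetical helper)."""
    ident = identifier.strip()
    lowered = ident.lower()
    # Drop an optional "doi:" or "hdl:" scheme label.
    if lowered.startswith("doi:") or lowered.startswith("hdl:"):
        ident = ident[4:].strip()
    # Both DOIs and handles have the shape prefix/suffix.
    if "/" not in ident:
        raise ValueError(f"not a DOI or handle: {identifier!r}")
    # Assumption: identifiers whose prefix starts with "10." are DOIs.
    if ident.startswith("10."):
        return DOI_RESOLVER + ident
    return HANDLE_RESOLVER + ident


# Placeholder identifiers, not real datasets.
print(resolver_url("doi:10.1234/example-dataset"))
print(resolver_url("12345/67890"))
```

Citing the resolver link rather than the web address shown in your browser means the citation should keep working even if the repository reorganizes its site.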
Cite datasets for the same reasons you cite books and journal articles: for dataset creators to receive appropriate credit for their work, and to make clear the antecedents to your research.
Data citation standards are still emerging in many disciplines, but you can cite data much like you would cite any other work. Some disciplinary style manuals, professional organizations, or journals and repositories will have guidance on preferred data citation formats, while others may not.
ICPSR suggests the minimum elements of data citation as:
Persistent identifier (such as the Digital Object Identifier, Uniform Resource Name URN, or
Persistent identifiers are preferred, but if not available a URL will suffice.
Other information on data citation: