Reasons to share your data
Data sharing encourages more collaboration between researchers and disciplines, cuts down on duplication of already existing research, enables reproducibility, and benefits the public. The following are some other reasons to share your data:
To raise interest in publications. One recent study found a 25% increase in citations for articles whose associated data were available online.
To speed research. Particularly in complex fields, data sharing can accelerate discovery, as researchers studying Alzheimer’s disease and UW-Madison researchers studying the Zika virus have found.
To establish priority. Data shared in a repository online can be time stamped to establish the date they were produced, blocking “scooping” tactics.
To fulfill funder and journal requirements. Grant funders and (in some disciplines) journals may require data sharing. If you have questions about data sharing or about writing data management plans for grants, contact us.
Considerations before sharing
Before sharing your data, it is important to know the expectations, standards, policies, and laws that affect your data. The following are a few key questions to consider before you share. For more information, please see our Responsible Data Planning, Use, and Sharing micro-course.
Restrictions. Does your data contain confidential, sensitive, or private personal information? If you anonymize, can individuals in the dataset be reidentified? Are there any legal or intellectual property-related restrictions?
Documentation. Are your datasets understandable to those who wish to use them? Have you included all the metadata, methodology descriptions, codebooks, data dictionaries, and other descriptive material that someone looking at the dataset for the first time would need?
Standards. Do your datasets comply with description, format, metadata, and sharing standards in your field?
Licensing. What reuse policies do you wish for your data? Consider the Open Knowledge Foundation’s definition of open data carefully before you attach reuse restrictions. For more information on data and licensing, see Introduction to intellectual property rights in data management from Cornell University.
The FAIR Principles provide guidelines for preparing and sharing your data to improve its reusability and machine-actionability.
These four guiding principles focus on steps for making your data Findable, Accessible, Interoperable, and Reusable. The guidelines include further actions detailing specific best practices that can help you better describe and license your data and guide your choice of an appropriate repository. Learn more about making your data FAIR and the history of the principles.
Choosing a repository
Relying on your own, your lab’s, your department’s, or even some campus-wide IT resources or services for long-term archiving can be risky. Unless the service offers an explicit commitment to long-term preservation of content, your data is liable to disappear if you change institutions or retire, funding stops, or technology policies or platforms change. It is for this reason that we recommend choosing a repository with clear policies as a long-term solution.
Choosing the right repository to share your data will depend on what is appropriate for your research. The following are some key questions to keep in mind when considering whether a repository makes your data FAIR and fits your needs:
Specialization. Are there any repositories widely used by other researchers in your discipline? Is there a repository designated by your funding agency or publisher?
Longevity. What do their preservation policies say about the level of preservation they provide? How long do they commit to preserving your data? What organizations or institutions provide management for the repository? Is there a preservation fund?
Findable/Accessible. How will the repository help others find your data? What metadata does the repository allow or require you to use? (This is good to know at the beginning of your project to ensure you are collecting and documenting the appropriate metadata.) Is it open and standardized? Can the repository provide a DOI for your data? (A DOI is a digital object identifier that is unique, doesn’t change, and can be used to access the object.) Does their infrastructure enable long-term reuse?
Interoperability/Reuse. How well do the repository’s policies and requirements fit your needs? Do their data embargo, access, and licensing options fit your funder or publisher requirements? What are the final file formats you are producing and can the repository support them?
Places to share
Disciplinary repository. If there are well-known disciplinary data repositories in your field, we recommend prioritizing them as options for data sharing, as this will increase your data’s discoverability. Examples include ICPSR and the National Snow and Ice Data Center. To browse data repository options for various disciplines, check out DataCite’s Re3Data.org, the Registry of Research Data Repositories.
When researchers cannot find a repository specific to their discipline, an institutional repository or generalist repository may be the best option. Please check repository pages for up-to-date features and information or contact us with any questions.
Institutional repository. A campus repository for scholarly outputs from the university with a commitment to long-term preservation. At UW-Madison, we have MINDS@UW for data from any UW-Madison affiliated researcher and the Data and Information Services Center’s Online Data Archive for social science datasets.
- MINDS@UW is an open access repository free to anyone affiliated with the University of Wisconsin. It is designed to store, index, distribute, and preserve the scholarly materials of the University of Wisconsin. Content, which is deposited directly by UW faculty and staff, may include research papers, pre-prints, datasets, photographs, videos, theses, conference papers, or other intellectual property in digital form.
Generalist repositories. Generalist repositories accept data regardless of disciplinary focus or data type. Dryad, Zenodo, and OSF (Open Science Framework) are good options, as they accept submissions from any discipline and have invested in infrastructure and preservation plans.
- Dryad provides a subject-agnostic, general-purpose home for a wide diversity of data types. It is a non-profit that operates on a cost-recovery model, so it does not rely on outside funding. Curators check the validity of submitted data and apply robust metadata to enhance discoverability. Data versioning is supported, and you can request that your data be embargoed until it needs to be released. Items are retained indefinitely, and Dryad performs integrity checks and checksums to ensure files remain uncorrupted.
- Zenodo is a discipline-agnostic repository that accepts all types of “research artifacts,” including data at any stage of the research lifecycle. Originally funded by the EU, it is open to everyone. It will accept any file format; however, we recommend using sustainable formats. Files can be versioned and may be deposited under closed, open, or embargoed access. Built on CERN infrastructure, Zenodo promises to retain deposits for the “lifetime of the repository” (currently defined as the lifetime of CERN, about 20 years) and will “make best efforts” to migrate content to other repositories should Zenodo close.
- OSF (Open Science Framework) is a free and open source tool built by the nonprofit Center for Open Science, which receives both federal and private funding. It can be used as a public or private collaborative research workspace as well as a public dissemination and archival tool. It has many third-party integrations, such as GitHub, figshare, Box, and Drive.
- For more information, see their FAQ page.
- Harvard Dataverse is a collaboration of Harvard University IT and the Institute for Quantitative Social Sciences. It is available to researchers and data collectors worldwide from all disciplines. In addition to sharing data in the Harvard Dataverse repository, researchers also have the option to create a Dataverse (virtual archive) on their own website, which will be served up and made discoverable by the Dataverse repository. Harvard Dataverse supports version control; however, once published, a dataset cannot be unpublished.
- Figshare is a discipline-agnostic broad repository that accepts many types of academic research outputs, including data, posters, presentations, theses, media, etc. It also allows for the creation of collections and collaborative spaces, supports version control, accepts any file format, and integrates with other tools like GitHub. Individual accounts are free. Figshare also offers services for publishers and institutions.
Journals. You can also share your data as supplementary materials or a “data publication” in an appropriate journal. Check with journals about their data policies.
Sharing the code for your data cleaning, analyses, or figures alongside your data and other supporting materials is critical to ensuring the reproducibility of your research. It may also be required by your funder or publisher. The Five Recommendations for applying the FAIR principles to software include using a publicly accessible repository with version control, adding a license, registering your code in a community registry, enabling citation, and using a software quality checklist. We expand upon these suggestions and add a few additional points to consider when sharing code:
Best practices. When sharing code, document the information necessary for others to use, cite, and reproduce your work. Commenting your code is also important so that others can understand what it does. Additionally, using open source tools and accessible formats like plain text files helps ensure the long-term usability of your code. For more information on making your code reproducible and reusable, see the Planning for Software Reproducibility and Reuse guide from Johns Hopkins.
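As a sketch of what these documentation habits can look like in practice, the short script below is entirely hypothetical (the file name, function, and analysis are placeholders), but it illustrates the pattern: a plain-text file, a header stating purpose and requirements, and comments that explain decisions a reuser would otherwise have to guess:

```python
"""clean_temps.py - a hypothetical example of a well-documented analysis script.

The analysis itself is illustrative; the point is the documentation pattern:
state the purpose, the inputs and outputs, and any decisions that silently
change the data.

Requires: Python 3.8+ (standard library only).
"""


def celsius_to_kelvin(readings):
    """Convert a list of Celsius readings to Kelvin, rounded to 2 places.

    Readings below absolute zero (-273.15 C) are treated as sensor errors
    and dropped rather than raised. Documenting this choice tells reusers
    that the output may contain fewer values than the input.
    """
    return [round(c + 273.15, 2) for c in readings if c >= -273.15]


if __name__ == "__main__":
    # A reuser can run the script directly to see its expected behavior.
    print(celsius_to_kelvin([20.0, -300.0, 0.0]))
```

Here the invalid reading is filtered out, and the docstring is what makes that behavior discoverable without reading the implementation.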
Standards and APIs. When possible, use open and widely-adopted standards and APIs for structuring both your code and your data. Include references to standards and APIs used for representing your data in your documentation, or even better, include a copy of the documents describing the standards (RFCs, ISOs, etc.) along with your code, as standards are also subject to change and disappearance over time. If you create your own data representations and structures, document them! If you create your own API, document it! If someone can understand your data, and the methods used to manipulate the data as described in an API, then they can understand, and even reproduce, your code.
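To make the "document your data representations" advice concrete, here is a minimal sketch in Python. Every name in it (SurveyRecord, from_row, the field names) is a hypothetical illustration; the point is that the layout, units, and the standard used for dates (ISO 8601) are written down next to the code that parses the data:

```python
"""A minimal sketch of a self-documenting data representation."""
from dataclasses import dataclass
from datetime import date


@dataclass
class SurveyRecord:
    """One observation from a (hypothetical) field survey.

    Fields follow the project codebook:
      site_id:     short alphanumeric site code, e.g. "WI-04"
      observed_on: observation date, serialized as ISO 8601 (YYYY-MM-DD)
      depth_cm:    sampling depth in centimeters (unit stated explicitly)
    """
    site_id: str
    observed_on: date
    depth_cm: float

    @classmethod
    def from_row(cls, row):
        """Parse one row of the shared CSV; this documents the column order."""
        site_id, iso_date, depth = row
        return cls(site_id, date.fromisoformat(iso_date), float(depth))


record = SurveyRecord.from_row(["WI-04", "2023-06-01", "12.5"])
```

Someone encountering the dataset for the first time can read the class and know the column order, the units, and which standard governs the date format, without reverse-engineering the files.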
Licensing and intellectual property. Attaching a license to your code lets others know how they can use it. If you are interested in using an open source license for your work, you can use this tool from GitHub to select one that might work best for you. Prior to sharing your code, make sure that you understand applicable UW-Madison intellectual property rights and invention disclosure regarding computer software. Contact a WARF intellectual property manager for more information.
Where to share. Much like with sharing data, your research field, funders, or journals may have preferred places for you to share your code. Publishers or other institutions may have their own platforms for sharing code, such as CodeOcean, GitLab, or BitBucket. One of the most commonly used code repositories is GitHub, which is a software development platform with version control. You may also be able to share your code alongside your data in repositories like Zenodo, OSF, and Figshare.
Preservation/Citability. Note that although the word “repository” is used for code repositories, it does not mean the same thing as it does for data and preprint repositories, which have expressed commitments to long-term preservation. Code repositories focus on providing access and version control for active projects, whereas archival repositories are concerned with long-term preservation. Neither type is a substitute for the other, and you will likely need both. Be aware of the preservation limitations of individual code repositories, and make sure you understand their user agreements or preservation plans. Combining sharing in a code repository with sharing in an archival repository improves long-term citability. GitHub has partnered with Zenodo to archive GitHub code repositories and assign DOIs to make them citable.
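One lightweight way to make archived code citable is to include a CITATION.cff file in the repository, which GitHub recognizes and displays as citation guidance. The sketch below uses placeholder values throughout (title, author, version, date, and especially the DOI, which you would replace with the identifier minted for your deposit):

```yaml
# CITATION.cff - all values below are placeholders for illustration.
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
title: "Example Analysis Code"
authors:
  - family-names: Doe
    given-names: Jane
version: 1.0.0
doi: 10.5281/zenodo.0000000
date-released: 2024-01-15
```

Pairing a file like this with a Zenodo-archived release gives reusers both a stable identifier and ready-made citation metadata.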
Computational environments. Saving elements of the computational environment makes it easier to run code in the proper environment and makes your code more portable. Containers like Docker package code together with dependencies such as system tools, system libraries, and settings so that software can run quickly in other computing environments. Virtual machines like VirtualBox or Vagrant emulate hardware and package code, a full copy of the operating system, and the necessary libraries and other dependencies. ReproZip is a free, open source tool that packages experiments so they can be reproduced in a variety of ways, including with containers and virtual machines. Note that containers and virtual machines are not a substitute for preserving code and documentation; they should only be used in addition to preserving more durable artifacts outside of the container or virtual machine. For more information or if you have questions, please contact RDS.
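As an illustration of capturing an environment with a container, the Dockerfile below is a minimal sketch: the script name, codebook file, and base-image version are placeholders, and a real project would adapt all of them. The useful habit it shows is pinning versions so the environment your code actually ran in is recorded:

```dockerfile
# A minimal, illustrative Dockerfile; file names and versions are placeholders.
# Pinning the base image records the language runtime your analysis used.
FROM python:3.11-slim

WORKDIR /app

# requirements.txt should pin exact package versions (e.g. numpy==1.26.4)
# so the dependency set is reproducible, not just "whatever was current".
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Include documentation alongside the code, not just the code itself.
COPY run_analysis.py codebook.txt ./

CMD ["python", "run_analysis.py"]
```

Remember that an image like this complements, rather than replaces, archiving the plain-text code and documentation it contains.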