Data becomes useful when it has meaning and context associated with it. The most common ways to add context are metadata (descriptions and documentation of your data) and supplementary files, such as a data dictionary. Documenting your data is essential when you share it, because documentation lets other researchers understand how to access, view, and possibly reuse your data.
Describing and Documenting Your Data:
- Data dictionaries provide the key information about the data that you will be collecting and explain what the variable names and values in a dataset really mean. Data dictionaries are most commonly used when working with tabular data or creating a database. The OSF provides a tutorial on how to make a data dictionary for tabular data.
- Example: USDA’s National Agricultural Library
- README files are documents in plain text (.txt) or markdown (.md) format that are often used to describe software packages, programming scripts, and datasets, and can also be used for research projects. A README should include information about the creators of the files that it describes, a list of the files included in the set, relevant funder information, and any associated research outputs, such as articles or presentations. It should also include a citation for the dataset, as well as for any byproducts of the research data that was collected and used. For more information about creating a README, see Cornell’s “Guide to Writing ‘README’ Style Metadata.”
- Example: README for a dataset.
- A data paper differs from a research paper in that it is “used to present large or expansive data sets, accompanied by metadata which describes the content, context, quality, and structure of the data” (Ecological Society of America). The Ecological Society of America provides a guide on writing a data paper.
- Example: Scientific Data from Nature is a publisher of data papers
- A codebook provides descriptions and definitions of the variables and values included in a dataset to help users interpret the data for potential replication or reuse. For each variable, a codebook gives the name, a description of what the variable represents, the type, the format that its values should be in, and the range of values, if applicable.
- Example: ICPSR
- Metadata is information that describes and documents your data. The method that you use to describe your data will depend on your discipline, your project and team, and the types, formats, and complexity of the data that you are collecting. At a minimum, the documentation should contain the information required to reuse the data that it describes.
- Following disciplinary metadata schemas gives you the opportunity to describe your project, its data, and other outputs, such as publications. When possible, you should follow a disciplinary standard that is common in your field. Many metadata schemas are specific to a discipline or to the format of the data being collected.
- Natural Science: Darwin Core, Ecological Metadata Language (EML), Biodiversity Information Standards (TDWG)
- Social Science: Data Documentation Initiative (DDI)
- Geospatial: Content Standard for Digital Geospatial Metadata (CSDGM), ISO 19115, FGDC Information About Geospatial Metadata, EPA Geospatial Metadata Technical Specification
- Arts and Humanities: Categories for the Description of Works of Art (CDWA), Dublin Core, Public Broadcasting Core (PBCore), Text Encoding Initiative (TEI), Visual Resources Association (VRA)
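As one illustration of the data dictionary and codebook entries described above (variable name, description, type, and range of values), the following Python sketch drafts a starter dictionary from a small CSV table. It is a minimal sketch under stated assumptions: the function names `infer_type` and `build_data_dictionary`, the sample columns, and the simple type labels are hypothetical and do not come from the OSF tutorial or any metadata standard. The descriptions themselves still have to be written by someone who knows the data.

```python
import csv
import io

# Hypothetical sample table, used only for illustration.
SAMPLE_CSV = """site_id,temp_c,species
1,12.5,oak
2,14.0,maple
3,,oak
"""


def infer_type(values):
    """Guess a simple type label from a column's non-empty string values."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "empty"
    # Try the strictest numeric type first, then fall back.
    for label, cast in (("integer", int), ("float", float)):
        try:
            for v in non_empty:
                cast(v)
            return label
        except ValueError:
            continue
    return "string"


def build_data_dictionary(csv_text, descriptions):
    """Return one entry per column: variable, description, type, values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    entries = []
    for name in reader.fieldnames:
        # Treat missing cells as empty strings.
        values = [row[name] or "" for row in rows]
        non_empty = [v for v in values if v != ""]
        var_type = infer_type(values)
        if var_type in ("integer", "float"):
            numbers = [float(v) for v in non_empty]
            observed = f"{min(numbers):g} to {max(numbers):g}"
        else:
            observed = ", ".join(sorted(set(non_empty)))
        entries.append({
            "variable": name,
            # Placeholder text marks descriptions that still need writing.
            "description": descriptions.get(name, "TODO: describe this variable"),
            "type": var_type,
            "values": observed,
        })
    return entries


if __name__ == "__main__":
    notes = {"site_id": "Identifier of the sampling site"}
    for entry in build_data_dictionary(SAMPLE_CSV, notes):
        print(entry)
```

The observed values are only a starting point: a finished data dictionary or codebook should state the range that is allowed, along with units and codes for missing values, not just what happens to appear in the file.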
The exact details of how you will document your data may not be established until after the grant has been funded, but should be established before the start of data collection.
Writing Prompts:
- How will you describe and document your data?
- Will you be using one or more metadata schemas?
This content was adapted from Iowa State University Library’s Data Management Plan Guide.