Metadata - Research Data Services

Metadata is the documentation of your data and of the related tools and processes, and it tracks what you have done throughout the research project. This information provides context for collaborators during the project and for others when you publish or share your datasets. You can think of metadata as the basic information someone needs to understand and interpret your results. Metadata can come in a variety of forms, including:

Identifiers: Information that links elements of your research together and helps with citability, such as your ORCID iD, the DOI for your dataset or the associated paper, and your grant ID.
Accessibility: Information governing the access and reuse of your data, such as a license, along with information about the types of tools needed to access your data.
readme.txt files: Descriptive text files with basic information about your project and methodology.
Data dictionary: A file that explains the contents of your dataset.
Codebook: Provides information about data from a survey instrument.
Metadata standards: There are a variety of standards for what you need to document about your data and what formats to use. Some standards are field specific and some are more general.
Metadata from tools and repositories: Some tools and repositories collect and generate their own metadata.

Readme Files

Readme files are an important form of documentation. Creating and using them during your research project can help you track your work as you go. This will facilitate collaboration with others and make things easier when you publish/share your data.

When you include a readme file with data you have shared in a repository, it will allow others to more easily understand your data and to potentially reuse it, depending on licenses or permissions.

We suggest using Cornell’s “Guide to writing ‘readme’ style metadata” as a template you can adapt to your own needs.

Data Dictionaries

A data dictionary is a file that documents and describes the various elements of your dataset. It can be the key to understanding your data and what it shows. For example, if you are collecting tabular data, a data dictionary would list all of the fields in the table, what they represent, and measurement information. Using a data dictionary can keep data collection consistent across a project by defining key elements such as labels, units, and constraints.

If your data includes code, a data dictionary would provide information about how the code relates to the dataset and any technical requirements. A data dictionary can make it easier for software to process a data file by providing information to the software such as column names, type of data in each column, specifications, use of nulls, etc.
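As a minimal sketch of how software might use a data dictionary, the Python example below defines a small dictionary for a tabular dataset and checks rows against it. The column names, units, constraints, and function names are all hypothetical and invented for illustration. A real project would substitute its own fields.

```python
import csv
import io

# Hypothetical data dictionary: one entry per column in the table.
DATA_DICTIONARY = {
    "site_id": {"type": str, "description": "Sampling site code",
                "required": True},
    "temp_c": {"type": float, "description": "Water temperature",
               "units": "degrees Celsius", "min": -5.0, "max": 40.0,
               "required": False},
    "count": {"type": int, "description": "Number of organisms observed",
              "units": "individuals", "min": 0, "required": True},
}


def validate_row(row, dictionary=DATA_DICTIONARY):
    """Return a list of problems found in one row (empty if none)."""
    problems = []
    # Flag columns that the data dictionary does not describe.
    for name in row:
        if name not in dictionary:
            problems.append(f"{name}: column not in data dictionary")
    for name, spec in dictionary.items():
        raw = row.get(name)
        # Treat an absent or empty cell as a null value.
        if raw is None or raw == "":
            if spec.get("required"):
                problems.append(f"{name}: missing required value")
            continue
        try:
            value = spec["type"](raw)
        except ValueError:
            problems.append(
                f"{name}: cannot read {raw!r} as {spec['type'].__name__}")
            continue
        if "min" in spec and value < spec["min"]:
            problems.append(f"{name}: {value} is below {spec['min']}")
        if "max" in spec and value > spec["max"]:
            problems.append(f"{name}: {value} is above {spec['max']}")
    return problems


def validate_csv(text):
    """Map each problematic line number in CSV text to its problems."""
    report = {}
    reader = csv.DictReader(io.StringIO(text))
    # Data rows start on line 2 because line 1 holds the header.
    for line_no, row in enumerate(reader, start=2):
        problems = validate_row(row)
        if problems:
            report[line_no] = problems
    return report
```

Calling `validate_csv` on the text of a data file returns only the lines that break the rules in the dictionary, which is one way to keep data collection consistent across collaborators.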

For more information on creating a data dictionary, we recommend Open Science Framework’s “How to Make a Data Dictionary” and Smithsonian Libraries’ “Describing Your Data: Data Dictionaries.”

Codebooks

Survey researchers use codebooks to provide information about the data from their survey instrument. Codebooks share information such as the response codes for survey responses, variable names, and other details.

While some tools may provide you with a codebook, with other tools you may have to either create one yourself or add to what the tool generates for you. Both codebooks and data dictionaries facilitate integration of datasets from different sources.
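To illustrate how the response codes in a codebook are applied, the sketch below stores a small codebook as a Python dictionary and uses it to translate coded survey responses into readable labels. The variable names, codes, labels, and the use of -9 for "no answer" are hypothetical. Your survey tool may structure its codebook differently.

```python
# Hypothetical codebook for two survey variables.
CODEBOOK = {
    "q1_satisfaction": {
        "label": "Overall satisfaction with the service",
        "codes": {1: "Very dissatisfied", 2: "Dissatisfied", 3: "Neutral",
                  4: "Satisfied", 5: "Very satisfied"},
        "missing": {-9: "No answer"},
    },
    "q2_student": {
        "label": "Respondent is currently a student",
        "codes": {0: "No", 1: "Yes"},
        "missing": {-9: "No answer"},
    },
}


def decode_response(variable, code, codebook=CODEBOOK):
    """Translate one response code into its label.

    Codes documented as missing values are returned as None.
    """
    entry = codebook[variable]
    if code in entry["missing"]:
        return None
    if code not in entry["codes"]:
        raise ValueError(f"{code!r} is not a documented code for {variable}")
    return entry["codes"][code]


def decode_record(record, codebook=CODEBOOK):
    """Decode every coded variable in one survey record."""
    return {variable: decode_response(variable, code, codebook)
            for variable, code in record.items()}
```

For example, `decode_record({"q1_satisfaction": 4, "q2_student": -9})` returns `{"q1_satisfaction": "Satisfied", "q2_student": None}`. An undocumented code raises an error instead of passing silently into the analysis.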

We recommend ICPSR’s “Guide to Codebooks” as a starting point for learning more about codebooks.

Metadata Standards

Figure: A sample of the Ecological Metadata Language (EML) standard.

Metadata standards specify what pieces of information are included and how they are expressed in digital files. Some are generic enough to be useful across a wide array of disciplines, while others are highly specific to disciplinary areas. You may select a metadata standard based on the discipline that you’re working in, or the type of data that you’re working with.

We cannot provide a comprehensive list here. Instead, we include examples in broad disciplinary areas, plus a “general” category. Where possible, we selected examples that appear to have broad adoption within or across disciplinary areas.

Disciplinary area	Metadata standard	Description
General	Dublin Core	Widely used in disciplinary and institutional repositories.
	Disciplinary Metadata from the DCC	Searchable list of disciplinary metadata standards and related information. Includes biology, Earth science, physical science, social science & humanities and general research data.
	Altova Schema library	A reference library to common (and uncommon) industry and cross-industry schemas.
Life Sciences	Darwin Core	Designed to facilitate the sharing of information about biological diversity. It is primarily based on taxa, their occurrence in nature as documented by observations, specimens, and samples and related information.
Life Sciences	EML (Ecological Metadata Language)	Maintained by the Ecological Society of America. Consists of XML modules that can be used to document ecological datasets.
Humanities	Seeing Standards: A Visualization of the Metadata Universe	Information on 105 cultural heritage metadata standards.
	TEI (Text Encoding Initiative)	A widely used standard for representing textual materials in XML.
	VRA (Visual Resources Association) Core	A metadata standard for works of visual culture and the images that document them.
Social Sciences	DDI (Data Documentation Initiative)	A metadata specification for the social and behavioral sciences, created by the Data Documentation Initiative. It is used to document data through its lifecycle and to enhance dataset interoperability.
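
To make the idea of a metadata standard concrete, the sketch below uses Python's standard library to write a simple record with Dublin Core, the general-purpose standard listed in the table above. The fifteen element names and the namespace URI come from the Dublin Core Metadata Element Set. The `metadata` wrapper element, the function name, and any sample values are assumptions made for illustration. A repository may expect a different wrapper or additional fields.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# The fifteen elements defined by the element set.
DC_ELEMENTS = {
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
}


def dublin_core_record(fields):
    """Serialize a dict of Dublin Core fields to an XML string.

    Each value may be a single string or a list of strings, since
    Dublin Core elements are repeatable.
    """
    # Assumption: a plain <metadata> wrapper; repositories vary here.
    root = ET.Element("metadata")
    for name, values in fields.items():
        if name not in DC_ELEMENTS:
            raise ValueError(f"{name!r} is not a Dublin Core element")
        if isinstance(values, str):
            values = [values]
        for value in values:
            child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
            child.text = value
    return ET.tostring(root, encoding="unicode")
```

Passing a dictionary such as `{"title": "Stream temperature survey", "creator": ["A. Researcher", "B. Collaborator"]}` produces one `dc:title` element and two `dc:creator` elements. A field name outside the standard is rejected, which shows how a standard constrains both what is recorded and how it is expressed.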

Getting help with metadata

A data specialist from one of the following groups may be able to help you find, adapt, and use an appropriate metadata standard.

An informatics specialist or IT consultant in your department.
An RDS consultant.
The subject librarian for your department.
A disciplinary society in your research area.