Spreadsheets are software tools for analyzing and modeling mathematical relationships, and producing charts and graphs from numeric data.
Many research tools and instruments output data in spreadsheet formats. Because they are relatively simple to set up and use, researchers in many disciplines use spreadsheets to store and structure numeric and text data. However, several well-publicized situations in which spreadsheet errors led to problems with the reproducibility of published data highlight the importance of following some basic guidelines when using programs like Excel to enter, store, and analyze research data.
Following good data management practices is crucial to ensuring that your data or analysis can be interpreted and reproduced, both by you and by other parties. To that end, we recommend that you:
- Save a copy of each original spreadsheet file in its original state before you make any alterations or transformations. Label these as RAW data files. This ensures that any analysis or formatting steps do not destroy your original data source.
- Optimize the machine-readability of your spreadsheet file:
- Use consistent terms, spelling, and formats for variable names (column headings) and data values.
- Avoid spaces and non-alphanumeric characters in variable names and data values.
- Don’t use color to indicate data values.
- For data files, don’t leave cells, rows, or columns blank (use a character to indicate blank/null/invalid values).
- Use one worksheet per data set.
- Create a separate human-readable document (such as a .txt file) that defines variables and data values and records formulas and specific steps used to analyze the spreadsheet data. Recording this information directly in the spreadsheet is not recommended. Plug-ins for Excel (e.g. Colectica) can be used to create separate documentation based on the contents of your spreadsheet.
- Become proficient with spreadsheet software. Take a tutorial or course. The more you know about the advantages and limitations of specific spreadsheet software, the better you can avoid common pitfalls that can result in errors.
- Spreadsheet software can undergo frequent changes as new versions are released. To protect the longevity of your spreadsheet data, export and archive versions in sustainable formats, such as CSV. However, be aware these formats “flatten” spreadsheet data, removing tabs and formulas, so ensure you export single worksheets and document formulas in a separate file, as described above.
- If you are using Excel to analyze the data, consider moving to a statistical program such as SAS, SPSS, Stata, or R instead. In addition to providing wider array of statistical procedures, these programs are designed to document all scripts, codes, and libraries used in the analysis making it easier to record and reproduce transformations and analysis.
Learn More About Using Spreadsheets
- Using Excel for Data Entry in Research from the Social Sciences Computing Cooperative.
- Learn how to create Excel data files for subsequent analysis in statistical packages (Stata, SAS, SPSS, R, etc.)
- Statistical software classes at the Social Sciences Computing Cooperative
- Excel classes
- “Teaching Students to Document Their Empirical Research” (2011) Ball, RJ and Medeiros, N. Social Science Research Network.
Best practices for documenting social sciences research that involves computer code, statistical programs like R, and saving unaltered spreadsheets as raw data. Their recommendations are applicable to many other disciplines, as well.
- “Blasted Cell Line Names” (2010) Coombes et al. Cancer Informatics 9, 251-255.
Sample names used inconsistently in spreadsheets (white space or adding prefix or suffix) can lead to mismatches when algorithms are later used to transform data.
- “Researchers Finally Replicated Reinhart-Rogoff, and There Are Serious Problems” (2013).
Excel errors (incomplete range of cells for an applied formula) omitted 5 countries from a key economic indicator in a highly influential paper about economic growth in countries under debt.