Many researchers use spreadsheets to collect, analyze and archive their research data, but spreadsheets are notoriously poor data management tools that are subject to common and costly errors. As part of an ongoing series by RDS that examines the use of spreadsheets in research, the following slide deck (RDS_brownbag_20140313) is from a presentation by RDS personnel advocating for best practices for use of spreadsheets in quantitative research. The end of the presentation contains a demonstration of a recently developed tool that documents Excel spreadsheets according to a metadata standard called the Data Documentation Initiative.
The RDS group advocates for good data management practices at UW-Madison. Data management is a topic that is attracting more and more attention in the era of Big Data. One ubiquitous practice among researchers and analysts in academia and business is the use of spreadsheets for data entry, storage, and analysis. While the use of spreadsheets in research is widespread, there are few guidelines for such use. Indeed, spreadsheets pose some troublesome issues from the perspective of documenting and managing research data. The topic of spreadsheet use in research has gained considerable traction since spring 2013, when a controversial and widely cited academic paper on government debt and growth was shown to be based on a faulty Excel dataset.
Prompted in part by this and other related events, RDS has recently updated its recommendations on using spreadsheets in research data management. Another great resource to consult before deciding to use spreadsheets for your research is a primer on using Excel for data entry assembled by the UW Social Science Computing Cooperative. Some tools that can potentially improve the documentation of spreadsheet data and analysis are Colectica for Excel and Data Up.
A new review of an influential research article on fiscal austerity and GDP finds that the results were tainted in part by an undocumented error in the authors’ Excel dataset. The original research by Carmen Reinhart and Ken Rogoff, titled “Growth in a Time of Debt,” claimed that economic growth slowed quite dramatically for countries whose public debt crossed a threshold of 90% of Gross Domestic Product. Since its publication, this finding has often been cited in stimulus/austerity debates, but many economists were unable to replicate it, in part because of the authors’ reluctance to share their original data.
The authors of the new review were able to obtain the original data and found a number of problems in the analysis, which are well summarized in this blog post. This episode stands as a cautionary tale about proper data management and open access; these issues are finally being recognized as critical to the integrity of science.
The book, which uses R code to illustrate examples, begins with a clear definition of Data Science: Data Science refers to an emerging area of work concerned with the collection, preparation, analysis, visualization, management and preservation of large collections of information.
Documenting research data and processes is becoming highly relevant in the age of Big Data. The Data Documentation Initiative (http://www.ddialliance.org/what) is an effort to standardize how social science metadata are described, thus leading to more efficient discovery and analysis of data.
Recently, a workshop was held in Germany to expand the scope of DDI and make it simpler to use. To those ends, DDI plans to adopt a model-based specification that can be expressed in XML, RDF/OWL technology, relational database schema, and other languages. To broaden its appeal beyond a programmer and software developer audience, it was decided that DDI needs to avoid jargon and use terminology that is familiar to social science researchers and data librarians. Please contact RDS if you are interested in applying DDI to your research project.
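To make the idea of a model-based specification a little more concrete, here is a minimal Python sketch of one abstract description of spreadsheet columns being rendered in two different forms, XML and a relational schema. It is only an illustration under stated assumptions: the `Variable` class, the `to_xml` and `to_sql` helpers, the element names, and the sample columns are all hypothetical and are not taken from the DDI specification or any DDI tool.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass
class Variable:
    """One spreadsheet column, described independently of any file format."""
    name: str
    label: str
    data_type: str  # hypothetical abstract types: "integer", "decimal", "text"


def to_xml(variables):
    # Element and attribute names are illustrative only, not the DDI schema.
    root = ET.Element("variables")
    for v in variables:
        el = ET.SubElement(root, "variable", name=v.name, type=v.data_type)
        el.text = v.label
    return ET.tostring(root, encoding="unicode")


def to_sql(variables, table="dataset"):
    # Map the model's abstract types onto SQL column types.
    sql_types = {"integer": "INTEGER", "decimal": "REAL", "text": "TEXT"}
    cols = ", ".join(f"{v.name} {sql_types[v.data_type]}" for v in variables)
    return f"CREATE TABLE {table} ({cols});"


# Hypothetical columns, invented for this example.
columns = [
    Variable("country", "Country name", "text"),
    Variable("debt_ratio", "Public debt as a share of GDP", "decimal"),
]

print(to_xml(columns))
print(to_sql(columns))
```

The point of the sketch is that the column descriptions are written once, in the model, and each output format is derived from them, so the documentation does not have to be maintained separately for each technology.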