Spreadsheets and Databases as Tools for Data Management

Is the spreadsheet you’re using for research data starting to get unwieldy? 

For research projects of any size, it’s important to think about the tools that will best suit your data collection, recording, and analysis needs now and into the future. Spreadsheets and databases are two of the most versatile and widely used tools for managing research data. However, to ensure the long-term integrity of your data, there are many scenarios in which the limitations of spreadsheets mean it may be time to transition your research data to a database. 

Spreadsheets:

Spreadsheets, whether in proprietary software like Microsoft Excel or open-source applications like OpenOffice, are some of the most widely used data management tools. This is not by accident: spreadsheets are an easy way to get started storing both qualitative and quantitative data. For larger research projects, or those with highly specialized data types and vocabularies, however, spreadsheets have limitations that can jeopardize data integrity and introduce errors into your research process. 

Benefits of spreadsheets: 

  • They are easy to learn, and many people are already familiar with their core functions 
  • They provide flexible and customizable features for organizing and analyzing data
  • Many other applications and coding languages work seamlessly with data stored in spreadsheets
  • Data in spreadsheets can be easily converted to CSV or other tabular formats that are well suited for long-term storage and sharing
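As a quick sketch of that portability, Python’s standard-library csv module can round-trip tabular records without any spreadsheet software at all; the specimen rows below are hypothetical placeholders, not data from any real project.

```python
import csv
import io

# Hypothetical rows as they might be exported from a spreadsheet
rows = [
    {"specimen": "A-01", "count": "14", "date": "2014-11-02"},
    {"specimen": "A-02", "count": "9",  "date": "2014-11-03"},
]

# Write the records out as CSV text
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["specimen", "count", "date"])
writer.writeheader()
writer.writerows(rows)

# Reading the CSV back yields the same records, usable by any CSV-aware tool
restored = list(csv.DictReader(io.StringIO(buf.getvalue())))
print(restored[0]["specimen"])  # A-01
```

Because CSV is plain text, files like this remain readable decades from now, independent of any one vendor’s software.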

Drawbacks of spreadsheets:

  • Limited control over data integrity and format
  • Some proprietary spreadsheet software has limits on the amount of data it can house (Excel, for example, caps each worksheet at 1,048,576 rows)
  • Most importantly, it’s easy for humans to introduce errors into data through mistakes like accidentally overwriting cells or recording data in non-normalized structures (e.g. Excel famously mangles dates, a problem well documented by data librarians)

For more information, consider taking a look at the Data Carpentry lesson on using spreadsheets for data management.

Databases: 

Databases are among the most powerful and ubiquitous tools for managing data of all kinds because of their dynamic ability to organize, query, and share the data they contain. Both proprietary and open-source database systems can serve as effective solutions for managing your research data. In addition, data that is already stored in a spreadsheet can, in many cases, be migrated into a database system.

Benefits of databases: 

  • Databases provide highly customizable controls over the formats, vocabularies, and data types that govern how data is entered and stored
  • They make it easy to enforce quality control by requiring specific formats or values for fields and by limiting access to tables
  • Many database management systems (DBMSs), such as MySQL Workbench or Microsoft Access, support the design and implementation of databases without extensive coding know-how
  • Databases facilitate complex relationships between data elements and support sophisticated queries 
  • Command-line controls make it possible to customize database structure in detail 
  • Databases often include customizable interfaces that provide access to data as well as forms for entering new data
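As a minimal sketch of that quality control, here is SQLite (bundled with Python) rejecting a malformed date that a spreadsheet would silently accept; the table and column names are hypothetical examples, not a schema from any real lab.

```python
import sqlite3

# An in-memory database with CHECK constraints on format and value
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE experiment (
        experiment_id TEXT PRIMARY KEY,
        start_date    TEXT NOT NULL CHECK (start_date LIKE '____-__-__'),
        replicate     INTEGER NOT NULL CHECK (replicate >= 1)
    )
""")

# A well-formed row is accepted...
conn.execute("INSERT INTO experiment VALUES ('O_1411_02_R1', '2014-11-05', 1)")

# ...while a row whose date breaks the YYYY-MM-DD pattern is rejected outright
try:
    conn.execute("INSERT INTO experiment VALUES ('O_1411_03_R1', '11/05/2014', 1)")
except sqlite3.IntegrityError as err:
    print("rejected:", err)
```

The key design point: the rule lives in the database itself, so every user and every application entering data is held to the same format, which no spreadsheet can guarantee.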

Drawbacks of databases: 

  • Learning database languages such as SQL can be labor- and time-intensive
  • Designing a database often requires extensive up-front planning and may slow the research process down
  • Data stored in proprietary database software can be difficult to convert to other formats

Where to get help: If you are interested in further exploring database solutions, you can access database training courses on LinkedIn Learning (formerly Lynda.com), or you can consult with Teaching and Research Application Development (TRAD) about your options.

 

Building a Practical DM Foundation

By Elliott Shuppy, Master’s Candidate, School of Library and Information Studies

In addition to being an active research lab on the UW-Madison campus, the Laboratory for Optical and Computational Imaging (LOCI) develops many experimental instrumentation techniques and writes software to support them. One major database platform development is OMERO, which stands for Open Microscopy Environment Remote Object. OMERO is an open, consortium-driven software package with the capabilities to view, organize, share, and analyze image data. One hiccup is that it’s not widely used at LOCI.

Having identified this problem, my mentor Kevin Eliceiri, LOCI director, and I thought it would be a good idea for me to develop expertise in this software as a project for ZOO 699 and figure out how to incorporate it into a researcher’s workflow at LOCI. On-site researcher Jayne Squirrell was the ideal candidate: she is a highly organized researcher working in the lab, providing us an excellent use case. Before we could insert OMERO into her workflow, we had to lay some formal foundational management practices, which will carry over into her use of OMERO.

We identified four immediate needs:

  • A simple and consistent folder structure
  • A way to identify all files associated with an experiment
  • An ID system that can be used in the OMERO database
  • Documentation

We then developed solutions to meet each need. The first solution was a formalized folder structure, which we chose to organize by Jayne’s workload:

Lab\Year (YYYY)\Project\Sub-project\Experiment\Replicates\Files

This folder structure will help organize and regularize naming of files and data sets not only locally and on the backup server, but also within the OMERO platform.
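As an illustration (not LOCI’s actual tooling), a convention like the one above can be generated programmatically so folder names stay consistent across local storage, the backup server, and OMERO; every name passed in below is a hypothetical placeholder.

```python
from pathlib import PureWindowsPath

def experiment_folder(lab, year, project, subproject, experiment, replicate):
    """Build a path following Lab\\Year\\Project\\Sub-project\\Experiment\\Replicate."""
    return (PureWindowsPath(lab) / f"{year:04d}" / project / subproject
            / experiment / replicate)

# Hypothetical example values; only the pattern matters
path = experiment_folder("Ogle", 2014, "WoundHealing", "Timelapse", "O_1411_02", "R1")
print(path)  # Ogle\2014\WoundHealing\Timelapse\O_1411_02\R1
```

Building paths from one function, rather than typing them by hand, is one simple way to keep the naming regular that the folder convention depends on.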

In order to identify all files associated with a particular experiment, we developed a unique identifier that we termed the Experiment ID. This identifier leads each file name and consists of the following values: the initial of the collaborating lab (O or H) and a numerical sequence based on the current year, month, the experiment’s series number within that month, and the replicate.

Example: O_1411_02_R1

The example reads Ogle lab, 2014, November, second experiment (within the month of November), replicate one. Incorporating this ID into file names will help to identify and recall data sets of a particular experiment and any related files such as processed images and analyses.
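A short sketch of how such an ID could be built and parsed in Python; the two-digit year/month/series layout is inferred from the O_1411_02_R1 example, and the helper names are my own, not part of LOCI’s workflow.

```python
import re

def make_experiment_id(lab_initial, year, month, series, replicate):
    """Build an ID like O_1411_02_R1: lab initial, YYMM, series number, replicate."""
    return f"{lab_initial}_{year % 100:02d}{month:02d}_{series:02d}_R{replicate}"

def parse_experiment_id(exp_id):
    """Split an Experiment ID back into its parts; returns None if malformed."""
    m = re.fullmatch(r"([OH])_(\d{2})(\d{2})_(\d{2})_R(\d+)", exp_id)
    if not m:
        return None
    lab, yy, mm, series, rep = m.groups()
    return {"lab": lab, "year": 2000 + int(yy), "month": int(mm),
            "series": int(series), "replicate": int(rep)}

print(make_experiment_id("O", 2014, 11, 2, 1))  # O_1411_02_R1
```

Having a parser as well as a builder means a script can later sweep a whole drive and flag any file whose name does not match the convention.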

Further, both the file organization and the Experiment ID can aid organization and identification within OMERO. The database platform has two levels of nesting: a folder is the top tier; within each folder, datasets can be nested; and each dataset contains a number of images. So we can adapt the folder-structure naming to organize folders and datasets and apply the unique identifier to name uploaded image objects. These upgrades make searching more robust and similar in process to local drive searches.

Lastly, we developed documentation for reference. We realized that Experiment IDs need to be accessible at the prep bench and the microscope, so we created a mobile-accessible spreadsheet containing information on each experiment. We termed this document the Experimental Worksheet; it contains the following information:

  • Experiment ID
  • Experiment Description
  • Experiment Start Date
  • Project Name
  • Sub-project Name
  • Notes

This document will act as a quick reference of bare-bones experiment information for Jayne and her student workers. We also realized that Jayne’s student workers need to know what the processes are in each step of her workflow, so we developed step-by-step procedures and policies for each phase. These procedural and policy documents set management expectations and conduct for Jayne’s data. Now, with such a data management foundation laid, the next step is to address our root problem: discerning how Jayne can best benefit from using OMERO and where it makes sense in her workflow.