Describe yourself and your role(s) on campus.
I just finished my first year in the master’s program at the School of Library and Information Studies. In addition to my classes, I work on a project studying the sustainability of social science data archives, staff the Steenbock reference desk, and worked at Wendt as they moved most of their physical materials to storage. What I’m doing this summer, working with a lab group on getting their data in order, has been one of my favorite experiences.
Describe your work with the Yin Lab.
One of the methods researchers in the Yin Lab use is fluorescence microscopy. Their experiments generate about 100GB of raw, unprocessed data. Multiple processing steps can then add another 300GB (some of this can be deleted, but it can be scary to delete something that took you hours to create). After a few years of running these experiments, they have an enormous amount of data. Running out of space was what prompted the Yin Lab to seek help, but the real issue is that they don’t have a handle on what data they have and where it is. This is especially a problem because granting agencies and publishers require researchers to be able to respond to questions about their work.
What excites you about working with researchers in an embedded role?
The Yin Lab is part of the Department of Chemical Engineering. I did a PhD in chemical engineering at the University of Minnesota, and working in the Yin Lab has felt like coming home for me. I am enjoying the opportunity to really get my hands on their data issues to figure out what’s going on and what needs to change. It’s a chance to use what I’ve learned about data management, plus a chance to try out some ethnographic research methods!
What are some of the challenges you are facing in retroactively identifying and organizing data?
There are the typical challenges with using other people’s paper lab notebooks: bad handwriting, notes in languages other than English, unclear abbreviations. The real problem is the lack of dates and version tracking on digital files. It’s really hard to tie the information from the lab notebook to the analysis of the experimental data without some link between the computer files and the paper notebook. Even harder is figuring out which data go with which published papers. A paper published in 2015 might use data from various experiments conducted at any time between 2010 and 2015.
What do you think labs should be doing to proactively combat such challenges?
It’s so sad to think of all of the orphaned data left behind when graduate students finish their degrees! I know all of the data from my PhD research is sitting on an external hard drive somewhere in my advisor’s office. In some sense, it’s not a huge deal that my data is essentially lost, because the growth of computing power since I left makes my work pretty outdated. But perhaps if it were easily accessible, it would have been of some use to students after me.
The other thing that lab groups need to be thinking about is the legal requirements for keeping track of their data. Sharing data has been a requirement for at least some NSF divisions since the 1970s, which obviously means data must be stored so that it can be understood by people other than those who created it. Both NIH and NSF now require that all data be made publicly available, and UW-Madison has a policy requiring PIs to keep track of their data, if only in the event of allegations of academic misconduct. Papers aren’t retracted only because of falsified data (something the Yin Lab would never do!); retractions also happen if the authors can’t provide supporting data when their work is questioned.
What are your top three favorite data management suggestions?
- Use standard naming conventions and name everything clearly. Excel files with tabs labeled “Sheet1,” “Sheet2,” and “Sheet3” are not so easy to understand. It should also go without saying that if you’ve got a spreadsheet of data, the columns and rows should be labeled in some way.
- Organize digital data by date of the experiment. This makes it possible to tie the raw image files to the eventual analytical results.
- Keep careful track of what data goes into publications. This could mean using a file to track which versions of which files are behind each figure and table, adding an extension to the file name to indicate that it’s the published version, or both.
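
As a rough illustration of these three suggestions, here is a minimal Python sketch. It is not the Yin Lab's actual workflow: the naming pattern (date, then sample, then processing step), the helper names `standard_name` and `manifest_csv`, and the example publication label are all hypothetical and only show one way the ideas could fit together.

```python
import csv
import io
from datetime import date


def standard_name(experiment_date: date, sample: str, step: str, ext: str) -> str:
    """Build a file name that starts with the ISO date so files sort by experiment date.

    The date_sample_step pattern is an assumed convention, not a standard.
    """
    # Replace spaces with hyphens so the name is easy to handle on any system.
    sample = sample.strip().replace(" ", "-")
    step = step.strip().replace(" ", "-")
    return f"{experiment_date.isoformat()}_{sample}_{step}.{ext.lstrip('.')}"


def manifest_csv(entries: list[tuple[str, str, str]]) -> str:
    """Render (publication, figure, file name) rows as CSV text.

    Saving this text next to the data records which files sit behind each figure.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["publication", "figure", "file"])
    writer.writerows(entries)
    return buffer.getvalue()


# Hypothetical example: a raw image from an experiment run on 7 March 2014.
raw = standard_name(date(2014, 3, 7), "plate 02", "raw", "tif")
print(raw)
print(manifest_csv([("hypothetical-2015-paper", "Figure 2", raw)]))
```

Putting the date first in year-month-day order means that an ordinary alphabetical sort of the folder is also a chronological sort, which makes it easier to match files to dated lab notebook entries.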