RDS and Dr. Michelle Harris of the Biocore Honors Program partnered to bring research data management into the classroom and introduce it as a lifelong research skill to her undergraduates. With the advent of federal funding requirements and a general shift toward sharing and reproducibility, good data management is a critical skill for students. Introducing these concepts early on gives students a chance to adopt them as habits and incorporate them into their workflows. Below you’ll find a discussion of our approach to teaching data management this semester and how we can adapt these exercises to your classroom or lab.
Weren’t able to make it to our most recent talk? Check out the slides here. If you have any questions or comments, please contact RDS.
By Cameron Cook, Digital Curation Assistant and SLIS Graduate Student
On August 6th, 2015, I attended the afternoon workshop of the UW-Madison Teaching and Learning Retreat, which was led by University of Chicago’s Institutional Repository Manager, Amy Buckland. The focus of the daylong retreat was the intersections of scholarly communication and information literacy. Amy’s talk focused on issues of public access and libraries’ role in scholarly communication – both as content consumers and content creators.
What then, you might ask, does Amy’s talk have to do with researchers and the purpose of Research Data Services? The answer is simple, but it is a key concept for all of us involved in research and research data to keep in mind as we move forward. As Amy said, “the new normal will be public access.”
By Luke Bluma, IT Engagement Manager for the Campus Computing Infrastructure (CCI)
Data is a critical part of our lives here at UW-Madison. We collect, analyze, and share data every day to get our jobs done. Data comes in all shapes and sizes and it needs the right place to live. That’s where storage comes in.
However, storage can be a loaded term. It can mean a thumb drive, your computer’s hard drive, storage accessed via a server, cloud storage, or a large campus-wide storage service. It is all of these things, but not all of them will fit your needs. Your needs are what matter, and they will drive which solution(s) will work for you.
I am the Engagement Manager for the Campus Computing Infrastructure (CCI) initiative. I work with campus partners on their data center, server, storage and/or backup needs. Storage is currently a big focus for me, so I wanted to share some thoughts about evaluating potential storage solutions.
The main areas to think about are:
- What kinds of data are you working with?
- What are your “must-haves”?
- What storage options are available at UW-Madison?
What kinds of data are you working with?
This is the first big question you want to focus on because it drastically impacts what options are available to you. Are you working with FERPA data, sensitive data, restricted data, PCI data, etc.? Each of these will impact what service(s) you can or can’t utilize. For more information on Restricted Data see: https://www.cio.wisc.edu/security/about/campus-initiatives/restricted-data-security-standards/
What are your “must-haves”?
Once you have identified the types of data you are working with, it is crucial to determine your must-have requirements for a storage solution. Does it need to be secure? If so, how secure? Does it need to be accessed by people outside of UW-Madison? Does it need to be high performance storage? Does it need to scale to 20+ TB? Does it need to be accessible via the web? These are just example questions, and the key here is that there is no perfect storage solution. Some services do X, Y, and Z; others do X, Y, and A but not Z. So determining your “must-haves” will help you figure out which services you can work with, and which you can’t.
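To make the matching idea concrete, here is a minimal Python sketch that treats each service as a set of features and keeps only the services that cover every requirement. The service names and feature labels are made up for illustration; they are not real UW-Madison offerings, and a real evaluation would involve more than a yes/no feature list.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class StorageOption:
    """A hypothetical storage service and the features it offers."""
    name: str
    features: frozenset = field(default_factory=frozenset)


def matching_options(options, must_haves):
    """Return the options whose features cover every must-have."""
    required = frozenset(must_haves)
    return [opt for opt in options if required <= opt.features]


# Illustrative entries only; these are not real campus services.
OPTIONS = [
    StorageOption("Service A", frozenset({"secure", "external_sharing", "web_access"})),
    StorageOption("Service B", frozenset({"secure", "external_sharing", "high_performance"})),
    StorageOption("Service C", frozenset({"web_access"})),
]

if __name__ == "__main__":
    # Example requirements: the data must be secure and reachable via the web.
    needs = {"secure", "web_access"}
    for option in matching_options(OPTIONS, needs):
        print(option.name)
```

Writing the requirements down as an explicit list, even on paper, makes it easier to see which services drop out and why.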
What storage options are available at UW-Madison?
Now that you have identified the kinds of data and the “must-haves” for your solution, the final step is to evaluate what storage options are available to you at UW-Madison. Storage is an evolving technology, so specific services will change over time, but here are good places to start learning about the services available to you:
- Local IT – if you have a local IT group, then talk to them first about what local options may be available to you
- Campus Computing Infrastructure (CCI) – if you need network storage or server storage that isn’t focused on high performance computing then CCI has several options that could work depending on your needs
- Advanced Computing Initiative (ACI) – if you need to do high performance or high throughput computing then ACI has several options that could work depending on your needs
- Division of Information Technology (DoIT) – if you need cloud storage, like Box.com, or local storage, like an external hard drive, then DoIT has solutions that could work for you as well
This can seem like a lot to think about, and to be honest it can be quite confusing at times. The good news is that you have help! Research Data Services (RDS) can be a great starting point for your storage needs. We can focus on the key question: what are you looking to do? Then we can help you evaluate some potential options for moving forward based on your needs.
By Cid Freitag, Instructional Technology Program Manager at DoIT Academic Technology
If the data you need still exists;
If you found the data you need;
If you understand the data you found;
If you trust the data you understand;
If you can use the data you trust;
Someone did a good job of data management.
Rex Sanders ‐ USGS‐Santa Cruz*
Data management practices have been described in detail in a variety of documentation and tutorials, many of which focus on the specific needs and resources of the organization that produced them. The following is a selected list of resources that are general enough to apply across disciplines and beyond the university or agency that developed them.
Guides and Tutorials
- The University of Washington offers a well organized, comprehensive data management guide. Most of the resources listed are publicly available.
- Georgia Tech’s guide includes a webpage that aggregates the data management requirements of several federal funding agencies. Learn about data management requirements.
- Multiple authors contributed to the short guide “10 Simple Rules for the Care and Feeding of Scientific Data,” which offers practical advice for researchers on practices they can follow to manage their data for sharing and reuse.
- The USGS Data Management Training Modules are tailored to the needs of the USGS, but many of the practices are applicable to any discipline.
- In particular, three short narrated tutorials give overviews of the value of data management, planning, and best practices for preparing data to share.
Data Science MOOCs
Several Massive Open Online Courses (MOOCs) cover topics related to data analysis and research methods. Even if you choose not to do the coursework and earn a statement of completion, it’s easy to sign up for the courses, which gives you access to lectures and examples.
The Class Central website has curated a list of several data science and analysis methods MOOCs, developed by reputable sources.
The MOOCs listed here were developed through Johns Hopkins University and are offered through the Coursera platform. They are part of a Data Science Specialization series of courses, and they apply to data management practices beyond specific analytical techniques. Each of these courses lasts 4 weeks and is offered frequently. Currently, there is a new offering of each course starting each month from March through June 2015.
The Data Scientist’s Toolbox, Jeff Leek, Roger Peng, Brian Caffo
“The course gives an overview of the data, questions, and tools that data analysts and data scientists work with.” It focuses on a practical introduction to tools, using version control, Markdown, Git, GitHub, R, and RStudio.
Getting and Cleaning Data, Jeff Leek, Roger Peng, Brian Caffo
“This course will cover the basic ways that data can be obtained… It will also cover the basics of data cleaning and how to make data ‘tidy’… The course will also cover the components of a complete data set including raw data, processing instructions, codebooks, and processed data. The course will cover the basics needed for collecting, cleaning, and sharing data.” Tools used in this course: GitHub, R, RStudio
Reproducible Research, Jeff Leek, Roger Peng, Brian Caffo
“Reproducible research is the idea that data analyses, and more generally, scientific claims, are published with their data and software code so that others may verify the findings and build upon them… This course will focus on literate statistical analysis tools which allow one to publish data analyses in a single document that allows others to easily execute the same analysis to obtain the same results.” Tools: R Markdown, knitr
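The Getting and Cleaning Data description above names four components of a complete data set: raw data, processing instructions, codebooks, and processed data. As a rough illustration, the Python sketch below checks whether a project folder contains something for each of them. The folder names (`raw`, `scripts`, `codebook`, `processed`) are assumptions chosen for this example, not a convention prescribed by the course.

```python
from pathlib import Path

# Hypothetical folder names for the four components named in the course description.
COMPONENTS = {
    "raw data": "raw",
    "processing instructions": "scripts",
    "codebook": "codebook",
    "processed data": "processed",
}


def missing_components(dataset_dir):
    """Return the labels of components whose folder is absent or empty."""
    root = Path(dataset_dir)
    missing = []
    for label, folder in COMPONENTS.items():
        path = root / folder
        # A component counts as present only if its folder exists and holds at least one entry.
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(label)
    return missing


if __name__ == "__main__":
    # Check the current working directory and report anything that is missing.
    for label in missing_components("."):
        print(f"Missing: {label}")
```

A check like this could run before data is shared or deposited, so that gaps such as a missing codebook surface early.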
*Rex Sanders quote from: Environmental Data Management: CHALLENGES AND OPPORTUNITIES, Jamie Gerrard | March 2014
Looking for additional information about research data management? Contact us.
By Erin Carrillo, Information Services Librarian, Steenbock Library
In November, RDS held a two-day data management workshop for graduate student researchers. Participants came from several departments across campus, including Limnology, Entomology, Forest and Wildlife Ecology, Geography, and the Nelson Institute for Environmental Studies. They were part of a cohort of graduate students researching biodiversity conservation, funded by an NSF Integrative Graduate Education and Research Traineeship grant.
We planned the workshop with two graduate students, Kara Cromwell (Zoology) and Alex Latzka (Center for Limnology), who saw a need to provide new researchers with the knowledge and skills to navigate the changing research data landscape. From funder and publisher requirements for data management plans and data sharing, to the ongoing development of metadata standards and discipline-specific data repositories, researchers need to be aware of trends within their discipline and practice good data management from the outset. Kara and Alex also wanted to encourage and facilitate the sharing of research data within the group.
The workshop addressed several broad topics within data management, but content was tailored to the specific needs of the group. We administered a survey to the group at the beginning of the planning process to gauge students’ current knowledge of data management practices, as well as their specific needs. We identified several areas of focus and developed a module for each. Stephanie Hampton, a visiting scientist from Washington State University and former deputy director of NCEAS (National Center for Ecological Analysis and Synthesis), was invited by grad students in the Center for Limnology. She had recently published a few high-impact papers on the future of ecology, especially with respect to Big Data, and her short talk gave participants perspective on why sound data management will matter as they advance in their careers.
The final program was:
- Spreadsheets, Jan Cheetham, DoIT Academic Technology and Barry Radler, Institute on Aging
- File Organization, Elliott Shuppy, School of Library and Information Studies (SLIS)
- Storage & Preservation, Brianna Marshall, Digital Curation Coordinator; Luke Bluma, DoIT Storage & Backup; Elliott Shuppy
- Metadata, Corinna Gries, Center for Limnology, North Temperate Lakes Long Term Ecological Research (LTER)
- Data Management Plans, Corinna Gries
- Keynote talk by Stephanie E. Hampton, Kaeser Scholar, Washington State University, Director of the Center for Environmental Research, Education, and Outreach
We built in designated work time at the end of the first day to give participants an opportunity to apply what they had learned and collaborate with their colleagues. Module presenters were available to answer questions. Presenters deposited slide decks and other workshop materials in a Box folder that we shared with participants after the workshop.
We had participants complete a pre- and post-workshop survey to assess the effectiveness of the workshop. The results revealed that participants generally rated their ability to practice good data management higher after the workshop. We also got this positive feedback from Kara:
“Alex and I heard a lot of positive feedback throughout the workshop… The schedule flowed smoothly, the content was very well suited to the needs of the group, and all the modules were engaging. We really appreciate the time you invested, and I know everyone (including many who weren’t able to attend) will continue to take advantage of the resources posted in the Box folder. It was a definite success!”
It was a pleasure to work with Kara and Alex and their group, and we look forward to using what we learned from planning this workshop to organize similar workshops tailored to the needs of researchers in different disciplines across campus.
Is your lab or department interested in working with RDS to develop a discipline-specific data management workshop? Contact us.
By Elliott Shuppy
Research data management has quickly grown into a necessity for librarians on the UW-Madison campus. We understand that this topic can be complex and intimidating, so we wanted to provide resources on some of the most important topics that librarians may be curious about. Compiled below are links for liaisons to explore, reference, and further equip themselves for reference inquiries and conversations around data.
What is data?
This might be a scary question to some, but it is one with very important implications. See how Minnesota and Oregon have responded.
- University of Minnesota: http://www.lib.umn.edu/datamanagement/whatdata
- University of Oregon: http://library.uoregon.edu/datamanagement/datadefined.html
Why manage data?
MIT and Minnesota lay out plainly the benefits of data management for researchers.
- MIT: http://libraries.mit.edu/data-management/plan/why/
- Minnesota: http://www.lib.umn.edu/datamanagement
What is a data management plan?
These links provide fairly comprehensive lists of required components and descriptions of data management plans.
- University of Minnesota: http://www.lib.umn.edu/datamanagement/DMP
- MIT: http://libraries.mit.edu/data-management/plan/
Questions to ask
Helpful sets of questions for librarians to consider when conducting data-related interviews with patrons can be found in the links below.
- RDS: http://researchdata.wisc.edu/wp-content/uploads/2010/04/data_plan_guide.pdf
- Purdue University: http://docs.lib.purdue.edu/cgi/viewcontent.cgi?article=1092&context=lib_research
- MIT: http://libraries.mit.edu/data-management/plan/write/
Terms & definitions
Both Minnesota and DataONE offer extensive glossaries of useful terminology for anyone dealing with data matters.
- University of Minnesota: http://www.lib.umn.edu/datamanagement/whatdata#What4
- DataONE: http://www.dataone.org/sites/all/documents/DataONE_BP_Primer_020212.pdf
Federal requirements for data
In early 2013, the White House Office of Science and Technology Policy (OSTP) released a mandate requiring public access for federally funded research data. The Department of Energy was the first of many departments to release its requirements for researchers, which take effect October 1, 2014.
- An overview of the OSTP mandate: http://wapo.st/1v8kKnj
- DOE’s Public Access Plan: http://www.energy.gov/downloads/doe-public-access-plan