Top 5 Data Management Tips for Undergraduates

by Cameron Cook

With fall well under way on campus and final projects just around the corner, it’s the perfect time to review our top five data management tips for undergrads! As an undergraduate, you may not think data management is important, but giving it a few moments of your day will help keep your assignments safe, even in the face of a hard drive meltdown the night before a due date.

If keeping your final projects safe isn’t enough of an incentive, here is one more: undergraduate publishing opportunities. As you learn and grow as a researcher, you can publish your work in a number of undergraduate research journals. Practicing good data management will help keep your research reproducible, understandable, findable, and organized for when you submit your work to a journal.

1) Clear, consistent file naming and structure

In other words, know where your data lives. Keep file names short, simple, and descriptive. Include dates in a standardized format (such as YYYY-MM-DD) to version your files so that you can always go back to a previous copy in case of mistakes. Keep files in a consistent, clear structure with easy-to-follow labels (such as date, file type, instrument, or analysis type) so that you never misplace an important file.
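
As a rough illustration of the naming advice above, here is a minimal Python sketch that builds a short, descriptive file name ending in a standardized date. The function name `dated_filename` and the hyphen/underscore pattern are assumptions made for this example, not a required convention.

```python
from datetime import date


def dated_filename(description: str, extension: str, when: date) -> str:
    """Build a short, descriptive file name that ends in an ISO 8601 date."""
    # Lowercase the description and join its words with hyphens so the
    # name contains no spaces.
    slug = "-".join(description.lower().split())
    # YYYY-MM-DD sorts chronologically when files are listed by name.
    return f"{slug}_{when.isoformat()}.{extension.lstrip('.')}"


print(dated_filename("Soil pH survey", "csv", date(2014, 11, 3)))
# prints: soil-ph-survey_2014-11-03.csv
```

Saving each new version under a new date leaves the earlier copies untouched, so there is always a previous version to go back to.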

Data Archiving Platforms: Figshare

by Brianna Marshall, Digital Curation Coordinator

This is part three of a three-part series where I explore platforms for archiving and sharing your data. Read the first post in the series, focused on UW’s institutional repository, MINDS@UW, or read the second post, focused on the data repository Dryad.

To help you better understand your options, here are the areas I will address for each platform:

  • Background information on who can use it and what type of content is appropriate
  • Options for sharing and access
  • Archiving and preservation benefits the platform offers
  • Whether the platform complies with the forthcoming OSTP mandate

figshare

About

figshare is a discipline-neutral platform for sharing research in many formats, including figures, datasets, media, papers, posters, presentations and filesets. All items uploaded to figshare are citable, shareable and discoverable.

Sharing and access

All publicly available research outputs are stored under Creative Commons Licenses. By default, figures, media, posters, papers, and filesets are available under a CC-BY license, datasets are available under CC0, and software/code is available under the MIT license. Learn more about sharing your research on figshare.

Archiving and preservation

figshare notes that items will be retained for the lifetime of the repository and that its sustainability model “includes the continued hosting and persistence of all public research outputs.” Research outputs are stored directly in Amazon Web Services S3 buckets. Data files and metadata are backed up nightly and replicated into multiple copies in the online system. Learn more about figshare’s preservation policies.

OSTP mandate

The OSTP mandate requires all federal funding agencies with over $100 million in R&D funds to do more to make grant-funded research outputs accessible. This will likely mean that data must be publicly accessible and have an assigned DOI (though you’ll need to check with your funding agency for the exact requirements). figshare mints a DataCite DOI for every uploaded item, so as long as your data is set to public, figshare is a good candidate for complying with the mandate.

Visit figshare.

Have additional questions or concerns about where you should archive your data? Contact us.

Data Archiving Platforms: Dryad

by Brianna Marshall, Digital Curation Coordinator

This is part two of a three-part series where I explore platforms for archiving and sharing your data. Read the first post in the series, focused on UW’s institutional repository, MINDS@UW.

To help you better understand your options, here are the areas I address for each platform:

  • Background information on who can use it and what type of content is appropriate
  • Options for sharing and access
  • Archiving and preservation benefits the platform offers
  • Whether the platform complies with the forthcoming OSTP mandate

Dryad

About

Dryad is a repository appropriate for data that accompanies published articles in the sciences or medicine. Many journals partner with Dryad to provide submission integration, which makes linking the data between Dryad and the journal easy for you. Pricing varies depending on the journal you are publishing in; some journals cover the data publishing charge (DPC) while others do not. Read more about Dryad’s pricing model or browse the journals with sponsored DPCs.

Sharing and access

Data uploaded to Dryad are made available for reuse under the Creative Commons Zero (CC0) license. There are no format restrictions to what you upload, though you are encouraged to use community standards if possible. Your data will be given a DOI, enabling you to get credit for sharing.

Archiving and preservation

According to the Dryad website, “Data packages in Dryad are replicated across multiple systems to support failover, improve access times, allow recovery from disk failures, and preserve bit integrity. The data packages are discoverable and backed up for long-term preservation within the DataONE network.”

OSTP mandate

The OSTP mandate requires all federal funding agencies with over $100 million in R&D funds to do more to make grant-funded research outputs accessible. This will likely mean that data must be publicly accessible and have an assigned DOI (though you’ll need to check with your funding agency for the exact requirements). As long as the data you need to share is associated with a published article, Dryad is a good candidate for OSTP-compliant data: it mints DOIs and makes data openly available under a CC0 license.

Visit Dryad.

Have additional questions or concerns about where you should archive your data? Contact us.

Building a Practical DM Foundation

By Elliott Shuppy, Master’s Candidate, School of Library and Information Studies

In addition to being an active research lab on the UW-Madison campus, the Laboratory for Optical and Computational Imaging (LOCI) develops many experimental instrumentation techniques and the software to support them. One major development is the database platform OMERO, which stands for Open Microscopy Environment Remote Object. OMERO is an open, consortium-driven software package for viewing, organizing, sharing, and analyzing image data. One hiccup is that it’s not widely used at LOCI.

Having identified this problem, my mentor Kevin Eliceiri, LOCI director, and I thought it would be a good idea for me to develop expertise in this software as a project for ZOO 699 and figure out how to incorporate it into a researcher workflow at LOCI. On-site researcher Jayne Squirrel was the ideal candidate: she is a highly organized researcher working in the lab, which gave us an excellent use case. Before we could insert OMERO into her workflow, we had to establish some formal, foundational data management practices that will carry over to her use of OMERO.

We identified four immediate needs:

  • A simple and consistent folder structure
  • A way to identify all associated files
  • An ID system that can be used in the OMERO database
  • Documentation

We then developed solutions to meet each need. The first solution was a formalized folder structure, which we chose to organize by Jayne’s workload:

Lab\Year (YYYY)\Project\Sub-project\Experiment\Replicates\Files

This folder structure will help organize and regularize naming of files and data sets not only locally and on the backup server, but also within the OMERO platform.
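
The folder hierarchy above can be assembled programmatically so that every level is always in the same order. The following is only a sketch: the function name `replicate_dir` and the example folder names are hypothetical, and `PureWindowsPath` is used simply because the structure above is written with backslashes.

```python
from pathlib import PureWindowsPath


def replicate_dir(lab, year, project, sub_project, experiment, replicate):
    """Return the folder that holds the files for one replicate.

    Levels follow the structure described above:
    Lab / Year (YYYY) / Project / Sub-project / Experiment / Replicates
    """
    return PureWindowsPath(
        lab, f"{year:04d}", project, sub_project, experiment, replicate
    )


# Hypothetical folder names, for illustration only.
path = replicate_dir("Lab", 2014, "ProjectA", "SubProject1", "Experiment02", "R1")
print(path)
```

A pure path object only builds the name; it does not create anything on disk, so the same function can be reused to name locations on a local drive or a backup server.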

To identify all files associated with a particular experiment, we developed a unique identifier that we termed the Experiment ID. This identifier leads each file name and consists of the following values: the initial of the collaborating lab (O or H), followed by a numerical sequence built from the two-digit year and month, the experiment’s series number within that month, and the replicate number.

Example: O_1411_02_R1

The example reads Ogle lab, 2014, November, second experiment (within the month of November), replicate one. Incorporating this ID into file names will help to identify and recall data sets of a particular experiment and any related files such as processed images and analyses.
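
To make the ID layout concrete, here is a minimal Python sketch that formats and parses an Experiment ID. The names `ExperimentID`, `format_experiment_id`, and `parse_experiment_id` are hypothetical, and the sketch assumes what the single example suggests: a two-digit year and month, a two-digit series number, and that two-digit years fall in the 2000s.

```python
import re
from typing import NamedTuple


class ExperimentID(NamedTuple):
    lab: str        # initial of the collaborating lab, "O" or "H"
    year: int       # four-digit year
    month: int      # 1-12
    series: int     # experiment number within the month
    replicate: int  # replicate number


# Layout inferred from the example O_1411_02_R1: <lab>_<YYMM>_<series>_R<replicate>
_PATTERN = re.compile(r"^([OH])_(\d{2})(\d{2})_(\d{2})_R(\d+)$")


def format_experiment_id(eid: ExperimentID) -> str:
    """Render an ExperimentID in the form O_1411_02_R1."""
    return (
        f"{eid.lab}_{eid.year % 100:02d}{eid.month:02d}"
        f"_{eid.series:02d}_R{eid.replicate}"
    )


def parse_experiment_id(text: str) -> ExperimentID:
    """Split an Experiment ID string back into its parts."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a valid Experiment ID: {text!r}")
    lab, yy, mm, series, replicate = match.groups()
    if not 1 <= int(mm) <= 12:
        raise ValueError(f"month out of range in {text!r}")
    # Assumption: two-digit years belong to the 2000s.
    return ExperimentID(lab, 2000 + int(yy), int(mm), int(series), int(replicate))


print(format_experiment_id(ExperimentID("O", 2014, 11, 2, 1)))  # O_1411_02_R1
print(parse_experiment_id("O_1411_02_R1"))
```

Because the ID leads each file name, a check like `parse_experiment_id` can also be used to flag files whose names do not start with a well-formed ID.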

Further, both the file organization and the Experiment ID can aid organization and identification within OMERO. The database platform has two levels of nesting. The folder is the top tier; within each folder a dataset can be nested; each dataset contains a number of images. So, we can adapt the folder structure naming to organize files and datasets and apply the unique identifier to name uploaded image objects. These upgrades make searching more robust and similar in process to local drive searches.

Lastly, we developed documentation for reference. We realized that Experiment IDs need to be accessible at the prep bench and microscope, so we created a mobile-accessible spreadsheet containing information on each experiment. We termed this document the Experimental Worksheet, and it contains the following information:

  • Experiment ID
  • Experiment Description
  • Experiment Start Date
  • Project Name
  • Sub-project Name
  • Notes

This document will act as a quick reference of bare-bones experiment information for Jayne and her student workers. We also realized that Jayne’s student workers need to know what the processes are in each step of her workflow, so we developed step-by-step procedures and policy for each phase of the workflow. These procedural and policy documents set management expectations and conduct for Jayne’s data. Now, with such a data management foundation laid, the next step is to return to our root problem: discerning how Jayne can best benefit from using OMERO and where it makes sense in her workflow.
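
For illustration, the Experimental Worksheet columns listed above could be kept as a plain CSV file that opens in any spreadsheet application. This is a sketch only: the column names come from the list above, while the function name `worksheet_csv` and every value in the example row other than the Experiment ID are placeholders.

```python
import csv
import io

# Column names taken from the Experimental Worksheet list above.
FIELDS = [
    "Experiment ID",
    "Experiment Description",
    "Experiment Start Date",
    "Project Name",
    "Sub-project Name",
    "Notes",
]


def worksheet_csv(rows):
    """Return the worksheet as CSV text: one header line, one line per experiment."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder values for illustration; only the ID format comes from the text.
example_row = {
    "Experiment ID": "O_1411_02_R1",
    "Experiment Description": "placeholder description",
    "Experiment Start Date": "2014-11-01",
    "Project Name": "placeholder project",
    "Sub-project Name": "placeholder sub-project",
    "Notes": "",
}

print(worksheet_csv([example_row]))
```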

Let’s Talk About Storage

By Luke Bluma, IT Engagement Manager for the Campus Computing Infrastructure (CCI)

Data is a critical part of our lives here at UW-Madison. We collect, analyze, and share data every day to get our jobs done. Data comes in all shapes and sizes and it needs the right place to live. That’s where storage comes in.

However, storage can be a loaded term. It can mean a thumb drive, your computer’s hard drive, storage accessed via a server, cloud storage, or a large campus-wide storage service. It is all of these things, but not all of them will fit your needs. Your needs are what matter, and they will drive which solution(s) will work for you.

I am the Engagement Manager for the Campus Computing Infrastructure (CCI) initiative. I work with campus partners on their data center, server, storage and/or backup needs. Storage is currently a big focus for me, so I wanted to share some thoughts about evaluating potential storage solutions.

Storage for CCI

The main areas to think about are:

  • What kinds of data are you working with?
  • What are your “must-haves”?
  • What storage options are available at UW-Madison?

What kinds of data are you working with?

This is the first big question you want to focus on because it drastically impacts what options are available to you. Are you working with FERPA data, sensitive data, restricted data, PCI data, etc.? Each of these will impact what service(s) you can or can’t utilize. For more information on Restricted Data see: https://www.cio.wisc.edu/security/about/campus-initiatives/restricted-data-security-standards/

What are your “must-haves”?

Once you have identified the types of data you are working with, it is crucial to determine your must-have requirements for a storage solution. Does it need to be secure? If so, how secure? Does it need to be accessed by people outside of UW-Madison? Does it need to be high-performance storage? Does it need to scale to 20+ TB? Does it need to be accessible via the web? These are just example questions, and the key here is that there is no perfect storage solution. Some services do X, Y, and Z, while others do X, Y, and A but not Z. Determining your “must-haves” will help you figure out which services you can work with and which you can’t.
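
One way to picture this matching step is as a simple set comparison: a service is a candidate only if it offers every one of your must-haves. The sketch below is purely illustrative. The service names are hypothetical placeholders, not real campus services, and X, Y, Z, and A stand in for features as in the paragraph above.

```python
def matching_services(must_haves, services):
    """Return the names of services whose features cover every must-have."""
    required = set(must_haves)
    return sorted(
        name for name, features in services.items() if required <= set(features)
    )


# Hypothetical services and features, mirroring the X/Y/Z/A example above.
services = {
    "Service 1": {"X", "Y", "Z"},
    "Service 2": {"X", "Y", "A"},
}

print(matching_services({"X", "Z"}, services))  # ['Service 1']
print(matching_services({"X", "Y"}, services))  # ['Service 1', 'Service 2']
print(matching_services({"Z", "A"}, services))  # []
```

The last call shows the "no perfect solution" case: when no single service covers everything, you either relax a requirement or combine services.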

What storage options are available at UW-Madison?

Now that you have identified the kinds of data and the “must-haves” for your solution, the final step is to evaluate what storage options are available to you at UW-Madison. Storage is an evolving technology, so specific services will change over time, but here are good places to start learning about the services available to you:

  • Local IT – if you have a local IT group, then talk to them first about what local options may be available to you
  • Campus Computing Infrastructure (CCI) – if you need network storage or server storage that isn’t focused on high performance computing then CCI has several options that could work depending on your needs
  • Advanced Computing Initiative (ACI) – if you need to do high performance or high throughput computing then ACI has several options that could work depending on your needs
  • Division of Information Technology (DoIT) – if you need cloud storage, like Box.com, or local storage, like an external hard drive, then DoIT has solutions that could work for you as well

This can seem like a lot to think about, and to be honest it can be quite confusing at times. The good news is that you have help! Research Data Services (RDS) can be a great starting point for your storage needs. We can focus on the key question: what are you looking to do? Then we can help you evaluate some potential options for moving forward based on your needs.

To get started, contact RDS at http://researchdata.wisc.edu/help/contact-us/ or contact me at cci@cio.wisc.edu

Manage Your Data with LabArchives

By Jan Cheetham, Research and Instructional Technologies Consultant, DoIT

LabArchives is an ELN (Electronic Lab Notebook) that provides data storage, data documentation, collaboration, and export features. Like traditional paper lab notebooks, an ELN can serve as a continuous and complete record of the research process.

Basics

Collaboration and Sharing

LabArchives provides flexible permissions and roles for lab members and their collaborators. It is recommended that PIs assume the Owner role in all their lab’s notebooks, in alignment with UW-Madison’s Policy on Data Stewardship, Access, and Retention and to ensure that no data is lost when lab members graduate or leave the university.

There are several approaches for organizing notebooks and managing edit/read rights of individuals. Permissions can be set at the level of the notebook, page, or entry. It is also possible for individuals in the Owner or Admin role to share notebooks, pages, and entries with collaborators outside the university. Although LabArchives has a method for creating Digital Object Identifiers (DOIs) for notebooks, this requires making the notebook publicly available. The UW-Madison LabArchives site currently has the public sharing feature turned off as a security measure to prevent inadvertent sharing of notebooks.

The ELN provides a timestamp and record of every user action, creating an electronic record of who added or edited an entry and when. In addition, nothing can be permanently deleted from the ELN. (LabArchives allows you to move a notebook, page, or entry to a Delete Bin; however, these items are not actually deleted and can be recovered at any time.)

Organizing and Documenting

The ability to blend digital data with the human-readable narrative of the research process is one of the main advantages of an ELN over other file sharing/storage services or hybrid paper/electronic systems. LabArchives has a number of different entry types for entering data and recording the narrative. Below are a few suggestions that will help ensure that the information you enter in LabArchives can be readily retrieved.

Naming conventions
LabArchives currently does not offer a way to browse through folders or pages chronologically. Therefore, you may want to use naming conventions for pages (and possibly folders). Names should contain elements such as a project name, date, and experiment identifier. For more specific suggestions, see naming conventions in an ELN. It is also a good idea to use similar naming conventions for files you attach or link to in the ELN, to make it easier to trace through versions and locate files that have been transformed.

Documenting attached files
In LabArchives, you upload and attach a single data file to an attachment entry on a page. The file can be of any type and up to 250 MB in size. The entry will display the name of the attached file, and you can also enter a description with detailed information (metadata) about the file. When you upload a new version of the file to the same entry, LabArchives retains all prior versions and lets you revert to older versions through the entry’s revision history. However, as noted below, only the most recent version is included in HTML export. Therefore, to ensure that all data files that you or someone else would need to reproduce your findings are archived both inside the ELN and in HTML exports, be sure to create a separate attachment entry for each essential file that needs to be retained in its original, unaltered form. Then, new versions of the data file (in which the original data are cleaned, transformed, analyzed, visualized, etc.) should be added to the ELN as one or more new entries.

Documenting linked files
When data files are too big (>250 MB) or too numerous to attach to the ELN, you can create links to them from within a rich text entry. However, LabArchives does not check links or verify locations, so you will need to ensure the files are in a secure and permanent location. It is also a good practice to record the name of the file and its location directly in the rich text entry since the URL you add when you create a link is not directly visible in the entry.
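
The attach-or-link decision described above can be sketched as a small pre-upload check. This does not call any LabArchives API; the function names `entry_type` and `link_note`, the example file name and location, and the choice to count 1 MB as 1,000,000 bytes are all assumptions made for illustration.

```python
# Assumption: 1 MB is counted as 1,000,000 bytes; the source only says "250 MB".
ATTACHMENT_LIMIT_BYTES = 250 * 1000 * 1000


def entry_type(size_bytes: int) -> str:
    """Suggest "attachment" for files within the size limit, otherwise "link"."""
    if size_bytes < 0:
        raise ValueError("file size cannot be negative")
    return "attachment" if size_bytes <= ATTACHMENT_LIMIT_BYTES else "link"


def link_note(file_name: str, location: str) -> str:
    """Text to paste into the rich text entry so the name and location stay visible."""
    return f"File: {file_name}\nLocation: {location}"


print(entry_type(10_000_000))   # attachment
print(entry_type(300_000_000))  # link
# Hypothetical file name and location, for illustration only.
print(link_note("O_1411_02_R1_raw.tif", "example-server/Lab/2014/ProjectA"))
```

Recording the file name and location as plain text, as `link_note` does, keeps that information readable in the entry even though the link URL itself is not displayed.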

Exporting and Archiving

LabArchives has two export formats, PDF and HTML. The PDF version is similar to a scanned paper notebook page. The HTML version lacks some of the appearance of the notebook but contains more complete information, including attached files. As with any digital platform you use for your research data, you will want to have a backup and archival plan. This should take into account how often you make changes to the notebook and include methods for retaining duplicate copies of important data files in alternate locations.

PDF
PDFs can be created for a single entry, a page, or an entire notebook. PDFs include: text entries, thumbnails of images and widgets, annotations and descriptions of attachments, user name and time/date stamps. They do not include: attached files, version history of attachments, or comments. URLs of links in rich text entries may be retrievable, depending on the application you use to read the PDF.

HTML
The HTML option exports an entire notebook. Each page in the notebook is a separate HTML file and the most recent version of each attached file is also included. This export option also does not include version history of attachments or comments. Again, URLs that you add to create links in rich text entries may be retrievable, depending on the browser you use to read the HTML pages.

Do you have additional questions or concerns about electronic lab notebooks? Contact us.

Data Archiving Platforms: MINDS@UW

by Brianna Marshall, Digital Curation Coordinator

This is part one of a three-part series where I explore platforms for archiving and sharing your data. To help you better understand your options, here are the areas I will address for each platform:

  • Background information on who can use it and what type of content is appropriate
  • Options for sharing and access
  • Archiving and preservation benefits the platform offers
  • Compliance with the forthcoming OSTP mandate

MINDS@UW

About

MINDS@UW is the University of Wisconsin’s institutional repository, intended to capture, archive, and provide access to scholarship originating from campus researchers of any discipline. It is supported by the UW Libraries and free for all UW-affiliated researchers to use. While a wide variety of file formats are supported, this platform is best suited to handling text-based formats.

Sharing and access

Items in the repository are given a permanent URL that can be used to share the item; however, DOIs are not minted at this time. Items can be made open access (accessed free of charge by anyone, anywhere, at any time) or they can be embargoed (no access is provided until a certain time, up to a few years, has passed). Embargoed items are still discoverable since the metadata is indexed in the repository but the content will not be visible.

Archiving and preservation

The Libraries are committed to long-term preservation of all MINDS@UW items. In addition to the current backup practices in place, the Libraries are collaborating with the UW-Madison Office of the CIO to design and pilot a campus-scaled digital preservation infrastructure. This service, and the Libraries’ own preservation repositories, will eventually be aligned with the Digital Preservation Network (DPN).

OSTP mandate

The OSTP mandate requires all federal funding agencies with over $100 million in R&D funds to do more to make grant-funded research outputs accessible. This will likely mean that data must be publicly accessible and have an assigned DOI (though you’ll need to check with your funding agency for the exact requirements). Because MINDS@UW cannot provide a DOI at this time, it is not a suitable place for data that must comply with the mandate.

The UW Libraries are always looking to improve this platform to better fit the needs of researchers. If you have a question, comment, or suggestion related to MINDS@UW, please contact repository manager Brianna Marshall.

Visit MINDS@UW.

Have additional questions or concerns about where you should archive your data? Contact us.