Tools: Interview with Mark Igra, LabKey Server

Earlier this year, I spoke with Mark Igra, a partner at LabKey Software, and learned more about how LabKey Server works and how it’s used by researchers.

Q: What types of research is LabKey Server suited for?

Mark: LabKey Server helps teams of scientists bring together many different kinds of information from different sources for integrated analysis, secure sharing and collaboration.

Analysis across large datasets is a common need in fields that generate large volumes of data from high throughput techniques, such as proteomics and genomics. But geneticists also use LabKey Server to store phenotypes. Microscopists use it to document and point to high-resolution images. And clinical researchers use it to track diagnostic and other clinical data. Scientists across many fields of biomedical research face common challenges in managing, integrating and securely sharing their data.

An overview of the kinds of data, analyses, and collaborations LabKey Server supports.

LabKey Server helps scientists integrate, analyze, and share many different kinds of research information through a secure web portal. Collaborators can only view the data they have permissions to see.

Q: What types of data files can be used for analysis in LabKey Server?

Mark: A wide range of data file formats, particularly tabular formats such as MS Excel spreadsheets.

Q: What’s involved in running an analysis on data files in LabKey Server?

Mark: LabKey Server provides web-based tools for analyzing and visualizing data. For example, you can use interactive data grids to filter, sort, and join tabular data from multiple experiments. You can also write R scripts to run analyses within LabKey Server and conduct SAS or SQL queries on data you are authorized to view.
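As a rough illustration of the kind of filter, sort, and join operations Mark describes, here is a minimal sketch in plain Python with made-up subject data — this is the general idea, not LabKey's own tools or API:

```python
# Hypothetical rows from two experiments, keyed by subject ID -- a
# stand-in for the tabular data a LabKey grid would display.
assay = [
    {"SubjectId": "S1", "Titer": 120},
    {"SubjectId": "S2", "Titer": 45},
    {"SubjectId": "S3", "Titer": 310},
]
clinical = {"S1": "A", "S2": "B", "S3": "A"}  # SubjectId -> TreatmentGroup

# Join the two tables, then filter to one treatment group and sort --
# the same operations an interactive grid exposes in the browser.
merged = [dict(row, TreatmentGroup=clinical[row["SubjectId"]]) for row in assay]
group_a = sorted(
    (row for row in merged if row["TreatmentGroup"] == "A"),
    key=lambda row: row["Titer"],
)
```

In LabKey Server itself, the same result would come from a grid filter or a SQL query over tables the user is authorized to view.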

This image shows a data grid in LabKey being filtered by treatment group.

Interactive data grids support sorting, filtering, adding/removing columns, data export, and a variety of visualization and analysis options, such as R scripting. This image shows a data grid being filtered by treatment group. A live view of this grid:

Screenshots showing the R script, the grid containing data used in the R analysis, and the plotted results of the analysis.

LabKey Server’s built-in interface for R scripting helps users create and share R-based analyses and visualizations through the web-based portal. Users with sufficient security credentials can explore alternative analyses by editing existing scripts and saving private copies. As shown here, source data, scripts and script results (“views”) are displayed on separate tabs. A live view:

A screenshot of a Chart Wizard in LabKey server showing how data types are filtered with checkboxes.

Chart wizards make it easy to produce interactive plots of results. This time-based chart shows progression relative to baseline for several cohorts. The checkboxes on the right allow users to filter the data displayed. A live version of this chart:

Q: How does analysis across spreadsheet data from multiple experiments or multiple labs work? What if each lab or experiment had a different way of naming columns or coding data values in spreadsheets?

Mark: It’s relatively easy to remap column names or set up aliases when you import each spreadsheet into LabKey Server. Inconsistent data types are a bigger problem. If the data values come from lookups, there are a few ways you can fix that by writing a script in LabKey Server. However, analysis across spreadsheet data always works best when spreadsheet data are simple and coded in consistent ways. When we help research groups set up their LabKey Servers, we usually help them define consistent ways of coding data and variable names.
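The column remapping Mark describes can be sketched in plain Python; the lab spreadsheets, column names, and alias map below are hypothetical, not a real LabKey schema or API:

```python
# Spreadsheets from two labs coding the same variables differently
# (hypothetical names).
lab1 = [{"subj_id": "S1", "wbc_count": 6.1}]
lab2 = [{"SubjectID": "S2", "WBC": 7.4}]

# Alias map applied at import time: every source column name points
# at one canonical name, so imported rows line up for analysis.
ALIASES = {"subj_id": "SubjectId", "SubjectID": "SubjectId",
           "wbc_count": "WBC", "WBC": "WBC"}

def remap(rows):
    """Rename each row's columns to their canonical names."""
    return [{ALIASES[k]: v for k, v in row.items()} for row in rows]

combined = remap(lab1) + remap(lab2)
```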

Screenshot showing how values for data from pre-defined vocabularies are selected during data input.

To ensure standardized data entry, administrators can configure table fields as lookups to pre-defined lists of vocabulary. Users must then pick from a predefined list of terms when entering data in this field. The screenshot shows an example of how a field is configured as a lookup with a default value.
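The lookup behavior described above can be sketched in plain Python; the vocabulary terms, default value, and function name here are hypothetical illustrations, not LabKey code:

```python
# A hypothetical controlled vocabulary for a "Species" field, mirroring
# how a lookup field restricts entry to a predefined term list.
SPECIES = {"Mus musculus", "Rattus norvegicus", "Macaca mulatta"}
DEFAULT = "Mus musculus"

def validate_species(value=None):
    """Accept only terms from the vocabulary; fall back to the
    configured default when no value is entered."""
    if value is None:
        return DEFAULT
    if value not in SPECIES:
        raise ValueError(f"{value!r} is not in the Species vocabulary")
    return value
```

The payoff is the same as with a lookup field: free-text variants ("mouse", "M. musculus") never enter the table, so downstream queries don't have to guess at spellings.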

Q: How does LabKey Server differ from Electronic Lab Notebook software?

Mark: ELNs are based on the traditional lab notebook paradigm: a place to describe and store information about each experiment. LabKey Server is a tool for loading data and descriptive metadata in a structured way so you can compare and analyze across large volumes of data using the power of a database.

LabKey Server can be used like an ELN. For example, you can create data structures for specific experiment types, like a chemistry assay, then load data files from individual experiments, adding annotations about specific parameters for each experiment. LabKey Server can then read the contents of the data files, perform transformations and visualizations across experiments, and populate the underlying database with the transformed data. It can also compare the quality of results from different experiments and show you any trends in quality due to differences in reagents or other conditions, as in the example below.

A screenshot showing how data quality from 10 experimental runs of the same assay is visualized in LabKey Server.

LabKey Server can help with experimental quality control by visualizing the progression of quality metrics over time. This figure shows a Levey-Jennings plot for a quality metric for a Luminex assay across 10 experimental runs, enabling early detection of problematic trends and outliers. A live version of this chart:
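The control limits behind a Levey-Jennings plot are simple to compute: each run is judged against the mean and ±2/±3 standard-deviation bands. A minimal Python sketch, with made-up quality-metric values rather than real assay data:

```python
import statistics

# Hypothetical quality-metric values from 10 runs of the same assay.
runs = [98.2, 101.5, 99.8, 100.4, 97.9, 102.1, 99.5, 100.9, 98.7, 107.3]

# A Levey-Jennings chart plots each run against the mean and control
# limits at +/- 2 and +/- 3 standard deviations.
mean = statistics.mean(runs)
sd = statistics.stdev(runs)
limits = {k: (mean - k * sd, mean + k * sd) for k in (2, 3)}

# Flag runs outside the 2-SD warning limits -- candidate outliers.
lo, hi = limits[2]
outliers = [(i + 1, v) for i, v in enumerate(runs) if not lo <= v <= hi]
```

With these numbers, only the last run falls outside the 2-SD band — exactly the kind of drift such a chart is meant to surface early.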

Q: So, researchers can write custom scripts for data analysis and other steps in LabKey Server. Does LabKey Server work like a code repository?

Mark: LabKey Server isn’t a code repository per se. For example, it doesn’t have a built-in versioning system for code. It does audit changes in configuration and security. And it can show you what code was used to run an analysis. Anyone who is writing code to deploy on a LabKey Server should follow best practices for code versioning and use a version control system that is external to the LabKey Server.

Tools: Google Takeout

The UW-Madison has implemented a utility for exporting your files out of your UW Google Drive account (as well as YouTube and Google Contacts) in one step. This is useful for archiving files in your account if you are leaving the University or if you want a copy of the files to place in another location. Takeout doesn’t delete the files; it creates a copy, so if you need to delete them, you will need to do that directly in Google Drive.

See Exporting Data Using Google Takeout for instructions on how to do this.

Google Takeout creates a zipped folder, named <yourNetID>, which you can download from the browser.


DMPTool Webinar Series Continues


DMPTool Webinar Series Brown Bag

Join us for a ~15-part webinar series on the Data Management Planning Tool, DMPTool, from the California Digital Library.  This series will introduce the tool, discuss how to use it effectively, and describe how it can be customized for institutional needs.  Librarians, staff, and information professionals interested in promoting the use of the DMPTool by researchers are encouraged to attend.

DMPTool wiki

More information on the DMPTool webinar series.

Webinar 1: Introduction to DMPTool. Recorded May 28th.  Slides also available.

Webinar 2: Learning about data management: Resources, tools, and materials you can use. Recorded June 4th.  Slides and bibliography also available.

Webinar 3:  Customizing the DMPTool for your institution. Recorded June 18th.  Slides also available.

Webinar 4: Environmental Scan:  Identify stakeholders and partners in data management. Recorded June 25th.  Slides also available.

Webinar 5: Promoting institutional services with the DMPTool (EZID as an example). Recorded July 9th.  Slides also available.

Webinar 6:  Health Sciences & DMPTool – Lisa Federer, UCLA.  Recorded July 16th.  Slides also available.

Webinar 7: Digital humanities and the DMPTool – Miriam Posner, UCLA.  Recorded July 30th.  Slides also available.

Webinar 8, Tuesday, August 13, 12-1pm, 126 Memorial Library – Data curation profiles and the DMPTool – Jake Carlson

Tools: LabKey Server

LabKey Server is an open source data management platform designed for organizing and managing data from large-scale research; for example, data from thousands of samples and/or subjects. It provides a secure environment for collaborators at different locations to share, combine, and query data. It is an extensible platform, allowing developers to create custom applications for data analysis and visualization through its API (application programming interface).

LabKey has been used in several biomedical research communities to integrate and analyze data from high throughput assays conducted in distributed labs, including the Immune Tolerance Network, the Atlas data portal for HIV Vaccine studies, and others.

LabKey is currently in use at the UW-Madison Primate Center.

Tools: SpiderOak

What It Is: Cloud-based file storage, synchronization, and back-ups. SpiderOak is available on Windows, Linux, OS X, iOS, Android, and N900 Maemo.

Cost: Free, premium, and enterprise accounts available. The pricing for storage is better than Dropbox’s: $10/month gets you 100GB at SpiderOak vs. 50GB from Dropbox. SpiderOak also has no maximum storage limit. Additionally, it offers a 50% educational discount to anyone with a valid .edu email address.

Ease of Use: SpiderOak’s forte is security, not interface design. The web and mobile interfaces are fairly plain and not nearly as user-friendly as Dropbox’s. Additionally, while Dropbox has a very simple setup (everything goes in the Dropbox folder and syncs to all your devices unless you tell it not to), SpiderOak’s setup is a bit more involved. First, you need to set up a back-up. You can choose multiple folders and even specific types of files. After you’ve done this, you can sync the folders across your devices. Finally, access from the web and mobile interfaces is read-only; you can only upload files from the desktop client.

Sharing and Collaboration: SpiderOak provides ShareRooms, which allow you to selectively share folders with anyone (not just other SpiderOak users), but the files are read-only. It also allows sharing of a single file, but this is read-only as well. The sharing is more secure: the ShareRoom is accessed through a unique URL and a RoomKey (password) must be entered, but there is no mechanism for collaborative editing.

Organizing: Other than the traditional hierarchical file system structure, SpiderOak does not have any built-in organizational features.

Exporting: Files can easily be exported. Simply de-select the folders or files in question from the syncing and back-up.

Backups and Versioning: This is one area where SpiderOak does well. It saves all historical versions of a file and does extensive de-duplication, so only the parts that are different are saved, not the entire file.
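A minimal Python sketch of hash-based block de-duplication, the general technique behind storing only the changed parts of a file (SpiderOak's actual scheme is proprietary; the block size and data here are made up for illustration):

```python
import hashlib

def dedup_store(data: bytes, store: dict, block_size: int = 4) -> list:
    """Split data into fixed-size blocks and file each block under its
    content hash; blocks already seen in any version are not stored again.
    Returns the list of block hashes that reconstructs this version."""
    refs = []
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)   # store only unseen blocks
        refs.append(digest)
    return refs

store = {}
v1 = dedup_store(b"AAAABBBBCCCC", store)
v2 = dedup_store(b"AAAABBBBDDDD", store)  # edited file: one new block stored
```

After both versions are saved, the store holds four unique blocks rather than six, since the two versions share their first two blocks.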

Security: SpiderOak is, as Ars Technica puts it, “Dropbox for the security obsessive.” Its main selling point is not that it’s cloud storage, but that it is secure cloud storage. Unlike the other major cloud storage services, SpiderOak employees cannot access your files. Both Dropbox and SpiderOak encrypt their data, but SpiderOak also encrypts the decryption key. The downside to SpiderOak’s superior security is that if you forget your password, your files are gone.


Popular Economic Paper Criticized for Undocumented Errors

A new review of an influential research article on fiscal austerity and GDP finds that the results were tainted in part by an undocumented error in the authors’ Excel dataset. The original paper by Carmen Reinhart and Ken Rogoff, titled “Growth in a Time of Debt,” claimed that economic growth slowed quite dramatically for countries whose public debt crossed a threshold of 90% of Gross Domestic Product. Since its publication, this finding has often been cited in stimulus/austerity debates, but many economists were unable to replicate it, in part because of the authors’ reluctance to share their original data.

The authors of the new review were able to obtain the original data and found a number of problems in the analysis, which are well summarized in this blog post. This episode stands as a cautionary tale about proper data management and open access; these issues are finally being recognized as critical to the integrity of science.

Case Study: Box

I recently sat down with Breanne Litts, a doctoral candidate in Digital Media, Curriculum & Instruction, who has been using Box for file storage and collaboration for her research on learning in makerspaces.

Project needs:
The research project, Learning in the Making: Studying and Designing Makerspaces, is funded by the National Science Foundation.  Breanne and her advisor are collaborating with co-investigators from George Mason University and the Children’s Museum Pittsburgh.  Box appealed to them as a tool for file storage, sharing, and collaboration because it was free and supported cross-institutional collaboration.
The group is conducting ethnographic research at makerspaces in Madison, Detroit, and along the east coast, with the goal of designing activities for the Makeshop in Pittsburgh.  They are conducting interviews and generating video and large audio files, as well as meeting notes, and other documentation related to the research.  They also do brainstorming and initial analysis in Box.  There are eight individuals working on this project, including undergraduate students, so another requirement for their data management tool was the ability to grant differential access privileges.  They organize files using Box’s folder system and have a main folder, a public folder, a private folder in which their sensitive data is stored, and each research site has its own folder.

Favorite features:
Storage and sharing – The group creates Word documents and Google Docs right in Box and appreciates the ability to lock open files to prevent conflicting copies.  This feature is also available on the mobile app.  The previews for documents, audio, and photos are “fantastic”, and the folder system for organization, tagging capability, and search feature are helpful.  Breanne expressed the opinion that the 50 GB of free storage that UW affiliates have access to will be a huge draw for graduate students.
Security – Box makes it easy to comply with IRB requirements regarding access to sensitive information.  In fact, the biggest attraction of Box was that it meets NSF and IRB standards for secure data management.  The ability to create, open, edit, and save directly to Box and not on your machine adds to this security.
Permissions – It’s simple to manage permissions of each individual file, unlike other project management tools the group looked into, which required users to go through an administrator.
Collaboration – Comments, tasks, and discussion features facilitate cross-institution, cross-country collaboration, making it easy to communicate while minimizing the need to email.  The group also found it easy to control email notifications to avoid being overwhelmed, compared to other project management tools.  The ability to link directly to files and folders is very convenient, as is the ability to track changes and revert to previous versions.
Overall, Breanne felt that it was easy to get started with Box.  There’s a low barrier to entry: you can start getting things done without using its full functionality and without being overwhelmed.  In contrast, other tools the group considered require too many decisions to set up, as well as meetings with an administrator.  Box offers collaborative teams autonomy, flexibility, and adaptability.
She’s found it to be a great tool for project and data management and collaboration and described it as “Facebook, Dropbox, and a project management tool in one!”  She feels that it does data management, as well as day-to-day project management, better than other tools.