Guides, Tutorials, and Courses for Learning About Data Management

by Cid Freitag, Instructional Technology Program Manager at DoIT Academic Technology


If the data you need still exists;
If you found the data you need;
If you understand the data you found;
If you trust the data you understand;
If you can use the data you trust;
Someone did a good job of data management.

Rex Sanders, USGS Santa Cruz*

Data management practices have been described in detail in a variety of documentation and tutorials, which often focus on the specific needs and resources of the organization that produced them. The following is a selected list of resources general enough to apply across disciplines and beyond the university or agency that developed them.

Guides and Tutorials

Data Science MOOCs

Several Massive Open Online Courses (MOOCs) cover topics related to data analysis and research methods. Even if you choose not to do the coursework and earn a statement of completion, it is easy to sign up for a course, which gives you access to the lectures and examples.

The Class Central website has curated a list of several data science and analysis methods MOOCs, developed by reputable sources.

The MOOCs listed here were developed at Johns Hopkins University and are offered through the Coursera platform. They are part of a Data Science Specialization series of courses, and they apply to data management practices beyond specific analytical techniques. Each course lasts four weeks and is offered frequently; at the time of writing, a new offering of each course starts every month from March through June 2015.

The Data Scientist’s Toolbox, Jeff Leek, Roger Peng, Brian Caffo

“The course gives an overview of the data, questions, and tools that data analysts and data scientists work with.” It focuses on a practical introduction to tools: version control, markdown, git, GitHub, R, and RStudio.

Getting and Cleaning Data, Jeff Leek, Roger Peng, Brian Caffo

“This course will cover the basic ways that data can be obtained… It will also cover the basics of data cleaning and how to make data ‘tidy’… The course will also cover the components of a complete data set including raw data, processing instructions, codebooks, and processed data. The course will cover the basics needed for collecting, cleaning, and sharing data.” Tools used in this course: GitHub, R, RStudio

Reproducible Research, Jeff Leek, Roger Peng, Brian Caffo

“Reproducible research is the idea that data analyses, and more generally, scientific claims, are published with their data and software code so that others may verify the findings and build upon them…This course will focus on literate statistical analysis tools which allow one to publish data analyses in a single document that allows others to easily execute the same analysis to obtain the same results.” Tools: R markdown, knitr


*Rex Sanders quote from: Environmental Data Management: Challenges and Opportunities, Jamie Gerrard, March 2014

 

Looking for additional information about research data management? Contact us.

Data Visualization: Choosing Tools and Workflows Across the Research Process

Introduction:

Data visualization can serve as a complement to statistics and as a part of your research process from analysis through publication. Visualization engages the human eye-brain system and can help a viewer see relationships, patterns, and outliers in the data.

Data visualization as a broad term can refer to anything from a small bar graph with a few values to an elaborate poster-like display that integrates multiple graphs, maps, photographs, short annotations, and longer text.

The variety of tools and types of visualizations align to varying degrees with data analysis tools. When choosing a tool and a workflow, a model developed in cartography can help connect the purpose of a visualization with your design and communication needs. Although the model was developed for mapping, it applies to other disciplines as well, as a way to consider the audience and purpose of a visualization and to inform tool and workflow choices.

Model:

This model, proposed by DiBiase (1990), presents a research process with four stages:

  1. Exploration of data to reveal pertinent questions
  2. Confirmation of apparent relationships in the data in light of a formal hypothesis
  3. Synthesis or generalization of findings
  4. Presentation of the research at professional conferences and in scholarly publications


DiBiase Model: Visual Thinking/Private Realm

The visual thinking tools and methods can change as your research stages change. During the early stages of data analysis, visualizations may complement statistical methods and help you explore the data, looking for patterns or outliers. You might not show these initial visualizations to anyone else, nor will they all result in meaningful insights.

The early stages are typically done privately, by an individual or a small team of experts deeply involved with the research subject. At this stage, visualization tools should let you work efficiently, generating multiple visualizations with repeatable, documentable methods. Visual design elements, such as colors and graphic symbols, the types of visualizations, and the levels of detail, can be chosen to help you identify patterns, similarities, and outliers. The audience is an individual researcher or small team familiar with the data; the visualizations are not intended for a broader audience.

DiBiase Model: Visual Communication/Public Realm

As the research progresses, the work shifts as you begin to communicate ideas and results to colleagues and peers, and eventually to a broader public.

As your audience widens, the visualizations change to serve as a tool for communicating beyond the research team and possibly to an audience with less expertise in the field. Visualizations that were clear to experts might not be understood by a broader audience without the depth of knowledge or interest in the subject.

Graphic design elements become more important to help you use your visualizations to communicate your research results to an external audience. Choices of chart type, level of detail, color, symbols, typography, labels and annotation can make a difference in the clarity of communications.

A Simple Example

This simple example illustrates the distinction between an exploratory graph and a communication graph, along with potential tools and one example workflow.

Exploratory Graph:

[Figure: exploratory graph of CO observations]

This graph was produced in R, a language and environment that offers a wide range of statistical and graphical techniques. R is free software; its strengths include its data handling, a full programming language, and the ability for users to define new functions. R is extensible through packages, many of which provide specialized functions for a variety of domains. One strength of R for the roles of visualization in a research process is its ability to generate individual or multiple graphs through scripts, which then serve as documentation of the data handling and visualization process.
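As a sketch of this scripted approach (the data and file names here are invented for illustration, not the data behind the graph above), a few lines of base R can both generate an exploratory scatterplot and record exactly how it was made:

```r
# Hypothetical exploratory-graph script: the data frame and column
# names are illustrative, not from the example figure.
obs <- data.frame(
  year = 2000:2010,
  co   = c(410, 398, 385, 372, 360, 355, 340, 338, 330, 322, 315)
)

# Write the plot to a file using base-R defaults. Quick to produce,
# and rerunning the script reproduces the graph exactly.
pdf("co_exploratory.pdf", width = 6, height = 4)
plot(obs$year, obs$co,
     xlab = "Year", ylab = "CO (ppb)",
     main = "Exploratory: CO observations")
dev.off()
```

Because the script itself documents the data handling and plot settings, it can be kept under version control alongside the data, which is exactly the reproducibility benefit the MOOCs above emphasize.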

When a research project has progressed to the point of showing graphs beyond the researcher(s) closely familiar with the data, the communication value of the graphs can be enhanced by moving beyond R’s default graph functions to packages that offer additional graphing capabilities.

Communication Graph:

[Figure: communication version of the CO observations graph]

Another option for generating public-audience graphs is to export a graph produced directly from the data into software that offers flexibility in design and pre-publication details. The example shown here was created by importing the scatterplot created in R into Adobe Illustrator, a vector-graphics application, and editing its design elements. A strength of illustration software is the flexibility to fine-tune graphic design through wide choices in type, colors, shapes, and annotations, and the ability to reposition design elements. A disadvantage is that a hand-editing process is prone to human error: because the graphic is separated from the data management environment, the editing steps cannot be automatically replicated or easily traced.
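A common hand-off in this workflow, sketched below with simulated data, is to have R write the finished plot in a vector format (PDF or SVG) so that every line, point, and label remains individually editable in illustration software:

```r
# Hypothetical hand-off script: the data are simulated for illustration.
set.seed(1)
x <- rnorm(50)
y <- x + rnorm(50, sd = 0.5)

# pdf() opens a vector-graphics device; illustration software such as
# Adobe Illustrator can open the result with all elements editable.
pdf("scatter_for_editing.pdf", width = 6, height = 4)
plot(x, y, pch = 19, xlab = "Observed", ylab = "Modeled")
dev.off()
# svg("scatter_for_editing.svg") would produce an SVG file the same way,
# on R builds with cairo support.
```

Raster formats such as PNG flatten the graph into pixels; exporting in a vector format preserves the hand-off point between the scripted, reproducible stage and the hand-polished, design stage.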

The tools and methods that are effective for data exploration and analysis might not be the same as those for fine-tuning the visualizations for a public audience. As you work through a research process, considering the purpose and audience for your visualizations may help inform your choices in tools, methods, and efforts spent in polishing the graphic presentation.

DiBiase, David. 1990. Visualization in the Earth Sciences. Department of Geography, The Pennsylvania State University.  http://www.geovista.psu.edu/publications/others/dibiase90/swoopy.html

The R Project for Statistical Computing. http://www.r-project.org

Adobe Illustrator. http://www.adobe.com/products/illustrator.html