Written by Laura Schmidt

Documenting DH is a project from the Digital Humanities Research Network (DHRN). It consists of a series of audio interviews with various humanities scholars and students around the University of Wisconsin-Madison campus. Each interviewee is given a chance to talk about how they view data, work with data, manage data, or teach data to others. Most recently, we interviewed Shanan Peters, Jon Husson, and Aimee Glassel of GeoDeepDive, a project that builds a scalable, dependable cyberinfrastructure to facilitate new approaches to the discovery, acquisition, utilization, and citation of data and knowledge in the published literature. Their interview is now accessible on the DHRN website.

How would you define the data involved in this project?

For Shanan, the core data of the GeoDeepDive project comes down to metadata: bibliographic metadata, citation metadata, the link back to the original source, and all of the other metadata that goes along with a published document. The other part of the data is the original document itself, usually a PDF (the format is far-reaching, and a lot of old documents have been converted into PDF form, for better or worse). Beyond that, the data that's really interesting is the parsed and annotated PDFs. These documents have been passed through software that labels the terms within them using a generic named entity recognizer, and they are often used with other tools that extract information (within the terms of the publishers' agreements) to create something entirely different. In short: the data are diverse!
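To make the annotation idea concrete, here is a minimal, purely illustrative sketch of dictionary-based term tagging in Python. The glossary, function name, and example sentence are all hypothetical; GeoDeepDive's actual pipeline uses its own entity-recognition tooling, which is not shown here.

```python
# Toy dictionary-based term tagger, standing in for the kind of term
# labeling described above. Purely illustrative; not GeoDeepDive's code.
import re

# Hypothetical glossary of domain terms (lowercased for matching)
GLOSSARY = {"stromatolite", "ordovician", "carbonate"}

def tag_terms(text):
    """Return (term, start, end) tuples for glossary terms found in text."""
    hits = []
    for match in re.finditer(r"[A-Za-z]+", text):
        if match.group().lower() in GLOSSARY:
            hits.append((match.group(), match.start(), match.end()))
    return hits

sentence = "Stromatolite beds occur in Ordovician carbonate strata."
print(tag_terms(sentence))
# prints [('Stromatolite', 0, 12), ('Ordovician', 27, 37), ('carbonate', 38, 47)]
```

A real recognizer would of course go far beyond exact dictionary lookup, but the output shape — terms anchored to character offsets in the source document — is the essential idea.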


How might you recommend managing data for researchers in general?

Shanan recommends data management in two different ways. The first focuses on researchers who are producing data from the field or from their own research. There is an active conversation at funding agencies, because there’s a mandate to archive data and make it available. People have spent a lot of time building databases and manually populating data for different repositories and Shanan supports this, but suggests one major thing: Publish your data.

“Completely and well,” he reiterates. “Somewhere along the line we got out of the habit of doing this because of the old cost of print; there was an impetus not to publish full data sets, because it was expensive to publish long tables or whatever.” In the digital age, this data can be accommodated in online appendices, but that's not Shanan's ideal. He argues that the best way to publish your data is to publish it in the old-fashioned monograph sense. “Make it available in a curated collection, like GeoDeepDive or some other online resource. That's a great way to manage data, because with machine reading and machine learning approaches, we can actually do a lot with creating structured, specifically tailored databases.” Shanan also believes that the data should be complete and verbally described. The necessary, boring metadata needs to go along with your work, because machines will eventually be able to read your work properly.


Is there a particular kind of data or particular data challenge that you’ve run into in your project?

Jon was eager to talk about his data challenges, because they echo issues most people have with data: confusing yet day-to-day problems. His greatest challenge was that his stromatolite work was the first application to be developed on the platform. “The first person to take a stab at this was actually an undergraduate intern, Julia, someone I knew from my time at Princeton. She built the first-generation app designed to look for stromatolites and quantify their spatial and temporal distribution across history. I picked up her app and developed it further.” However, Jon wasn't used to the data and how it was processed. Quirks kept popping up: columns that sat too close together would produce gobbledygook data, and he had never come across ligatures before starting on the project, so his Python script wasn't ready to handle these annoyances. Still, Jon was able to massage the data on both the infrastructure side and the application side to make his tasks work. “It sort of went on and on and on. It was all about getting used to how the data was represented in our particular infrastructure and how to work with that.”
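The ligature problem Jon describes is a common hazard of text extracted from PDFs, and one standard remedy is Unicode compatibility normalization. The sketch below is a generic illustration, not Jon's actual script; the function name is invented for this example.

```python
# Minimal sketch of cleaning typographic ligatures out of extracted PDF text
# (not Jon's actual script). NFKC compatibility normalization expands
# ligature code points such as U+FB01 ("fi") into plain letters, so that
# ordinary string matching works on the result.
import unicodedata

def clean_extracted_text(text):
    """Expand ligatures (e.g. 'ﬁ' -> 'fi') in text pulled from a PDF."""
    return unicodedata.normalize("NFKC", text)

print(clean_extracted_text("strati\ufb01ed deposits"))  # prints "stratified deposits"
```

This handles only the ligature half of the problem; garbled multi-column layouts generally require fixes on the extraction side rather than post-processing.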


So what do you find most interesting and exciting about the data that you work with?

“Having access to three and a half million published documents has never been possible before,” Jon says, clearly enthusiastic. “To continue with the stromatolite example, I've been hearing about stromatolites since I was an undergraduate, and people have long held ideas about how stromatolites change in their diversity and abundance through time. However, until GeoDeepDive came along, there weren't tools available to test some of these ideas that had been qualitatively expressed to me. GeoDeepDive offered me the opportunity to test those ideas directly.”

Beyond his excitement that the project's ability to process documents will continue to evolve at a rapid pace, Shanan is excited about the lack of silos within GeoDeepDive in terms of discipline or knowledge. “It will read the literature and it will discover information across disciplinary boundaries and expose it to you in your application.” They have a lot of work to do, but, down the road, GeoDeepDive won't just be a tool that people use to get information from the literature. It will be a tool that people use to discover how their knowledge is connected to other domains of knowledge.


How do you see the GeoDeepDive project fitting into the Digital Humanities community on campus?

“GeoDeepDive is a horrible name that doesn't reveal what we do at all.” They admit the name is a holdover from the early days of the project, when they were all invested in the Geo side of things. The project, however, is completely agnostic with respect to discipline. “Any group on campus that leverages information within published documents can fit comfortably with the GeoDeepDive digital library and computing infrastructure.” Shanan believes it applies very broadly to anyone interested in extracting, synthesizing, and analyzing information in published texts, and he encourages them to “have at it!”