Written by Laura Schmidt
Documenting DH is a project from the Digital Humanities Research Network (DHRN). It consists of a series of audio interviews with various humanities scholars and students around the University of Wisconsin-Madison campus. Each interviewee is given a chance to talk about how they view data, work with data, manage data, or teach data to others. Most recently, we interviewed Christina Koch, the Research Computing Facilitator at the University of Wisconsin’s Center for High Throughput Computing. She works mainly with scientists and other scholars whose projects involve large-scale computation. Her interview is now accessible on the DHRN website.
First things first, what kind of humanities projects would your Center be interested in or could contribute to?
Christina’s center provides computing, so the best fit is a project that uses computing in some way as part of the analysis or study, particularly at a large scale. If someone had a script or program that was analyzing text data and they had a lot, a lot, a lot of text data, that would possibly be a good fit, because analyzing thousands of documents on your own computer is going to take a long time. Truly, for anything where you’re somehow managing, analyzing, changing, and creating things using data, the Center can help.
How do you think data differs in different disciplines and fields?
Christina’s impression from her conversations with digital humanists is that the data they’re working with is in-between in some ways. There is both text and image data, but the challenge is that it’s not as structured as the data in other fields. With sequence data, for example, there are limits on what it is supposed to look like (strings of letters), and there are programs that check for and remove things that shouldn’t be there. That makes it comparatively simple to deal with.
Though she is not as familiar with DH projects, the challenge she sees in them (they range from images you want to OCR or analyze, to image recognition through machine learning, to working with digitized texts) is dealing with the data’s lack of structure. Heather, our interviewer, couldn’t help but chime in: “It’s messy! Messy and fuzzy data!” Christina agrees and argues that analysis is dealing with the messiness.
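To make the contrast concrete, here is a minimal sketch of the kind of check described above. It is an illustration only: the alphabet (A, C, G, T, plus N for an unknown base) and the function names are assumptions, not something from the interview or from any particular sequencing tool.

```python
import re

# Assumption: a DNA alphabet of A, C, G, T, plus N for an unknown base.
# Anything outside that set counts as something that "shouldn't be there".
INVALID_CHARS = re.compile(r"[^ACGTN]")


def clean_sequence(raw: str) -> str:
    """Uppercase the sequence and drop every character outside the assumed alphabet."""
    return INVALID_CHARS.sub("", raw.upper())


def is_valid_sequence(raw: str) -> bool:
    """Return True if the sequence is non-empty and uses only the assumed alphabet."""
    return bool(raw) and INVALID_CHARS.search(raw.upper()) is None
```

A rule this short is only possible because the data has a strict shape. A scanned letter or a photograph offers no comparable list of allowed characters to check against.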
Christina admits that she doesn’t have the more detailed subject-specific knowledge that would be useful for specific data types, but she really enjoys teaching people data management. Data in general is hard, and we often don’t have any data skills. Even as a math major, Christina was never taught data management. Even basic data management is something she had to learn on the ground: “I’ve had the experience of, ‘Oh, so if I want to analyze the number of people who come to our office hours, we have to track that! We have to write that down! I should probably have a spreadsheet with that kind of information.’” We all laughed at these first steps that Christina remembered, particularly because she’s taught workshops on basic data management skills through the Data Carpentry organization, a non-profit that teaches data skills to researchers.
One of the biggest things she can recommend, particularly for humanists who deal with messy data, is taking the time to create metadata. It’s a tool that you can leverage to make things easier for yourself. She admits that it’s boring and unsexy, but there is power in metadata, even in just naming your files correctly so you can go back to them. Even in Christina’s own workflows, which don’t involve doing research, she says that by working a little up front, making a plan for her organizational system, and putting in an ounce of effort to stick to it, she saves herself a lot of time. “It’s very basic, but very powerful.”
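As one way to picture “naming your files correctly,” here is a small sketch of a naming plan written down as code. The convention itself (date, project, description, version) and the helper names are hypothetical, chosen for illustration; Christina did not describe a specific scheme.

```python
import re
from datetime import date

# Hypothetical convention: YYYY-MM-DD_project_description_vNN.ext
NAME_PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})"
    r"_(?P<project>[a-z0-9-]+)"
    r"_(?P<description>[a-z0-9-]+)"
    r"_v(?P<version>\d{2})"
    r"\.(?P<ext>[a-z0-9]+)$"
)


def slug(text):
    """Lowercase the text and turn every run of other characters into one hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def build_name(day, project, description, version, ext):
    """Assemble a file name that follows the hypothetical convention."""
    return f"{day.isoformat()}_{slug(project)}_{slug(description)}_v{version:02d}.{ext.lower()}"


def parse_name(filename):
    """Return the metadata stored in a file name, or None if it breaks the convention."""
    match = NAME_PATTERN.match(filename)
    return match.groupdict() if match else None
```

For example, `build_name(date(2019, 3, 5), "Office Hours", "Attendance log", 1, "csv")` gives `2019-03-05_office-hours_attendance-log_v01.csv`, and `parse_name` recovers the date, project, description, and version from that name later. A file called `notes final FINAL.docx` returns `None`, which is a quick way to spot files that have drifted from the plan.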
So what do you find most interesting and exciting about the data you work with?
On one hand, Christina notes, data can feel like a huge burden, because it can be really big and unwieldy. There are piles and piles of data, particularly in the life sciences, where sequencing has become cheap, but people don’t know what to do with it. It’s new, and the methods, the tools, and how you’re supposed to deal with it haven’t been taught. Christina argues that lab protocols are clear, because they tell you how to do x, y, and z, but data isn’t talked about enough. There is a constant sea of questions: How do you archive it? How many copies do you keep? Do you keep only the raw data, or the analyzed data as well? It overwhelms everyone.
On the other hand, Christina feels that the most exciting thing about the current data explosion is the messiness. Twenty or thirty years ago, analyzing even a terabyte of any kind of data would have been next to impossible unless you were at a very, very fancy computer center; the growth of computing has changed that. The possibilities of doing new research and looking for patterns on a larger scale are what Christina loves. The exploratory find-what’s-out-there thing is really, really interesting to her. There are things out there that you wouldn’t be able to learn without computing.