Written by Heather Wacha
Documenting DH is a project from the Digital Humanities Research Network (DHRN). It consists of a series of audio interviews with various humanities scholars and students around the University of Wisconsin-Madison campus. Each interviewee is given a chance to talk about how they view data, work with data, manage data, or teach data to others. Our final guest, Rob Howard, Professor of Communication, Religious Studies, and Folklore in the Department of Communication Arts at the University of Wisconsin and Director of the Undergraduate Digital Studies Certificate, talks about his digital scholarship and how using big data alongside traditional approaches helps him understand and present his data more fairly. To hear all the interviews, you can go to the Digital Humanities Research Network website.
How do you manage the data you work with?
Howard’s data consists of large datasets of online forum posts. To give an idea of the scale, one of his projects last year contained more than 11.5 million forum posts. He uses Perl scripts to load the data into a SQL database and then runs sets of queries, producing network graphs in which he can search for anomalies or, as he says, “funny things” that draw his interest.
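To make the load-then-query workflow concrete, here is a minimal sketch under stated assumptions. It is not Howard’s pipeline: the interview mentions Perl and an unnamed SQL database, while this illustration swaps in Python and SQLite. The table layout, the `reply_to` column, and the sample posts are all hypothetical. The query turns replies into weighted author-to-author edges, the kind of edge list a network graph could be drawn from.

```python
import sqlite3

# Hypothetical sample rows: (post_id, author, reply_to_post_id, body).
# Real forum data would be scraped or exported; these rows are invented.
SAMPLE_POSTS = [
    (1, "alice", None, "Opening post"),
    (2, "bob", 1, "Reply to alice"),
    (3, "carol", 1, "Another reply to alice"),
    (4, "alice", 2, "Reply to bob"),
    (5, "bob", 4, "Reply to alice again"),
]


def build_database(posts):
    """Load posts into an in-memory SQLite database (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE posts (
            id INTEGER PRIMARY KEY,
            author TEXT NOT NULL,
            reply_to INTEGER REFERENCES posts(id),
            body TEXT NOT NULL
        )
        """
    )
    conn.executemany("INSERT INTO posts VALUES (?, ?, ?, ?)", posts)
    conn.commit()
    return conn


def reply_edges(conn):
    """Return (replier, original_author, reply_count) rows, heaviest first."""
    query = """
        SELECT child.author, parent.author, COUNT(*) AS weight
        FROM posts AS child
        JOIN posts AS parent ON child.reply_to = parent.id
        GROUP BY child.author, parent.author
        ORDER BY weight DESC, child.author, parent.author
    """
    return conn.execute(query).fetchall()


conn = build_database(SAMPLE_POSTS)
edges = reply_edges(conn)
for source, target, weight in edges:
    print(f"{source} -> {target}: {weight}")
```

An unusually heavy edge, or an author with far more incoming replies than anyone else, would be one example of the “funny things” worth a closer reading.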
Howard works with his students on the larger issues of managing so much data. Using IRB protocols, they consider methods for storing big data such as millions of forum posts, the challenges these data present to someone doing digital research, and different ways of capturing the data in the first place. Capture has to be regular and consistent, since the data do not stay in their online format for very long. In the end, Howard gives credit to his students for doing a better job than he does of finding software or a great database system to manage all the data. For the time being, Howard is fine with his personal management system, the “giant folders” system.
What are some of the challenges you encounter with your data?
One of the main challenges of working with online forum posts is that they are written as pieces of everyday communication, or, as Howard puts it, “speech in the wild.” For one of his recent projects, he wanted to find out what people were thinking and saying about guns, so he went to online gun forums. But the language used in these forums is not consistent or standardized, so running straightforward searches and queries is nearly impossible. Since Howard’s data is natural and organic, and includes multiple spellings, abbreviations, and vocabularies, it is incredibly dirty. For example, in another project, Howard wanted to know why some people chose not to give their children vaccines. Going to an online forum and running a query on just the word “vaccines,” however, does not produce the full picture. The posts sometimes use the word “vaccine,” but more often it is written as “vax” or “vaccs” or even as “vaces.”
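The gap between a literal search and a variant-aware one can be sketched in a few lines. This is an illustration only, not the method used in the project: the pattern covers just the spellings quoted above, and the sample posts are invented. A real variant list would have to be built by someone reading the forum.

```python
import re

# Variants quoted in the text plus the standard singular and plural forms.
# This list is an assumption for illustration; real data would need more.
VACCINE_PATTERN = re.compile(
    r"\b(?:vaccines?|vax|vaccs|vaces)\b",
    re.IGNORECASE,
)


def mentions_vaccine(post):
    """Return True if the post contains any known spelling variant."""
    return VACCINE_PATTERN.search(post) is not None


# Invented sample posts, not taken from any real forum.
SAMPLE_POSTS = [
    "We decided to skip the vax this year.",
    "Are vaccs even tested properly?",
    "Vaccines were required at our school.",
    "My new rifle scope arrived today.",
]

# A literal search for one spelling versus a search over known variants.
naive_hits = [p for p in SAMPLE_POSTS if "vaccines" in p.lower()]
variant_hits = [p for p in SAMPLE_POSTS if mentions_vaccine(p)]

print(f"literal search: {len(naive_hits)} of {len(SAMPLE_POSTS)} posts")
print(f"variant search: {len(variant_hits)} of {len(SAMPLE_POSTS)} posts")
```

The word boundaries (`\b`) keep the pattern from firing inside unrelated words, but they cannot discover spellings nobody has listed yet.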
Trying to run topical analysis on dirty data takes domain expertise, time, and careful attention. When big data is not based on printed material or an institutional document, the analysis becomes much more challenging. While there are some algorithms that can deal with some of the variability, a certain level of expertise is needed to read hundreds of posts and start to identify variants and probables so that the most accurate information can be drawn from the dataset.
What excites you about the data you’ve been working with recently?
Although capturing and analyzing everyday communications poses a challenge, Howard gets excited about his data because applying big data methodologies allows him to see, think about, and ask new questions. Howard described how, before using computational analysis, he might read something, think about it and its genre, and then take some notes. After that, he might have some interesting questions to ask about what he had just read. But as soon as he moves beyond looking at a particular piece of reading as a communication piece and begins to look at it as data, then it, and others like it, can be represented in a different way, by numbers. This allows him to see other things: not necessarily better things, but other things.
Howard considers the work he does to be about people: “people do stuff,” and he is interested in knowing why they do it and how it makes them feel and act. But when he thinks instead about people’s posts as data, he is no longer thinking about the people; he is thinking about the structures and the ways one can count things. What makes his data exciting is that he can “more honestly or more fairly get a perspective on the data that [he] could not get before.” Taking it a step further, Howard says that “if we are curious about what people are doing, and if we choose not to use these techniques, then we are not doing as good or as fair of a job as possible.” It takes the combination of both traditional and digital methodologies to produce the most accurate picture. In Howard’s words: “If you really want to be fair and then choose not to take up a lens or pick up a tool then you are not being the fairest you can be with that infinitely complex thing that humans do.”
If you are interested in hearing more, you can go to Rob Howard’s interview on the DHRN webpage where Howard talks more about his research and scholarship.