The Holz Featured Researcher series invites UW-Madison researchers to discuss their work and illuminate their workflows, insights, and challenges when it comes to working with their research data.
Eric Hoyt is an Associate Professor in the Department of Communication Arts at the University of Wisconsin-Madison, specializing in media and cultural studies and film. He is the author of Hollywood Vault: Film Libraries before Home Video, a book about Hollywood studio content libraries, and the Director of the Media History Digital Library. His current book project is on the history of Hollywood trade papers.
Research Data Services: Tell us a little bit about the research you do here at UW-Madison. What types of data do you interact with for your work?
Eric Hoyt: I am a media historian and a digital humanist, and sometimes those two kinds of work go in their own directions. But if you picture them as circles in a Venn diagram, there's a place where they come together. On the media history side, I've done quite a bit of research into the history of the American film and broadcasting industries.
I wrote a book called Hollywood Vault, which is about the history of the content library: how the collections of movies that Warner Bros., Universal, and the other big studios held under copyright became valuable over time. But then there's the DH side of the work that I do. I've been involved in a number of projects, all connected to media in one way or another. I work on the PodcastRE Preservation Project with my colleague Jeremy Morris. And I've been working on a digital project about the history of educational broadcasting and public broadcasting with Stephanie Sapienza at the University of Maryland.

But the project that I've worked on the longest, and that I'm best known for in my field, is the Media History Digital Library, where for basically the last ten years we've been scanning trade papers, magazines, and books from the histories of broadcasting, film, and recorded sound and making them widely accessible for free on the web. We're up to about two and a half million pages of scanned material. We've never had a single copyright issue, because we always do our homework about checking what is still protected by copyright and what's in the public domain. And we've also entered into licensing agreements with people who are willing to very generously share work that they control the rights to.
So that's been a really gratifying project, because we are sharing resources that film and broadcasting historians used to spend long hours hunched over microfilm stations to look at. Meanwhile, there are other really interesting trade papers and publications that researchers were never using at all, because they're very rare and oftentimes were never put on microfilm in the first place. So it's been great both to take well-known periodicals like Variety and Photoplay and make those freely accessible, and to introduce researchers to fascinating sources like the Film Mercury and Film Spectator, magazines that had really interesting editorial voices but never became known within the discipline. Doing that digitization work for the Media History Digital Library is one way that I make this history available. But I'm also right now completing a book about the history of Hollywood trade papers. This is an example of where the DH work comes back to inform the more traditional media history scholarship that I do. My big projects at the moment are finishing that book and looking at why there were so many of these trade papers competing against each other at the same time, far more than in other American industries.
The second project has just been trying to make the Media History Digital Library better in multiple ways. We were very fortunate to receive an ACLS Digital Extension Grant last year. My co-PI, Kelley Conway, and I are working on this grant-funded project, where our goals are to globalize and then enhance the Media History Digital Library. So, we're working with a team of experts and archives all over the world to digitize non-English-language publications and add those to the collection. And we're also working on improving the database and user interface. Early on, we were able to get a lot done quickly by not always following best practices around how you construct a database or enter the right metadata the first time around. There were tradeoffs at a very technical level with some of those decisions. Now we're coming back around to improve some of those more foundational things.
RDS: What are the sources of this data and how do you acquire it? Can you briefly describe your workflows for ingesting, managing, and processing your research data?
EH: It can come from different places. We really specialize in telling the history of media from the written record. With the Media History Digital Library, what we are scanning are not the recordings or the films themselves; we are working from the paper. We do books, but it's especially magazines that we tend to work with. So, the first step is that you have to find an institution that has the material. Actually, I'll pause here to say that I think one thing we did really well, a philosophy of the Media History Digital Library that has guided our process ever since, is that we've really put the researcher and user first and then worked backwards from there. Some institutions, archives, and libraries understandably begin from the standpoint of, 'Well, what's in our collection? What do we have?' And oftentimes they have amazing things. Really remarkable things. But a lot of the work becomes outreach work from there. In our case, it was, 'Well, what do researchers in our field need? What do they already know they want more of, and how can we give them even more?' That led to a very collaborative model, which started out with borrowing big bound volumes of Photoplay and Film Daily from a private collector living in the San Fernando Valley.
I was a grad student at USC back then, about ten years ago. I would drive to his apartment in the San Fernando Valley, pack these up into boxes, transport them to the Internet Archive's scanning center, and then take them back. We did a lot of early work there. Once we had shown that we could get things done and do good work, that's when it became easier to get the more established libraries and archives involved. I have to give big credit here to the founder of the project, David Pierce. He stepped down from it two and a half years ago, at the start of 2018. He is now in a high-ranking position within the Library of Congress's Motion Picture, Broadcasting and Recorded Sound Division, so he had to focus on his work there. But David has been part of the field for a long time, and it was his relationships and contacts, including with that collector I mentioned, as well as with places like the Museum of Modern Art, the Library of Congress, and the Pacific Film Archive, that led to us being able to borrow these magazines and have them scanned.
So, that's where the model for acquiring and creating the data started. We would borrow materials from a collection, whether a private collector or an institution, and then ship them to an Internet Archive scanning center. The Internet Archive also preserves the files and creates derivative files. So, we started with this Internet Archive collection. They do a lot of things really well, especially on the back end, I think. But their user interface, especially back then, was pretty confusing. People would get really lost when we tried to point them to our items. So, a lot of my work and involvement in the project, especially from 2011 to 2014 or so, was developing better entry points and interfaces for our users and researchers to work with.
The first attempt at this was a WordPress site that was really just hyperlinks, with some elegant JavaScript, accordion-style lists that people could open and close. But it was really just hypertext pointing people to these magazines, and that worked OK if you were familiar with the difference between Film Daily and Photoplay, if you were an obsessive film nerd who knew the year every movie was made. Most of our researchers just weren't. They were interested in Cary Grant, or in the history of race and vaudeville, or other questions like that. They wanted to run full-text search queries. That's where Lantern came in. We developed a search index using a lot of the same software and technology that libraries use: Apache Solr as the search index, along with the Blacklight app built on Ruby on Rails, and then tailored things to our needs. Most of what I do when it comes to developing any kind of web-based software platform is taking open source stuff, messing around with it, and using it for my own purposes. Lantern became a way for people to run full-text search queries. This is where the data model gets a little messy, because you're also still hard-coding HTML links into the WordPress website.
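To give a concrete sense of what running a full-text query against a Solr index like Lantern's involves, here is a minimal sketch in Python. The core name ("lantern") and the field names ("body", "title", "year") are illustrative assumptions for this example, not the project's actual schema.

```python
import requests

# Query a local Apache Solr core's standard select handler.
# Core name and field names are hypothetical, for illustration only.
SOLR_URL = "http://localhost:8983/solr/lantern/select"

params = {
    "q": 'body:"Cary Grant"',  # phrase search over the OCR'd page text
    "fl": "id,title,year",     # fields to return for each hit
    "rows": 10,                # number of results to fetch
    "wt": "json",              # ask Solr for a JSON response
}

response = requests.get(SOLR_URL, params=params)
response.raise_for_status()

# Solr wraps matching documents in response -> docs.
for doc in response.json()["response"]["docs"]:
    print(doc.get("title"), doc.get("year"), doc.get("id"))
```

In a Blacklight application, queries like this are issued by the Rails app on the user's behalf, so researchers only ever see a search box, not the Solr parameters.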
Plus, all of it was pointing back to the Internet Archive. Traditionally, you'd have your own database, you'd update the database, and then the Solr search index would periodically refresh from it. In our case, it was more decentralized: anytime we wanted to add something, we would personally have to run a script, and oftentimes edit metadata within the Python script itself. So, like I mentioned before, it was a way to get a lot done quickly with low overhead. But as the project grew, it became a lot less efficient. We've been able to scan things from lots of different sources. And recently, I've been scanning things myself by hiring grad students to do the scanning here on campus, because now we have that infrastructure. But we still use the Internet Archive as the repository for the data, because when we post something there, not only does a copy get preserved, but they generate the access files, like the JPEG 2000s, and they run ABBYY FineReader for OCR. They do all the derivative file work.
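A hedged sketch of the kind of one-off ingest script described above: pull an item's metadata from the Internet Archive's public metadata API and post a document into a local Solr index. The item identifier and the Solr field names here are hypothetical, chosen only to show the shape of the workflow.

```python
import requests

# Hypothetical Internet Archive item identifier, for illustration only.
ITEM_ID = "example-item"

# Local Solr update endpoint; commit=true makes the change visible at once.
SOLR_UPDATE_URL = "http://localhost:8983/solr/lantern/update?commit=true"

# The Internet Archive's public metadata API returns JSON for any item.
meta = requests.get(f"https://archive.org/metadata/{ITEM_ID}").json().get("metadata", {})

doc = {
    "id": ITEM_ID,
    "title": meta.get("title"),
    "year": meta.get("year"),
    # In the workflow described above, metadata corrections often lived
    # here, hard-coded in the script, rather than in a central database.
}
doc = {k: v for k, v in doc.items() if v is not None}  # drop missing fields

# Solr's update handler accepts a JSON array of documents.
resp = requests.post(SOLR_UPDATE_URL, json=[doc])
resp.raise_for_status()
```

Having to rerun and hand-edit a script like this for every addition is exactly the low-overhead-but-unscalable tradeoff Hoyt describes.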
The interview has been lightly edited for content and clarity.