The Rebecca J. Holz series in Research Data Management is a monthly lecture series hosted during the spring and fall academic semesters. Research Data Services invites speakers from a variety of disciplines to talk about their research or involvement with data.

On March 9, 2016, Alex Hanna, a PhD candidate in the Department of Sociology at the University of Wisconsin – Madison, gave a talk titled “Data Pipelines and Computational Methods for the Social Sciences”. You can find her slides on the Research Data Services Speakerdeck page.

The talk covered three key areas of Hanna’s work. The first was ‘Twitter and Politics’, which described the work Hanna does with the Social Media and Democracy research group at UW-Madison. The research group maintains a Twitter archive that currently contains over 50 billion tweets and continues to download around one percent of all tweets produced each day. With the collected data, the group can study tweets in relation to political events. For example, they can examine how users respond to the way candidates sound and look during debates, and then map mentions of the candidates onto the key speaking points of those debates. Hanna then discussed the hardware and software changes made to the archive as it has grown, and how those changes have enabled the group to process the data more quickly.

The second part of the talk covered protest event data, a subject that forms the crux of Hanna’s dissertation. Protest event data is extracted from information reported in news articles, with a focus on newspaper articles because of newspapers’ publishing consistency and role in the historical record. Collecting event data used to be labor-intensive and costly: articles had to be collected, filtered, coded by hand against a codebook, and then converted into usable data. Hanna’s dissertation has focused on creating the Machine-learning Protest Event Data System (MPEDS), which improves the process by automating it with limited human intervention. This portion of the talk focused on the changes to the process, which have allowed for work on a larger scale while also providing better searching and indexing upon insertion.

In the final portion of the talk, Hanna discussed “Computational Social Science Education”. Hanna seeks to help new and veteran scientists expand their computational literacy beyond the traditional social science tools of SPSS and Stata by teaching languages and tools such as R, Python, Hadoop, and the command line. These tools give social scientists new and more flexible ways to carry out their research, such as web scraping, large-scale network analysis, and automated text analysis. Hanna covered the pedagogical approaches, workshop examples, and lessons learned from teaching these tools.