An Introduction to Web Scraping for Research

Like web archiving, web scraping is a process for collecting data from websites and saving it for further research or preserving it over time. Also like web archiving, web scraping can be done through manual selection or through the automated crawling of web pages using pre-programmed scraping applications.

Unlike web archiving, which is designed to preserve the look and feel of websites, web scraping is mostly used for gathering textual data. Most web scraping tools also allow you to structure the data as you collect it. So, instead of massive unstructured text files, you can transform your scraped data into spreadsheet, CSV, or database formats that allow you to analyze and use it in your research, as in the sketch below.
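
To make that concrete, here is a minimal Python sketch of scraping-to-CSV, assuming the third-party requests and beautifulsoup4 packages are installed. The URL and the CSS selectors (div.article, h2, span.date) are hypothetical placeholders you would replace to match the markup of your own target site.

import csv

import requests
from bs4 import BeautifulSoup

URL = "https://example.com/articles"  # placeholder: your target page

response = requests.get(URL, timeout=30)
response.raise_for_status()  # stop early if the request failed

soup = BeautifulSoup(response.text, "html.parser")

# Hypothetical selectors: adjust to match the site you are studying.
rows = []
for item in soup.select("div.article"):
    title = item.select_one("h2")
    date = item.select_one("span.date")
    rows.append({
        "title": title.get_text(strip=True) if title else "",
        "date": date.get_text(strip=True) if date else "",
    })

# Write the structured results to a CSV file for later analysis.
with open("scraped_articles.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "date"])
    writer.writeheader()
    writer.writerows(rows)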

There are many applications for web scraping. Companies use it for market and pricing research, weather services use it to track weather information, and real estate companies harvest data on properties. Researchers also use web scraping to study web forums and social media platforms such as Twitter and Facebook, to gather large collections of data or documents published on the web, and to monitor changes to web pages over time (a minimal monitoring sketch follows this paragraph). If you are interested in identifying, collecting, and preserving textual data that exists online, there is almost certainly a scraping tool that can fit your research needs.
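
As promised above, here is a minimal change-monitoring sketch in Python, again assuming the requests package. It simply fingerprints the page text with a hash and compares it to the hash saved on the previous run; the URL and the state-file name are placeholders.

import hashlib

import requests

URL = "https://example.com"  # placeholder: page to monitor
HASH_FILE = "last_hash.txt"  # local file that remembers the previous run

# Fetch the page and compute a fingerprint of its contents.
text = requests.get(URL, timeout=30).text
digest = hashlib.sha256(text.encode("utf-8")).hexdigest()

# Compare against the fingerprint from the last run, if one exists.
try:
    with open(HASH_FILE) as f:
        previous = f.read().strip()
except FileNotFoundError:
    previous = None

if previous and previous != digest:
    print("The page has changed since the last check.")

# Save the current fingerprint for the next run.
with open(HASH_FILE, "w") as f:
    f.write(digest)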

Please be advised that if you are collecting data from web pages, forums, social media, or other web materials for research purposes and the work may constitute human subjects research, you must consult with and follow the appropriate UW-Madison Institutional Review Board process, as well as follow their guidelines on “Technology & New Media Research”.

Link Roundup November 2019


Jennifer Patiño

World Digital Preservation Day is November 7th and this year’s theme is “At-Risk Digital Materials.”

Researchers at the University of Hawaiʻi at Mānoa uncovered a glitch in a computer program that produced different results depending on the operating system, possibly affecting more than 100 published studies. A good reminder to make sure you have a detailed README file for any code you create!

Wired reports on a study in Science that revealed racial bias in a widely used algorithm that assigned lower levels of care to Black patients in U.S. hospitals. The study shows how, by focusing on healthcare costs, the algorithm replicated existing disparities in access to care, and it offers suggestions for reformulating the algorithm.

Kent Emerson

Researchers at UW-Madison’s Wisconsin Institute for Discovery, working on a project called Wisconsin Expansion of Renewable Electricity with Optimization under Long-term Forecasts (WEREWOLF), are producing mathematical models that will help policy makers make decisions about the future of Wisconsin’s renewable energy resources.

The Roy Rosenzweig Center for History and New Media at George Mason University is celebrating its 25th anniversary. During this time, the RRCHNM has produced some of the most widely used open-source digital resources, including Omeka, Zotero, and Tropy, as well as discrete art and art history projects.

Link Roundup October 2019


In this series, members of the RDS team share links to research data-related stories, resources, and news that caught their eye each month. Feel free to share your favorite stories with us on Twitter @UWMadRschSvcs!

Cameron Cook

UW-Madison’s Information Technology office has kindly put together 3 Tips to Manage Google Drive. These tips are designed to help you manage your “personal” and UW-Madison G Suite accounts.

October is National Cybersecurity Awareness Month and the Office of Cybersecurity will be hosting a series of informational events throughout the month.

Clare Michaud

In “Managing 100 Digital Humanities Projects: Digital Scholarship & Archiving in King’s Digital Lab,” the authors outline the process of managing digital humanities projects at King’s College London and stress the importance of partnerships between libraries, IT, and researchers in the creation of successful and sustainable digital projects.

Kent Emerson

In their annual report, “Cultivating Princeton’s Data Landscape”, the Center for Digital Humanities @Princeton reflects on its 2018-2019 “Year of Data”. Throughout the year, the CDH hosted a keynote address by Safiya Noble and workshops for students and faculty, and served as a hub for connecting researchers, teachers, and resources.

An Introduction to Web Archiving for Research

Web archiving is the practice of collecting and preserving resources from the web. The best-known and most widely used web archive is the Internet Archive’s Wayback Machine. The Internet Archive was launched in 1996 by Brewster Kahle with the mission of providing “Universal Access to All Knowledge.” The Wayback Machine uses an automated process called crawling to collect pages from across the web and stores them on servers at the Internet Archive’s headquarters in San Francisco.

Institutions such as government agencies, universities, and libraries also actively archive the web, though often with a narrower collection scope. There are also many web archiving projects run by smaller teams and individual researchers, and these too usually have specific areas of focus. If there are web resources you are interested in collecting and preserving, then with a little research and some practice with the tools you can absolutely create your own web archive; the sketch below shows one small programmatic starting point.
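
The hedged Python sketch below queries the Internet Archive’s public Wayback Machine availability API for the archived snapshot of a page closest to a chosen date. The target URL and date are placeholders, and the sketch assumes the third-party requests package is installed.

import requests

TARGET_URL = "https://example.com"  # placeholder: page you want a copy of
TIMESTAMP = "20191101"              # preferred capture date, YYYYMMDD

# Ask the Wayback Machine for the snapshot closest to the given date.
response = requests.get(
    "https://archive.org/wayback/available",
    params={"url": TARGET_URL, "timestamp": TIMESTAMP},
    timeout=30,
)
response.raise_for_status()

snapshot = response.json().get("archived_snapshots", {}).get("closest")
if snapshot and snapshot.get("available"):
    print("Closest snapshot:", snapshot["url"])
    print("Captured at:", snapshot["timestamp"])
else:
    print("No archived snapshot found for", TARGET_URL)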

Please be advised that if you are archiving web pages, forums, social media, or other web materials for research purposes and the work may constitute human subjects research, you must consult with and follow the appropriate UW-Madison Institutional Review Board process, as well as follow their guidelines on “Technology & New Media Research”.