An Introduction to Web Scraping for Research

Like web archiving, web scraping is a process by which you can collect data from websites and save it for further research or preserve it over time. Also like web archiving, web scraping can be done through manual selection or it can involve the automated crawling of web pages using pre-programmed scraping applications.

Unlike web archiving, which is designed to preserve the look and feel of websites, web scraping is mostly used for gathering textual data. Most web scraping tools also allow you to structure the data as you collect it. So, instead of massive unstructured text files, you can transform your scraped data into spreadsheet, CSV, or database formats that allow you to analyze and use it in your research.
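As a minimal sketch of what "structuring the data as you collect it" can look like, the following Python snippet parses table rows out of HTML and writes them to a CSV file using only the standard library. In practice you would fetch the page with a scraping library; here the embedded HTML snippet, its table layout, and the output file name are all invented for illustration.

```python
import csv
from html.parser import HTMLParser

# Invented stand-in for HTML you might fetch from a real page.
SAMPLE_PAGE = """
<table>
  <tr><th>Title</th><th>Year</th></tr>
  <tr><td>Forum thread A</td><td>2019</td></tr>
  <tr><td>Forum thread B</td><td>2021</td></tr>
</table>
"""

class TableScraper(HTMLParser):
    """Collects each <tr> as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []        # all completed rows
        self._row = []        # cells of the row being built
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr":
            self.rows.append(self._row)
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._in_cell:
            self._row.append(data.strip())

scraper = TableScraper()
scraper.feed(SAMPLE_PAGE)

# Write the structured rows out as CSV for later analysis.
with open("scraped_data.csv", "w", newline="") as f:
    csv.writer(f).writerows(scraper.rows)
```

The same pattern scales up: whatever tool fetches the pages, the goal is to emit rows and columns rather than raw text, so the result loads directly into a spreadsheet or statistics package.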

There are many applications for web scraping. Companies use it for market and pricing research, weather services use it to track weather information, and real estate companies harvest data on properties. Researchers also use web scraping to study web forums and social media such as Twitter and Facebook, to gather large collections of data or documents published on the web, and to monitor changes to web pages over time. If you are interested in identifying, collecting, and preserving textual data that exists online, there is almost certainly a scraping tool that can fit your research needs.

Please be advised that if you are collecting data from web pages, forums, social media, or other web materials for research purposes, and that collection may constitute human subjects research, you must consult with and follow the appropriate UW-Madison Institutional Review Board process as well as follow their guidelines on “Technology & New Media Research”.

An Introduction to Web Archiving for Research

Web archiving is the practice of collecting and preserving resources from the web. The best-known and most widely used web archive is the Internet Archive’s Wayback Machine. The Internet Archive was launched in 1996 by Brewster Kahle with the mission of providing “Universal Access to All Knowledge.” The Wayback Machine uses an automated process called crawling to collect pages from all over the web and stores them on servers at the Internet Archive headquarters in San Francisco.

Institutions such as government agencies, universities, and libraries also actively archive the web, but often with a narrower collection scope. There are also many web archiving projects run by smaller teams and individual researchers, and these too usually have specific areas of focus. If there are web resources you are interested in collecting and preserving, then with a little research and some time learning the tools, you can absolutely create your own web archive.

Please be advised that if you are archiving web pages, forums, social media, or other web materials for research purposes, and that archiving may constitute human subjects research, you must consult with and follow the appropriate UW-Madison Institutional Review Board process as well as follow their guidelines on “Technology & New Media Research”.

Tool: Tabula

Information adapted from the Tabula website.

What is Tabula?

If you’ve ever needed data that exists only in PDF format, you’ve likely discovered that you can’t easily copy and paste it, which makes actually using the data difficult. Tabula is a free, open-source tool you can use for “liberating data tables locked inside PDF files.”

For an example of Tabula being used to extract data for a visualization project, check out this blog post by the Jane Speaks Initiative. Other examples can also be found on the Tabula website.

What can Tabula help you do?

Tabula runs in your web browser, making it easy to browse to the PDF containing the data you need, select the portion of the PDF containing the data tables, and then extract the data from the tables into a CSV file or a Microsoft Excel spreadsheet.

How do you get it?

You can download Tabula for free from its website. It is also available on GitHub.

What else should you know?

Tabula works only with text-based PDFs; the developers note that it will not work with scanned documents. Tabula is available for Windows, Mac OS X, and Linux operating systems.