Like web archiving, web scraping is a process by which you can collect data from websites and save it for further research or preserve it over time. Also like web archiving, web scraping can be done through manual selection or it can involve the automated crawling of web pages using pre-programmed scraping applications.
Unlike web archiving, which is designed to preserve the look and feel of websites, web scraping is mostly used for gathering textual data. Most web scraping tools also allow you to structure the data as you collect it, so instead of massive unstructured text files, you can transform your scraped data into spreadsheet, CSV, or database formats that allow you to analyze and use it in your research.
There are many applications for web scraping. Companies use it for market and pricing research, weather services use it to track weather information, and real estate companies harvest data on properties. Researchers also use web scraping to study web forums and social media such as Twitter and Facebook, to gather large collections of data or documents published on the web, and to monitor changes to web pages over time. If you are interested in identifying, collecting, and preserving textual data that exists online, there is almost certainly a scraping tool that can fit your research needs.
Please be advised that if you are collecting data from web pages, forums, social media, or other web materials for research purposes and that work may constitute human subjects research, you must consult the appropriate UW-Madison Institutional Review Board, follow its review process, and follow its guidelines on “Technology & New Media Research”.
How it Works
The web is filled with text. Some of that text is organized in tables, populated from databases, altogether unstructured, or trapped in PDFs. Most text, though, is structured according to HTML or XHTML markup tags, which instruct browsers how to display it. These tags are designed to make text appear in readable ways on the web, and, like web browsers, web scraping tools can interpret them and use them to locate and collect the text they contain.
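As a brief illustration, consider how a scraper reads the same tags a browser uses for display. The snippet below is a minimal sketch using the Beautiful Soup library (one of the Python tools discussed under Web Scraping Tools below); the sample HTML and the tag names are just placeholders.

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# A small, made-up HTML snippet like one a browser would render.
html = """
<html><body>
  <h1>Community Meeting Notes</h1>
  <ul>
    <li>Budget discussion</li>
    <li>Park maintenance</li>
  </ul>
</body></html>
"""

# The same tags that tell the browser how to display the page
# tell the scraper where each piece of text lives.
soup = BeautifulSoup(html, "html.parser")
print(soup.h1.get_text())                          # "Community Meeting Notes"
print([li.get_text() for li in soup.find_all("li")])
```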
Web Scraping Tools
The most crucial step in starting a web scraping project is selecting a tool that fits your research needs. Web scraping tools range from manual browser plug-ins, to desktop applications, to purpose-built libraries within popular programming languages. Their features and capabilities vary widely and require different investments of time and learning. Some tools require subscription fees, but many are free and open source.
Browser Plug-in Tools: These tools install as plug-ins to your Chrome or Firefox browser. They often require more manual work, in that you, the user, move through the pages and select what you want to collect. Popular options include:
- Scraper: a Chrome plug-in
- Web Scraper.io: available for Chrome and Firefox
Programming Languages: For large-scale, complex scraping projects, the best option is sometimes a purpose-built library within a popular programming language. These tools require more up-front learning, but once set up, they run as largely automated processes. It’s important to remember that you don’t always need to be a programming expert to set up and use these tools, and there are often tutorials that can help you get started. Some popular tools designed for web scraping include the following (a brief sketch using one of them appears after the list):
- Scrapy and Beautiful Soup: Python libraries
- rvest: a package in R
- Apache Nutch: a Java-based web crawler
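To give a sense of what working with these libraries looks like, here is a minimal sketch using the requests and Beautiful Soup Python libraries to collect headings from a page and save them as a CSV file. The URL and the tags being collected are placeholders; in a real project you would point the script at the pages and elements you actually need (and confirm you have permission to collect them).

```python
import csv
import requests                     # pip install requests
from bs4 import BeautifulSoup       # pip install beautifulsoup4

# Placeholder URL -- substitute a page you have permission to scrape.
url = "https://example.com/articles"

response = requests.get(url, timeout=30)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Collect the text of every <h2> heading on the page (a placeholder
# target -- pick whichever tags hold the data you need).
rows = [[h2.get_text(strip=True)] for h2 in soup.find_all("h2")]

# Write the structured results to a CSV file for later analysis.
with open("scraped_headings.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["heading"])
    writer.writerows(rows)
```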
Desktop Applications: Downloading one of these tools to your computer often provides familiar interface features and easy-to-learn workflows. These tools are often quite powerful, but they are designed for enterprise contexts and sometimes come with data storage or subscription fees.
Application Programming Interface (API): Technically, a web scraping tool works like an Application Programming Interface (API) in that it helps the client (you, the user) interact with data stored on a server (the text). It’s helpful to know that large companies like Google, Amazon, Facebook, and Twitter often have their own APIs that can help you gather their data. Using these ready-made tools can sometimes save time and effort and may be worth investigating before you initiate a project.
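As a rough illustration of the difference, requesting data from an API usually means asking the server for structured data directly rather than parsing HTML. The endpoint and fields below are hypothetical; real services such as Twitter’s or Google’s APIs publish their own URLs, authentication requirements, and usage limits.

```python
import requests  # pip install requests

# Hypothetical JSON endpoint -- real APIs document their own URLs and
# usually require an API key or other authentication.
url = "https://api.example.com/v1/posts"
params = {"query": "web scraping", "limit": 10}

response = requests.get(url, params=params, timeout=30)
response.raise_for_status()

# APIs typically return structured data (often JSON), so there is no
# HTML to parse -- the fields arrive ready to use.
for post in response.json():
    print(post["id"], post["text"])
```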
The Ethics of Web Scraping
As part of its introduction to web scraping (using Python), Library Carpentry has produced a detailed set of resources on the ethics of web scraping. These include explicit delineations of what is and is not legal, as well as helpful guidelines and best practices for collecting data produced by others. The page also includes a Web Scraping Code of Conduct that provides quick advice about the most responsible ways to approach projects of this kind.
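One common courtesy recommended in guidelines of this kind is to respect a site’s robots.txt file, which states where automated crawlers are and are not welcome. A minimal check using Python’s standard library might look like the sketch below; the site and path are placeholders.

```python
from urllib import robotparser

# Placeholder site -- substitute the site you plan to scrape.
parser = robotparser.RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()

# Only proceed if the site's robots.txt allows crawlers to fetch this path.
if parser.can_fetch("*", "https://example.com/some/page"):
    print("Allowed to fetch this page.")
else:
    print("robots.txt asks crawlers not to fetch this page.")
```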
Overall, it’s important to remember that because web scraping involves collecting data produced by others, you need to consider all the potential privacy and security implications of a project. Before you begin, make sure you understand what constitutes sensitive data on campus, and reach out to both your IT department and the IRB so that you have a data management plan in place before collecting data from any websites.