An Introduction to Web Archiving for Research

Web archiving is the practice of collecting and preserving resources from the web. The best-known and most widely used web archive is the Internet Archive’s Wayback Machine. The Internet Archive was launched in 1996 by Brewster Kahle with the mission of providing “Universal Access to All Knowledge.” The Wayback Machine uses an automated process called crawling to collect pages from all over the web and stores them on servers at the Internet Archive headquarters in San Francisco.

Institutions such as government agencies, universities, and libraries also actively archive the web, but often with a narrower collection scope. There are also many web archiving projects run by smaller teams and individual researchers, and these too usually have specific areas of focus. If there are web resources you are interested in collecting and preserving, you can absolutely create your own web archive with a little research and some practice with the tools.

Please be advised that if you are archiving web pages, forums, social media, or other web materials for research purposes, and the work may constitute human subjects research, you must consult and follow the appropriate UW-Madison Institutional Review Board process as well as the board’s guidelines on “Technology & New Media Research.”

The Purpose of Web Archiving

Web archives have many similarities with traditional format archives in that their mission is to collect resources for research and cultural memory. Like traditional archives, web archives are often accompanied by collection development policies, finding aids, metadata, and a host of other familiar features. But web archives differ from other archives because the resources they collect are online and are frequently changed, removed, or orphaned (not reachable by any links). These differences introduce many complications into the process of archiving the web, but the most persistent challenge is capturing resources that are constantly in flux, being updated, relocated, or redesigned. For this reason, web archiving is an ongoing process, and pages are often crawled on continuing schedules designed to capture their different iterations.

It is worth noting that web archiving is meant to preserve web pages as they appeared when they were live. This means that much of the data the process collects is in the form of images, video, and web design elements. If you are primarily interested in just the text that appears on Twitter or online message boards, for instance, you may want to investigate web scraping using Python or other similar tools.
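As a rough illustration of the text-only approach, the sketch below pulls the visible text out of a page using only Python's standard library. The sample page and the names TextExtractor and extract_text are invented for this example; a real project would first download the page and would likely need site-specific handling.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # greater than 0 while inside a skipped tag
        self.chunks = []       # pieces of visible text, in page order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the visible text of an HTML string as one space-joined line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page content, standing in for a downloaded forum page.
SAMPLE_PAGE = """
<html><head><title>Forum</title><style>p { color: red; }</style></head>
<body><h1>Thread title</h1><p>First post.</p>
<script>var x = 1;</script><p>Second post.</p></body></html>
"""

print(extract_text(SAMPLE_PAGE))
# prints: Forum Thread title First post. Second post.
```

Note what is lost: the styling, scripts, and layout that a web archive would preserve are discarded here, which is the trade-off between scraping and archiving.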

The Practice of Web Archiving

Because web resources require frequent collection, there are several automated tools for performing crawls on preset schedules. The most widely used and accessible web archiving tool is Archive-It, developed by the Internet Archive. Using Archive-It is relatively straightforward, and with a little testing and practice, most users can start their own archive quite quickly (though it does require a subscription for data storage). There are other tools available as well. Web Recorder is an open-source program created by Rhizome.org for collecting and viewing web resources, and like Archive-It, it is meant for everyone to use.

If you find yourself choosing between these two tools, know that Archive-It is best suited for:

collecting a larger array of sites,
crawling them on set schedules,
and capturing their look and feel.

Web Recorder is designed to:

capture more dynamic resources, such as social media sites and sites with a lot of videos or GIFs,
and support a more manual process of selecting the resources to collect.

There are other tools for creating web archives, so if Archive-It and Web Recorder don’t meet your needs, don’t give up.

The Ethics of Web Archiving

Web archiving often involves the collection of resources created by others, so it is common practice to ask for permission before beginning crawls. This can be tricky because it’s not always clear who owns a site or who created the content on a given page, and even when the owner or creator is clearly listed, their contact information may not be up to date.

In general, if you contact a page admin and don’t hear back, you can initiate your crawl and make the data private later if they object to the collection. If administrators prefer that their pages not be crawled, they can also publish a file called robots.txt on their site that tells crawlers which resources not to collect. If one of the sites you attempt to crawl is using robots.txt to exclude crawlers, your tool will notify you that the crawl has been prevented.
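To see how a robots.txt file is read, here is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt content, the crawler names, and the example.org URLs are all made up for illustration; this is not how Archive-It or Web Recorder implement the check.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real crawler would fetch this file
# from the root of the site it intends to crawl.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: example-archiver
Disallow: /
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# "my-research-crawler" and "example-archiver" are invented user-agent names.
checks = [
    ("my-research-crawler", "https://example.org/index.html"),
    ("my-research-crawler", "https://example.org/private/notes.html"),
    ("example-archiver", "https://example.org/index.html"),
]
for agent, url in checks:
    verdict = "allowed" if parser.can_fetch(agent, url) else "blocked"
    print(f"{agent} -> {url}: {verdict}")
```

In this sample, every crawler is asked to stay out of /private/, and the crawler named example-archiver is asked to stay out of the whole site.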

The other major concern is collecting users’ information without their consent. This is most often an issue on social media sites such as Twitter and Facebook, or on online forums and message boards like Reddit. In these cases, it’s worth asking yourself whether the information you’re collecting from these locations justifies the privacy concerns and, if the answer is yes, ensuring you follow appropriate data protection practices once you’ve collected it. Before you collect any websites, make sure you understand what constitutes sensitive data on campus, and reach out to both your IT department and your IRB so that you have a data management plan in place.

Things to Keep in Mind: Data

Web archives can get massive in a hurry, and because the crawling process is automated, you may find yourself with a huge amount of data without realizing it. This is particularly true if you’re collecting pages full of video, images, or large PDF documents. You can keep your archive at a manageable size by capping the data your crawler captures, and by using precise URLs to describe your resource rather than broader domains with many subpages that may not be essential to collect.
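The two size controls described above, a data cap and a precise seed URL, can be sketched in a few lines of Python. Everything here is hypothetical: the function names in_scope and select_within_budget, the example.edu URLs, and the file sizes are invented for illustration, and real tools expose these controls through their own settings rather than through code like this.

```python
from urllib.parse import urlsplit


def in_scope(url, seed):
    """Return True when url is on the seed's host and at or below the seed's path."""
    u, s = urlsplit(url), urlsplit(seed)
    if u.hostname != s.hostname:
        return False
    seed_path = s.path.rstrip("/")
    return u.path == seed_path or u.path.startswith(seed_path + "/")


def select_within_budget(candidates, seed, max_bytes):
    """Keep in-scope URLs in order, stopping once the next one would exceed the cap.

    candidates is a list of (url, size_in_bytes) pairs.
    Returns the kept URLs and their combined size in bytes.
    """
    kept, total = [], 0
    for url, size in candidates:
        if not in_scope(url, seed):
            continue  # outside the seed path, so never collected
        if total + size > max_bytes:
            break     # data cap reached, stop collecting
        kept.append(url)
        total += size
    return kept, total


# A precise seed (one section of a site) rather than the whole domain.
SEED = "https://example.edu/library/news/"
CANDIDATES = [
    ("https://example.edu/library/news/2020.html", 40_000),
    ("https://example.edu/athletics/video.mp4", 900_000_000),
    ("https://example.edu/library/news/report.pdf", 5_000_000),
    ("https://example.edu/library/news/lecture.mp4", 700_000_000),
    ("https://example.edu/library/news/2021.html", 45_000),
]

kept, total = select_within_budget(CANDIDATES, SEED, max_bytes=100_000_000)
print(kept, total)
```

With the narrow seed, the large athletics video is never considered, and the cap stops the collection before the large lecture video is added. Whether a crawler stops entirely at its cap or skips oversized items and continues is a design choice that varies by tool; this sketch stops.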

You should also think carefully about the purpose of your web archiving project. As this Collection Development Policy from Stanford outlines, defining a web archive is as much about deciding what not to collect as it is about what you will be collecting.