An Introduction to Web Archiving for Research

Web archiving is the practice of collecting and preserving resources from the web. The best-known and most widely used web archive is the Internet Archive’s Wayback Machine. The Internet Archive was launched in 1996 by Brewster Kahle with the mission of providing “Universal Access to All Knowledge.” The Wayback Machine uses an automated process called crawling to collect pages from all over the web and stores them on servers at the Internet Archive headquarters in San Francisco.

Institutions such as government agencies, universities, and libraries also actively archive the web, but often with a narrower collection scope. There are also many web archiving projects run by smaller teams and individual researchers, and these too usually have specific areas of focus. If there are web resources you are interested in collecting and preserving, you can create your own web archive with a little research and some practice with the tools.

Please be advised that if you are archiving web pages, forums, social media, or other web materials for research purposes and it may constitute human subjects research, you must consult with and follow the appropriate UW-Madison Institutional Review Board process as well as follow their guidelines on “Technology & New Media Research”.

Tools: Archiving Electronic Lab Notebooks

Electronic Lab Notebooks are becoming important data management tools for researchers in a number of fields. Since ELNs are replacing paper lab notebooks in many labs, can we anticipate a future in which boxes and shelves of decades-old notebooks are replaced with a digital archive of ELN entries? Since ELNs are relative newcomers to the data management ecosystem, some basic discussion about what an ELN archive should contain seems relevant.

There are four general types of data “assets” that can be recorded in an ELN, and each has a separate set of considerations for archiving.

1. Notebook pages/entries and folders

In ELNs, pages and entries are containers in which text, symbols, equations, and other entities are entered using tools in the ELN interface. ELN pages/entries may be further organized within folders in the notebook.

What needs to be preserved?

All the information entered in ELN notebook fields, including tags and comments. In addition, the organizational structure of each page and the hierarchical structure of folders and subfolders need to be preserved. Therefore, an export package should include notebook page files in formats such as XML, HTML, or PDF that preserve the content, appearance, and layout of notebook pages and folders. It should also retain the naming schemas and folder hierarchies used within the notebook.
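One way to check that an export package kept its folder hierarchy and file names is to list every file by its path relative to the export root and compare that list with the notebook's structure. The following is a minimal sketch under stated assumptions: it assumes the export has already been unpacked to a local directory, and the function name `build_manifest` is hypothetical rather than part of any ELN product.

```python
from pathlib import Path


def build_manifest(export_root):
    """List every file in an export package by its path relative to the root.

    The relative paths carry both the folder hierarchy and the file names,
    so the list can be compared against the notebook's own structure.
    """
    root = Path(export_root)
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob("*")
        if path.is_file()
    )
```

Saving the resulting list alongside the export gives later readers a simple record of how the notebook was organized at the time of archiving.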

2. Attached data files

These are data files and documents that were not created in the ELN interface but uploaded to the ELN platform and attached to an ELN entry. These can include things like images, spreadsheets, and data files from lab instruments. ELN platforms generally allow the user to add annotations and comments and associate them with these data files.

What needs to be preserved?

All the data files in their original, native formats plus any annotations added in the ELN interface. Annotations and comments should be preserved either as separate files linked to the data files or as components of page/entry files in the export package, rather than by altering the data files themselves. If multiple versions of individual files were attached to an ELN entry/page, metadata about the versions, including dates, should also be preserved.
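The "separate files linked to the data files" option can be illustrated with a sidecar file: the data file is copied unchanged, and its annotations and version history are written to a companion JSON file next to it. This is only a sketch. The function name `export_attachment`, the `.annotations.json` suffix, and the field names are all assumptions made for illustration, not a format used by any particular ELN platform.

```python
import json
import shutil
from pathlib import Path


def export_attachment(source, export_dir, annotations, versions):
    """Copy a data file unchanged and write its annotations to a sidecar file."""
    source = Path(source)
    export_dir = Path(export_dir)
    export_dir.mkdir(parents=True, exist_ok=True)

    # Copy the file as-is; copy2 also tries to keep file timestamps.
    copied = export_dir / source.name
    shutil.copy2(source, copied)

    # Hypothetical sidecar layout: annotations and version metadata live
    # beside the data file instead of being written into it.
    sidecar = export_dir / (source.name + ".annotations.json")
    record = {
        "data_file": source.name,
        "annotations": annotations,
        "versions": versions,
    }
    sidecar.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return copied, sidecar
```

Keeping the annotations in a sidecar means the archived data file stays byte-for-byte identical to what the instrument or researcher originally produced.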

3. Linked data files

These are files and documents that are linked to an ELN entry but reside on other systems such as lab or department servers.

What needs to be preserved?

Although linked files are located outside the ELN platform, an archive of all the data associated with a notebook should include a record of the server address of each linked file plus evidence of whether that location is still accurate at the time of archiving. One mechanism to ensure that the file associated with an ELN entry is still the same file is to generate a checksum using a common algorithm such as MD5 or SHA-1 and store it with the file location. Ideally, the ELN platform would manage this checksum generation. In addition, it would be beneficial for the ELN platform to perform periodic link checking even before archiving is done, to confirm the continued presence of the remote file.
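The checksum and link-checking idea can be sketched with Python's standard `hashlib` module. This sketch assumes the linked file is reachable as a filesystem path (for example, on a mounted lab share); the function names and record fields are hypothetical, and an ELN platform that managed this itself would likely do it differently.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def file_checksum(path, algorithm="sha1", chunk_size=65536):
    """Return the hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def linked_file_record(location, algorithm="sha1"):
    """Record where a linked file lives and what it hashed to right now."""
    path = Path(location)
    found = path.is_file()
    return {
        "location": str(location),
        "found": found,
        "algorithm": algorithm,
        "checksum": file_checksum(path, algorithm) if found else None,
        "checked_at": datetime.now(timezone.utc).isoformat(),
    }


def still_valid(record):
    """Link check: the file must still exist and hash to the stored value."""
    path = Path(record["location"])
    if not path.is_file():
        return False
    return file_checksum(path, record["algorithm"]) == record["checksum"]
```

Calling `linked_file_record` when the link is created and `still_valid` on a schedule would flag files that have been moved, deleted, or changed before the notebook is archived. Passing `algorithm="md5"` selects the other algorithm mentioned above.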

4. Metadata

This is information about the provenance of an ELN page/entry and includes things such as date and time, name of the individual creating/editing, the version history of attached data files, etc.

What needs to be preserved?

Provenance information that is viewable in the ELN interface should be included in archives of the ELN pages. More detailed metadata is contained in log files collected on the database and application servers of the ELN platform. Some components of this information, such as those that provide evidence of user access and actions, may also need to be preserved in an ELN archive.