HathiTrust Featured Collections Project

At the intersection of digital libraries, big data, and academic impact, our team had the opportunity to contribute to a meaningful project in support of the University of Wisconsin–Madison Libraries. Our mission was clear: create a workflow to help surface and spotlight digitized materials contributed by UW–Madison to the HathiTrust Digital Library and help librarians identify items that could be featured in curated digital collections and datasets.

Why This Matters

UW–Madison is a key contributor to the HathiTrust Digital Library, having added over 600,000 digitized texts since 2006. However, due to the overall size of the undertaking and the complicated variety of rights statuses, there was no streamlined way to isolate our campus’s contributed items or easily determine which were public domain, restricted, or open access. This limited their use in UW–Madison-branded featured digital collections and in curated datasets that could be made available for computational research.

What We Did

To highlight UW–Madison’s contributions within the vast HathiTrust repository, our team developed a practical, metadata-driven workflow. We first identified all records marked with our campus code (wisc), isolating materials contributed by UW–Madison. Next, we categorized these materials based on their access rights: public, restricted, or public domain.
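The filtering and categorizing steps above can be sketched in a few lines of Python. This is only an illustration, not the project's actual code: the column names (`record_id`, `campus_code`, `access_status`) and the sample rows are made up for the example, and the real HathiTrust metadata files may label these fields differently.

```python
import csv
import io
from collections import Counter

# Made-up sample rows. Column names are assumptions for illustration,
# not the actual HathiTrust metadata field names.
SAMPLE_TSV = (
    "record_id\tcampus_code\taccess_status\n"
    "rec1\twisc\tpublic domain\n"
    "rec2\tother\tpublic domain\n"
    "rec3\twisc\trestricted\n"
    "rec4\twisc\tpublic\n"
)


def filter_by_campus(rows, campus_code="wisc"):
    """Keep only the rows whose campus code matches the given code."""
    return [row for row in rows if row["campus_code"] == campus_code]


def count_by_access(rows):
    """Count how many rows fall into each access category."""
    return Counter(row["access_status"] for row in rows)


# Parse the tab-separated sample into a list of dictionaries.
rows = list(csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t"))

wisc_rows = filter_by_campus(rows)
access_counts = count_by_access(wisc_rows)

print(len(wisc_rows), dict(access_counts))
```

With a real metadata file, the same two functions could be applied to rows read from disk one at a time, so the full dataset would not need to fit in memory.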

Due to the massive scale of the dataset (over 18 million records), we built a SQLite database, enabling quick and flexible queries to find specific materials easily. We then linked HathiTrust digital records to UW–Madison Libraries’ physical collections using Alma catalog metadata (via the oclc_num identifier). Finally, we organized the records by campus libraries such as Memorial and College Library to simplify thematic curation by librarians.
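To make the database and linking steps more concrete, here is a minimal sketch using Python's built-in `sqlite3` module. Only the `oclc_num` identifier comes from our workflow as described; the table names (`hathi`, `alma`), the other column names, and the sample rows are hypothetical and stand in for the real data.

```python
import sqlite3

# An in-memory database keeps the sketch self-contained; a real run
# would point at a file on disk instead.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: only oclc_num is taken from the workflow described.
conn.executescript(
    """
    CREATE TABLE hathi (
        record_id TEXT PRIMARY KEY,
        oclc_num TEXT,
        access_status TEXT
    );
    CREATE TABLE alma (
        oclc_num TEXT,
        library TEXT
    );
    -- Indexes on the join key keep lookups fast as the tables grow.
    CREATE INDEX idx_hathi_oclc ON hathi (oclc_num);
    CREATE INDEX idx_alma_oclc ON alma (oclc_num);
    """
)

# Made-up sample rows for illustration only.
conn.executemany(
    "INSERT INTO hathi VALUES (?, ?, ?)",
    [
        ("rec1", "1001", "public domain"),
        ("rec3", "1002", "restricted"),
        ("rec4", "1003", "public domain"),
        ("rec5", "1004", "public domain"),  # no matching Alma row
    ],
)
conn.executemany(
    "INSERT INTO alma VALUES (?, ?)",
    [
        ("1001", "Memorial Library"),
        ("1002", "College Library"),
        ("1003", "College Library"),
    ],
)


def public_domain_by_library(connection):
    """Count public-domain HathiTrust records per library via oclc_num."""
    query = """
        SELECT a.library, COUNT(*)
        FROM hathi AS h
        JOIN alma AS a ON a.oclc_num = h.oclc_num
        WHERE h.access_status = 'public domain'
        GROUP BY a.library
        ORDER BY a.library
    """
    return connection.execute(query).fetchall()


counts = public_domain_by_library(conn)
print(counts)
```

Because this is an inner join, HathiTrust records with no matching Alma entry (such as `rec5` here) drop out of the result. A `LEFT JOIN` would keep them if unmatched items also need review.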

What It Enabled

Our project uncovered many previously overlooked public-domain materials uniquely contributed by UW–Madison. Librarians gained clearer, faster paths to curating thematic digital collections. The workflows we created to isolate institutional contributions from the HathiTrust dataset and to merge HathiTrust records with Alma catalog records also yielded a dataset that is much more manageable and robust. Ultimately, this helped the libraries gain insight into past contributions and make informed decisions about which materials to prioritize for featured collections and datasets.

What We Learned

This project deepened our understanding of large-scale digital libraries and the value of well-structured metadata. We gained practical experience using Python and databases to clean and manage extensive datasets. Crucially, we learned how automation streamlines workflows involving millions of records. Most importantly, we saw firsthand how robust data infrastructure directly supports librarians and curators in making real-world decisions and enhancing public access to valuable knowledge.

Looking Ahead

Projects like this remind us that digital libraries aren’t just about preservation; they’re about participation. With better workflows that leverage metadata and access frameworks, libraries can become even more powerful spaces for discovery, equity, and collaboration. For those in the digital humanities, metadata management, or academic libraries, we hope this work highlights the impact of technical infrastructure in empowering curation, access, and community storytelling.

Rahil Virani is currently pursuing his Master’s Degree in Information Science. Through his role as a Research Data Analyst & Initiatives Assistant for UW–Madison Libraries, he provides technical and data support for researchers and teams, particularly focusing on MINDS@UW and Research Data Services (RDS). His passion lies in improving library digital services through automation and data-driven research, aiming to enhance their efficiency and their reach within the diverse UW student community. His work involves gathering data and delivering actionable insights to support strategic planning and targeted initiatives.

Research Data Services (RDS) is an interdisciplinary organization committed to advancing research data management practice on the UW–Madison campus. We focus on providing researchers with the tools and resources that support their efforts to store, analyze, and share data.