Researchers getting started with text analysis often face two barriers. The first is that copyright restrictions can make it difficult to access datasets for text analysis. The second is the steep learning curve involved in writing your own algorithms, especially when you cannot see how existing ones operate. Luckily, tools available through the UW-Madison Libraries can help researchers navigate both barriers. Gale’s Digital Scholar Lab and HathiTrust’s HTRC Analytics allow scholars to build datasets from content available through UW-Madison Libraries and analyze them with the algorithms provided. The Libraries also have access to the Clarivate Web of Science dataset and provide example code and a tutorial for working with it.
Digital Scholar Lab
Digital Scholar Lab is an online tool from the publisher Gale that allows researchers to build their own collections and analyze them as datasets. The tool makes content from UW-Madison Libraries’ subscription to Gale Primary Sources available for text analysis and visualization using built-in, commonly used text analysis functions, such as named entity recognition, topic modeling, part-of-speech tagging, and others.
By linking their Google or Microsoft account, researchers can push content from Digital Scholar Lab into their cloud storage space. Digital Scholar Lab also provides easy-to-follow tutorials and sample projects, so it can be especially helpful to those just getting started with text analysis.
HTRC Analytics
UW-Madison is a member of HathiTrust, a partnership of academic and research institutions that offers a collection of millions of titles digitized from libraries around the world. Through our membership, UW-Madison researchers can use the HathiTrust Research Center’s (HTRC) tools and services to access the full corpus for text mining and non-consumptive research, that is, computational analysis that does not involve reading or displaying substantial portions of the texts, while avoiding intellectual property misuse.
Researchers can build their own collections or use the worksets and derived data provided by HTRC. Through HTRC Analytics, the main portal for analyzing the corpus, researchers can run off-the-shelf algorithms and set up a secure computing environment called a data capsule. The data capsule environment also allows researchers to import their own code and export derived data that meets HTRC’s definitions of non-consumptive use.
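To give a sense of what "derived data" can look like, here is a minimal Python sketch that reduces a text to token counts, the kind of summary that describes a text without reproducing it. This is an illustration only: it does not use any HTRC API, the function names `token_counts` and `counts_to_csv` and the sample sentence are made up for this example, and whether a particular output qualifies as non-consumptive is determined by HTRC's own policies, not by this code.

```python
import csv
import io
import re
from collections import Counter


def token_counts(text: str) -> Counter:
    """Lowercase the text, split it into word tokens, and count each token."""
    # Hypothetical tokenizer: runs of letters, with an optional apostrophe part.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(tokens)


def counts_to_csv(counts: Counter) -> str:
    """Serialize counts as CSV rows (token, count), most frequent first."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["token", "count"])
    # Sort by descending count, then alphabetically so ties are stable.
    for token, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        writer.writerow([token, count])
    return buffer.getvalue()


# Made-up sample text standing in for a volume in a collection.
sample = "The whale surfaced. The crew watched the whale."
counts = token_counts(sample)
print(counts_to_csv(counts))
```

The output is a small table of words and frequencies rather than the text itself, which is the general idea behind exporting derived data instead of full content.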
Web of Science
Through the Big Ten Academic Alliance, member institutions have access to the raw data for the Clarivate Web of Science database. To aid academic and noncommercial use of the data, the Libraries offer support, including code examples and tutorials. Researchers can use the Web of Science Explorer to parse article records in the dataset, and the CHTC Recipes project shows how to work with the Web of Science data within the Center for High Throughput Computing’s (CHTC) environment. The Libraries also offer a step-by-step tutorial on getting started with these two projects. To begin using the dataset, reach out to the UW-Madison Libraries’ Library Technology Group through their technical assistance form.
Are there any other tools that you’ve found helpful in getting started with text analysis? Let us know at @UWMadRschSvcs.