Natural language processing, or NLP, refers to the computer-aided analysis of text data. Because text data is typically generated by humans and intended for human audiences, it poses unique challenges for computational analysis: all the contextual information humans use to resolve the ambiguities in one another’s speech is very difficult to reproduce for computers. However, recent advances in NLP tools and in related technologies such as artificial intelligence and machine learning have made these tools not only extremely sophisticated and powerful, but also far more accessible to researchers in all disciplines.
Applications for NLP
NLP has applications across disciplines. Researchers working in the natural and social sciences or the humanities, or, really, anyone working with large amounts of text data, can benefit from these techniques. Below are some of its primary capabilities:
- Text mining and data extraction
- Sentiment analysis
- Grammatical and linguistic analysis
- Named entity recognition
- Question answering
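At their simplest, capabilities like text mining and sentiment analysis amount to tokenizing text and counting things. The sketch below illustrates this in plain Python with no NLP library at all; the positive and negative word lists are invented for demonstration, not a real sentiment lexicon.

```python
import re
from collections import Counter

# Toy word lists for illustration only -- real sentiment analysis uses
# curated lexicons or trained models, not hand-picked sets like these.
POSITIVE = {"excellent", "powerful", "useful"}
NEGATIVE = {"difficult", "ambiguous", "slow"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def word_frequencies(text, top_n=5):
    """Return the top_n most common tokens: a minimal text-mining step."""
    return Counter(tokenize(text)).most_common(top_n)

def sentiment_score(text):
    """Count positive lexicon hits minus negative lexicon hits."""
    tokens = tokenize(text)
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)

sample = "NLP tools are powerful and useful, but text can be ambiguous."
print(word_frequencies(sample))
print(sentiment_score(sample))  # 2 positive hits, 1 negative hit -> 1
```

Real NLP packages replace each of these steps with far more robust versions, but the underlying workflow of tokenizing, counting, and scoring is the same.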
NLP tools are most frequently deployed using programming languages that offer highly customizable and powerful ways to analyze text data. Several languages support this kind of work and, chances are, if you have some programming experience, you can find text-analysis packages for a language you already know. A few of the more popular NLP packages are:
Stanford CoreNLP: A Java package that is among the most sophisticated and highly developed NLP tools available. Among other capabilities, it features award-winning part-of-speech tagging, named entity recognition, and grammatical parsing, as well as the ability to work with Arabic, Chinese, French, German, and Spanish text data. The package is freely available for download, but does require some Java knowledge.
The Natural Language Toolkit: The NLTK is the most widely used Python library for natural language analysis. It allows you to tokenize text, tag parts of speech, produce parse trees, identify named entities, and much more. If your work focuses on literary or historical texts, there is also the freely available NLTK Book, a widely used companion text in digital humanities research that features rich tutorials and covers the library’s sophisticated grammar and sentence-structure analysis tools.
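As a minimal sketch of what working with the NLTK looks like, the example below tokenizes a text and counts word frequencies. It assumes the NLTK is installed (`pip install nltk`) and deliberately uses components that need no separate data downloads; richer features such as `nltk.pos_tag` or `nltk.ne_chunk` require fetching their data with `nltk.download(...)` first.

```python
# Tokenize and count words with the NLTK. TreebankWordTokenizer is
# rule-based, so unlike nltk.word_tokenize it needs no downloaded data.
from nltk import FreqDist
from nltk.tokenize import TreebankWordTokenizer

text = "The quick brown fox jumps over the lazy dog. The dog sleeps."
tokens = TreebankWordTokenizer().tokenize(text.lower())

# FreqDist is the NLTK's frequency-counting class.
freq = FreqDist(tokens)
print(freq.most_common(3))
```

From here, the NLTK Book walks through tagging, parsing, and named entity recognition on the same kind of token lists.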
spaCy: Another Python library that performs many of the same functions as the NLTK, but is optimized for speed and memory use, with a focus on productivity and fast results.
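A quick sketch of spaCy’s pipeline style, assuming spaCy is installed (`pip install spacy`): a blank English pipeline needs no downloaded model and demonstrates tokenization only, while tagging and entity recognition require loading a pretrained model such as `en_core_web_sm`.

```python
import spacy

# spacy.blank("en") builds a bare English pipeline: tokenizer only,
# no pretrained model required. For POS tags and entities you would
# instead use spacy.load("en_core_web_sm") after downloading that model.
nlp = spacy.blank("en")
doc = nlp("spaCy processes text quickly and efficiently.")
tokens = [token.text for token in doc]
print(tokens)
```

Everything in spaCy flows through `Doc` objects like this one, which is part of what makes its pipelines fast to work with.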
If you’re familiar with R, there are also many tools you can use for NLP tasks. The R tools are less centralized than the Java and Python packages, but this overview from the Comprehensive R Archive Network (CRAN) is a great place to start when looking for the particular NLP tools you can use within R.
It’s also worth noting that if you’re familiar with one of these languages but want to use a tool from another, a little extra research and work will often turn up a way to bridge the two.
Recogito: A digital humanities tool that, in addition to a host of other useful features, offers push-button named entity recognition built on Stanford CoreNLP. While it doesn’t yet offer other NLP features, its NER output can be easily exported in a variety of formats, including familiar ones like CSV and JSON as well as RDF linked data and TEI. Excitingly, a work-in-progress spaCy JSON option would prepare the data for more sophisticated analysis using the spaCy machine learning Python package mentioned above.
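Once entity annotations are exported as CSV, they can be loaded for further analysis with nothing but Python’s standard library. The column names below are hypothetical, not Recogito’s actual export schema; adjust them to match the header row of your own file.

```python
import csv
import io

# Hypothetical CSV export of named-entity annotations (the "entity",
# "type", and "document" columns are invented for this illustration).
sample_csv = """entity,type,document
Vienna,PLACE,letters-1848.txt
Metternich,PERSON,letters-1848.txt
Vienna,PLACE,diary-1849.txt
"""

# csv.DictReader maps each row to a dict keyed by the header row;
# with a real file you would pass an open file object instead.
entities = list(csv.DictReader(io.StringIO(sample_csv)))
places = [row["entity"] for row in entities if row["type"] == "PLACE"]
print(places)  # -> ['Vienna', 'Vienna']
```

From a structure like this, it is a short step to counting entities per document or feeding the annotations into one of the NLP packages above.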