NLP Overview - haltosan/RA-python-tools Wiki

Explanation

For our purposes, NLP (Natural Language Processing) means using computer programs to extract useful information from text sources, such as automatically finding names and dates in obituaries. This can be difficult: while it's easy to pick out a person's name by hand, it's harder to write code that does it for you.

There are a few strategies we use to find as much useful data as possible:

Regex

A regex (short for 'regular expression') defines a pattern for the program to look for in text. You can tell it to look for four-digit numbers (probably a year, like '1966'), two capitalized words next to each other (probably a name, like 'John Doe'), and so forth.
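As a rough illustration of the two patterns just described, here is a minimal sketch using Python's built-in re library. The sample sentence and the variable names are made up for this example, and the patterns are deliberately simple.

```python
import re

# Made-up sample text for illustration only.
text = "John Doe was born in 1966 and died in 2020."

# Exactly four digits with word boundaries on both sides (probably a year).
year_pattern = re.compile(r"\b\d{4}\b")

# Two capitalized words separated by a single space (probably a name).
name_pattern = re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b")

years = year_pattern.findall(text)
names = name_pattern.findall(text)

print(years)  # ['1966', '2020']
print(names)  # ['John Doe']
```

Note the "probably": any four-digit number (a house number, for example) would match the year pattern just as well.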

Here are some good regex reference pages:

For how to write regular expressions in Python: Regular Expression HOWTO (Python 3 docs)

For how to use the Python re library (which has everything for regex): re — Regular expression operations (Python 3 docs)

For testing your regex: Regex 101

Our own page on regex: wiki

NER

NER (Named Entity Recognition) means identifying named entities in text, such as people, geographic locations, and dates. There are Python libraries that already do this (see the spaCy page). They're easy to use if you stick with pre-trained models, which work well enough in many situations.

What's the difference?

A library with NER is good at finding most of the information we're looking for in a wide variety of circumstances. For instance, it picks out almost all the people and dates in an obituary, while correctly identifying other proper nouns as locations or organizations instead of people. (Most of the time...it's far from perfect.) However, it's harder to customize or fine-tune for specific cases: it just runs and returns the entities it finds. Unless you train a model yourself, you have to make do with what it gives you.

Regex is almost the exact opposite. It's easy to write expressions that do one very specific thing and to change them as necessary. If you only want to capture information that has a specific format (like a full date, for instance), regex is reliable and wonderful. It's hard to write one expression that covers every variation, however. For example, an expression that finds 'John Doe' will miss 'John R. Doe' if you forget to account for initials with periods. The same regex might also identify 'Mount Vernon' or 'On Tuesday' as a name, since both are two capitalized words, after all.
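The pitfalls described above can be reproduced in a few lines. This is only a sketch: the sentence and both patterns are invented for the example, and the second pattern handles just one variation (a single middle initial followed by a period).

```python
import re

# Made-up sentence containing a real name plus two look-alikes.
text = "On Tuesday John R. Doe of Mount Vernon passed away."

# Naive pattern: two capitalized words separated by a space.
simple_name = re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b")

# Same pattern, but allowing an optional middle initial such as "R. ".
name_with_initial = re.compile(r"\b[A-Z][a-z]+ (?:[A-Z]\. )?[A-Z][a-z]+\b")

simple_matches = simple_name.findall(text)
initial_matches = name_with_initial.findall(text)

# The naive pattern misses the actual name and returns two false positives.
print(simple_matches)   # ['On Tuesday', 'Mount Vernon']

# Allowing the initial finds the name, but the false positives remain.
print(initial_matches)  # ['On Tuesday', 'John R. Doe', 'Mount Vernon']
```

Fixing the missed name is a small change to the pattern; telling 'Mount Vernon' apart from a person is not something the pattern alone can do.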

One approach is to let NER do the initial finding and categorizing, then use regex to keep only the entities you want. The current obituary notebook works this way.
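A minimal sketch of that two-step idea follows. It is not the notebook's actual code. To keep it runnable without installing anything, the NER step is replaced by a hand-written, hypothetical list of (text, label) pairs; the labels imitate the ones spaCy's pre-trained English models use (PERSON, GPE, DATE), and the entity values are invented.

```python
import re

# Hypothetical stand-in for NER output: (entity text, label) pairs.
# In practice these would come from an NER library rather than being typed in.
entities = [
    ("John R. Doe", "PERSON"),
    ("Mount Vernon", "GPE"),
    ("March 3, 1966", "DATE"),
    ("Tuesday", "DATE"),
    ("1966", "DATE"),
]

# Regex for a full date written as "Month D, YYYY".
months = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
full_date = re.compile(rf"(?:{months}) \d{{1,2}}, \d{{4}}")

# Step 1: rely on the NER labels to separate people from everything else.
people = [text for text, label in entities if label == "PERSON"]

# Step 2: use regex to keep only the DATE entities that are full dates.
full_dates = [
    text
    for text, label in entities
    if label == "DATE" and full_date.fullmatch(text)
]

print(people)      # ['John R. Doe']
print(full_dates)  # ['March 3, 1966']
```

Here NER handles the part regex is bad at (deciding that 'Mount Vernon' is a place, not a person), and regex handles the part NER is bad at (insisting on one exact date format).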