NLP Notebooks - haltosan/RA-python-tools GitHub Wiki

Existing Jupyter Notebooks for NLP

Over time, we've amassed a collection of NLP notebooks that function slightly differently. The main ones I (S.S.) know about are available in my work folder.

Original

This notebook is useful for lists of entries -- school directories or anything in a phonebook-style format, where every line is a separate piece of data. It runs regex on each line separately, outputting a row in a csv file for every line that matches. The columns for those rows are the named capturing groups of the regex. Non-matches are written to their own file for debugging purposes.
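A minimal sketch of the per-line approach described above, not the notebook's actual code. The pattern, the sample lines, and the helper names (parse_lines, rows_to_csv) are made up for illustration. The real notebook writes to files, while this sketch keeps everything in memory.

```python
import csv
import io
import re

# Hypothetical pattern for phonebook-style lines such as "Smith, John 555-1234".
# Each named capturing group becomes a CSV column.
LINE_PATTERN = re.compile(
    r"^(?P<last>[A-Za-z]+), (?P<first>[A-Za-z]+) (?P<phone>\d{3}-\d{4})$"
)


def parse_lines(lines, pattern):
    """Run the pattern on each line separately.

    Returns (rows, misses): one dict per matching line, plus the
    non-matching lines so they can be inspected for debugging.
    """
    rows, misses = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines entirely
        match = pattern.match(line)
        if match:
            rows.append(match.groupdict())
        else:
            misses.append(line)
    return rows, misses


def rows_to_csv(rows, pattern):
    """Serialize rows as CSV text, using the named groups as the header."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=list(pattern.groupindex), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


sample = ["Smith, John 555-1234", "???", "Doe, Jane 555-9876"]
rows, misses = parse_lines(sample, LINE_PATTERN)
csv_text = rows_to_csv(rows, LINE_PATTERN)
```

Here the line "???" ends up in misses instead of the CSV, which mirrors the separate non-match file the notebook produces.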

Multi-Line

The multi-line notebook is made for cases where the normal OCR setting makes columns run together. For this style of column,

Name Birth City Dorm Address
Jenny Cavanaugh New York City 48 Q.
Louise Ann Walters Harmony, PA 52 P.

the lines would be output as "Jenny Cavanaugh New York City 48 Q" and "Louise Ann Walters Harmony, PA 52 P". This makes NLP difficult because the name and location are smashed together, and regex has a hard time telling them apart. Are the name and city of the first line "Jenny Cavanaugh" and "New York City", or "Jenny Cavanaugh New" and "York City"? Is the second line split up "Louise Ann"/"Walters Harmony" or "Louise Ann Walters"/"Harmony"?

This notebook expects a different OCR setting (e.g., Option 12) that outputs each column on its own line. The first entry would instead be printed as

"Jenny Cavanaugh

New York City

48 Q."

This is useful because the text from each column is now separated by a newline, which the regex can use to tell the columns apart (the newlines act like the checkout dividers at the grocery store register). For the regex to do this, it must be compiled in multiline mode, which makes ^ and $ match at the start and end of every line rather than only at the start and end of the whole text. You enable it by including the corresponding flag: re.compile('regex goes here', flags=re.MULTILINE)

This approach didn't work with the previous notebook because it runs the regex on each line individually. This notebook loads all the text in together and goes through it, finding matches as it goes. Because of this, you must make sure your regex is smart about when it starts and stops (it can mess up following lines if it goes too far or stops too early).
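As a rough illustration of running one multiline regex over the whole text, here is a sketch using the two example entries from above. The pattern and the group names (name, birth_city, dorm_address) are assumptions chosen for this sample, not the notebook's own regex.

```python
import re

# OCR output with each column on its own line, as in the example above.
text = (
    "Jenny Cavanaugh\n\n"
    "New York City\n\n"
    "48 Q.\n\n"
    "Louise Ann Walters\n\n"
    "Harmony, PA\n\n"
    "52 P.\n"
)

# Hypothetical pattern: three non-empty lines separated by one or more
# newlines. The last group is deliberately strict (digits, a space, a
# capital letter, a period) so each match stops at the end of one entry
# and does not swallow the start of the next.
RECORD = re.compile(
    r"^(?P<name>[^\n]+)\n+"
    r"(?P<birth_city>[^\n]+)\n+"
    r"(?P<dorm_address>\d+ [A-Z]\.)$",
    flags=re.MULTILINE,
)

# finditer walks through the full text, resuming after each match.
records = [match.groupdict() for match in RECORD.finditer(text)]
```

If the final group were as loose as the first two, a single bad entry could shift every following match by a line, which is the start/stop problem described above.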

See the notebook itself for more documentation.

Multi-File, Multi-Line

The multi-file, multi-line notebook is similar to the original multi-line notebook, but it allows you to specify column outputs that change with each input file. For example, if you were inputting two files of college students -- one of seniors and one of sophomores -- you could have a class standing column fill with 'Senior' for all students in the first file, and 'Sophomore' for all students in the second file. It's basically a way to avoid filling in columns by hand afterward.
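One way to picture the per-file columns is the sketch below. It assumes each input is paired with a dict of constant column values. The function name parse_inputs, the class_standing column name, the pattern, and the sample text are all hypothetical, and in practice each text would be read from its own file.

```python
import re

# Hypothetical two-column pattern: a name line followed by a city line.
RECORD = re.compile(
    r"^(?P<name>[^\n]+)\n+(?P<birth_city>[^\n]+)$", flags=re.MULTILINE
)


def parse_inputs(inputs, pattern):
    """Parse several texts, adding each one's constant columns to its rows.

    `inputs` is a list of (text, constants) pairs, where `constants` maps
    column names to the value every row from that text should receive.
    """
    rows = []
    for text, constants in inputs:
        for match in pattern.finditer(text):
            row = match.groupdict()
            row.update(constants)  # same value for every row of this input
            rows.append(row)
    return rows


# Stand-ins for the contents of a seniors file and a sophomores file.
seniors_text = "Jenny Cavanaugh\nNew York City\n"
sophomores_text = "Louise Ann Walters\nHarmony, PA\n"

rows = parse_inputs(
    [
        (seniors_text, {"class_standing": "Senior"}),
        (sophomores_text, {"class_standing": "Sophomore"}),
    ],
    RECORD,
)
```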

Splitting

The splitting notebook is made for OCR with separators between each source file's text. For example, we ran Layout Parser on a folder full of obituary images. In the output file, the text obtained from each image was separated with a "PAGE END" message. The notebook uses these separators to split the OCR output into parts -- one for each source file. It then runs regex on each part separately.
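The splitting step can be sketched in a few lines. This assumes the separator is the literal text "PAGE END"; the function name split_ocr_output and the sample text are invented for illustration.

```python
SEPARATOR = "PAGE END"


def split_ocr_output(text, separator=SEPARATOR):
    """Split combined OCR output into one part per source file.

    Surrounding whitespace is trimmed, and empty parts (for example,
    after a trailing separator) are dropped.
    """
    parts = [part.strip() for part in text.split(separator)]
    return [part for part in parts if part]


ocr_output = (
    "First obituary text.\n"
    "PAGE END\n"
    "Second obituary text.\n"
    "PAGE END\n"
)
parts = split_ocr_output(ocr_output)
```

Each element of parts can then be passed to the regex step on its own.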

For these types of sources (obituaries written in paragraph form, etc.), it's often useful to run multiple regular expressions on each one. Instead of one gigantic regex that must account for a lot of variation (the order in which names and dates are mentioned, for example), you can use multiple smaller expressions to pick out pieces of information individually.

There are also different ways to run and output regex that are useful in different situations. This notebook allows you to define as many regular expressions as you'd like in three modes:

Search

Returns only the first match from each text portion. Useful when you are looking for a specific piece of information that only appears once (or which you only want to capture once). An example would be the title of an obituary or the first name mentioned in the text. Outputs the entire matched string in one column.

Findall

Returns all matches from each text portion. Useful for information you want to find every instance of. An example would be finding all names and dates in each obituary. Outputs all matches in one column as a semicolon-separated list.

Group-Separated

Uses search (so only the first match is returned), but pays attention to the named capturing groups in the regex. Useful for information with parts you wish to separate (such as the day, month, and year of the first date mentioned). Outputs the text from each named capturing group in its own column.
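The three modes can be approximated with Python's re module as follows. This is a sketch under assumptions: the function names, the date pattern, the sample sentence, and the exact "; " separator are illustrative rather than taken from the notebook.

```python
import re

# Hypothetical date pattern with named groups for the parts of a date.
DATE = re.compile(r"(?P<month>[A-Z][a-z]+) (?P<day>\d{1,2}), (?P<year>\d{4})")


def run_search(pattern, text):
    """Search mode: the entire first match as one string ('' if none)."""
    match = pattern.search(text)
    return match.group(0) if match else ""


def run_findall(pattern, text):
    """Findall mode: every full match, joined into one semicolon-separated cell.

    finditer is used instead of re.findall because findall returns only
    the groups when the pattern contains capturing groups.
    """
    return "; ".join(match.group(0) for match in pattern.finditer(text))


def run_group_separated(pattern, text):
    """Group-separated mode: first match only, one column per named group."""
    match = pattern.search(text)
    if match:
        return match.groupdict()
    return {name: "" for name in pattern.groupindex}


text = "Mary Jones died on March 3, 1952. She was born on June 12, 1870."

first_date = run_search(DATE, text)
all_dates = run_findall(DATE, text)
date_parts = run_group_separated(DATE, text)
```

With this sample, search mode yields only the first date, findall mode yields both dates in one cell, and group-separated mode splits the first date into month, day, and year columns.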

See the notebook for more documentation.

With spaCy

This notebook combines spaCy's Named Entity Recognition (NER) with the techniques from the splitting notebook. It's useful for sources that vary too much for regex (see the NLP Overview page for the strengths/weaknesses of spaCy and regex).

The notebook separates the OCR output into the pieces that came from each source file. It then runs spaCy on each piece, identifying the entities mentioned in the text. Right now, it picks up the names, dates, and locations, but it can identify other types of entities as well.

After that, the notebook uses user-defined regex to look through the lists of entities and output the first matching entry. It's similar to the search and group-separated methods in the splitting notebook, but the regex is run on the entity lists instead of the text itself. The purpose of this is to utilize spaCy's ability to tell entities such as names and locations apart, while picking out the useful information in its lists (spaCy returns a mix of wanted and unwanted information that must be sorted through).
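A sketch of the entity-then-regex idea follows. The helper names, the date pattern, and the sample entity lists are assumptions, and the model name en_core_web_sm is only one common choice that must be installed separately. The entity lists below are hand-written stand-ins for spaCy output so that the filtering step can be read on its own.

```python
import re


def load_pipeline(model_name="en_core_web_sm"):
    """Load a spaCy pipeline (imported lazily; requires spaCy and the model)."""
    import spacy

    return spacy.load(model_name)


def extract_entities(text, nlp):
    """Group entity texts by label, e.g. {'PERSON': [...], 'DATE': [...]}."""
    entities = {}
    for ent in nlp(text).ents:
        entities.setdefault(ent.label_, []).append(ent.text)
    return entities


def first_matching_entity(entity_texts, pattern):
    """Return the first entity the pattern matches, or '' if none match."""
    for entity in entity_texts:
        if pattern.search(entity):
            return entity
    return ""


# Hypothetical filter: keep only entities that look like a full date.
FULL_DATE = re.compile(r"[A-Z][a-z]+ \d{1,2}, \d{4}")

# Hand-written example of what the DATE and PERSON lists might contain;
# NER output typically mixes wanted and unwanted entries like this.
sample_entities = {
    "DATE": ["yesterday", "March 3, 1952", "June 12, 1870"],
    "PERSON": ["Mary Jones"],
}

first_full_date = first_matching_entity(sample_entities["DATE"], FULL_DATE)
```

With a real pipeline, sample_entities would instead come from extract_entities(part, load_pipeline()) for each part of the split OCR output.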

See the notebook for more documentation.