spaCy - haltosan/RA-python-tools GitHub Wiki

Overview

spaCy is a Python library for natural language processing (NLP). It has a lot of capabilities we haven't tried yet (see the features section on the spaCy 101 page), but so far we've used it for Named Entity Recognition, or NER. spaCy's trained models can pick out entities such as people, locations, and dates in a given text.

It's similar to regex in that we can extract information from sources like obituaries, but there are some key differences to consider when using it for a project. In general, spaCy appears to be better at free-form text that doesn't follow specific patterns, such as obituaries or anything in paragraph form. If the source does have a strict and predictable format, like a directory or school register, regex would likely perform better. (spaCy is not perfect at differentiating a name from a location, but if you know the location always follows the name on each line, you could tell them apart easily with regex.)
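As a rough illustration of the structured case, here is a minimal regex sketch. The line layout (a name, a comma, then a location) and the sample lines are made up for this example and are not taken from one of our sources.

```python
import re

# Hypothetical directory layout: "<name>, <location>" on every line.
LINE_PATTERN = re.compile(r"^(?P<name>[^,]+),\s*(?P<location>.+)$")


def parse_directory_line(line):
    """Return (name, location) for a line in the assumed format, or None."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None
    return match.group("name").strip(), match.group("location").strip()


# Made-up sample lines; the last one does not fit the assumed layout.
sample_lines = [
    "John Smith, Provo",
    "Mary Jones, Salt Lake City",
    "no comma on this line",
]
parsed = [parse_directory_line(line) for line in sample_lines]
```

Because the position on the line decides which part is the name and which is the location, no model has to guess; lines that don't fit the layout come back as None so they can be reviewed by hand.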

It is possible to train a spaCy model and improve accuracy, but we haven't experimented with this yet.

Get

The spaCy website has clear pages with all the commands you need to set things up, so we link to them here instead of repeating the steps:

Install

spaCy is already installed in the record_linking environment. If you need to install it somewhere else, follow the Install spaCy page.

Download Trained Models

"en_core_web_sm" is already downloaded for the record_linking environment. For any other model, see the Trained Models and Pipelines page.
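If you're not sure whether a model is available in your current environment, a quick check like the one below may help. It relies on the fact that spaCy's trained models are installed as ordinary Python packages; `model_is_installed` is a hypothetical helper name for this sketch, not part of spaCy.

```python
import importlib.util


def model_is_installed(model_name):
    """Return True if a package with this name can be found in the environment."""
    return importlib.util.find_spec(model_name) is not None


if not model_is_installed("en_core_web_sm"):
    # The download command documented on the spaCy models page.
    print("Model missing; run: python -m spacy download en_core_web_sm")
```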

Use

If you're using the "en_core_web_sm" model, this is all you need:

Import

import spacy

Load

nlp = spacy.load("en_core_web_sm")

Or, if you only need NER, disable the other pipeline components when loading:

nlp = spacy.load("en_core_web_sm", disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"])

Use in Program

Here's an example of how to use spaCy in a program. This function returns lists of names, dates, and locations found in the input text.

def run_spacy(text):
    """Return lists of the names, dates, and locations spaCy finds in 'text'."""

    # lists that hold the named entities found in the text
    names = []
    dates = []
    locations = []

    # run the pipeline loaded above ('nlp'); the result is a spaCy Doc
    doc = nlp(text)

    # doc.ents holds the entity spans; each span may cover several words
    for ent in doc.ents:
        if ent.label_ == 'PERSON':
            names.append(ent.text)
        elif ent.label_ == 'DATE':
            dates.append(ent.text)
        elif ent.label_ == 'GPE':  # countries, cities, states
            locations.append(ent.text)

    return names, dates, locations
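To try the same label filtering without loading a model, the sketch below works on any objects that have `.text` and `.label_` attributes. `group_entities` and the stand-in entities are hypothetical names and made-up data for this example. It also drops repeated mentions, since the same name can appear more than once in a text.

```python
from types import SimpleNamespace


def group_entities(ents, labels=("PERSON", "DATE", "GPE")):
    """Group entity texts by label, keeping first-seen order and dropping repeats."""
    grouped = {label: [] for label in labels}
    for ent in ents:
        bucket = grouped.get(ent.label_)  # None for labels we don't track
        if bucket is not None and ent.text not in bucket:
            bucket.append(ent.text)
    return grouped


# Stand-ins for spaCy entity spans (made-up data, not real model output).
fake_ents = [
    SimpleNamespace(text="Jane Doe", label_="PERSON"),
    SimpleNamespace(text="March 3, 1950", label_="DATE"),
    SimpleNamespace(text="Ohio", label_="GPE"),
    SimpleNamespace(text="Jane Doe", label_="PERSON"),
    SimpleNamespace(text="Acme Corp", label_="ORG"),
]
grouped = group_entities(fake_ents)
```

With a loaded pipeline, calling `group_entities(nlp(text).ents)` should give a result of the same shape.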

For documentation on the EntityRecognizer class and its attributes, see the EntityRecognizer page in the spaCy API docs.