
Module 5: Named Entity Recognition & Sequence Labeling

This module covers methods for identifying and labeling named entities (e.g. persons, locations, organizations) in text, as well as more general sequence-labeling tasks.
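Throughout the module, token-level labels follow the standard BIO scheme: B-XXX marks the first token of an entity of type XXX, I-XXX marks a continuation of that entity, and O marks tokens outside any entity. A minimal illustration (the sentence and tags below are made up for demonstration):

# BIO-tagged example: "Barack Obama visited Paris"
tagged = [
    ('Barack',  'B-PER'),   # first token of a PERSON entity
    ('Obama',   'I-PER'),   # continuation of the same PERSON entity
    ('visited', 'O'),       # not part of any entity
    ('Paris',   'B-LOC'),   # single-token LOCATION entity
]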


5.1 spaCy Pretrained NER

Use spaCy’s small English model to extract entities out of the box:

import spacy

# 1. Load the spaCy model (install with: pip install spacy && python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')

# 2. Process text
text = "Apple is looking at buying U.K. startup for $1 billion"
doc  = nlp(text)

# 3. Print detected entities
for ent in doc.ents:
    print(ent.text, ent.label_)

Output:

Apple ORG
U.K. GPE
$1 billion MONEY
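Each entity also exposes character offsets, and spaCy can render entities inline in a notebook. A small follow-up sketch using the same doc as above (the displacy call is optional and assumes a Jupyter environment):

# Character offsets are useful for downstream annotation tools
for ent in doc.ents:
    print(ent.text, ent.label_, ent.start_char, ent.end_char)

# Optional visualisation in a notebook:
# from spacy import displacy
# displacy.render(doc, style="ent", jupyter=True)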

5.2 CRF-Based Sequence Labeling

Train a Conditional Random Field on toy NER data:

import sklearn_crfsuite

# 1. Toy labeled data: list of sentences, each a list of (word, tag)
train_sents = [
    [('John','B-PER'), ('lives','O'), ('in','O'), ('London','B-LOC')],
    [('Mary','B-PER'), ('works','O'), ('at','O'), ('Google','B-ORG')],
]
test_sents = [
    [('Alice','B-PER'), ('visited','O'), ('Paris','B-LOC')],
]

# 2. Feature functions (reuse from Module 4.4)
def word2features(sent, i):
    token = sent[i][0] if isinstance(sent[i], tuple) else sent[i]
    feats = {
        'bias': 1.0,
        'word.lower()': token.lower(),
        'word.istitle()': token.istitle(),
        'word.isupper()': token.isupper(),
    }
    if i > 0:
        prev = sent[i-1][0] if isinstance(sent[i-1], tuple) else sent[i-1]
        feats.update({'-1:word.lower()': prev.lower()})
    else:
        feats['BOS'] = True
    if i < len(sent)-1:
        nxt = sent[i+1][0] if isinstance(sent[i+1], tuple) else sent[i+1]
        feats.update({'+1:word.lower()': nxt.lower()})
    else:
        feats['EOS'] = True
    return feats

def sent2features(sent): return [word2features(sent, i) for i in range(len(sent))]
def sent2labels(sent):   return [tag for (_, tag) in sent]

X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s)   for s in train_sents]
X_test  = [sent2features(s) for s in test_sents]
y_test  = [sent2labels(s)   for s in test_sents]

# 3. Train CRF
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs', c1=0.1, c2=0.1,
    max_iterations=50, all_possible_transitions=True
)
crf.fit(X_train, y_train)

# 4. Predict and evaluate
y_pred = crf.predict(X_test)
print("Input:", [w for w,_ in test_sents[0]])
print("Predicted:", y_pred[0])

Output:

Input: ['Alice', 'visited', 'Paris']
Predicted: ['B-PER', 'O', 'B-LOC']
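Once trained, the CRF's learned weights can be inspected. The sketch below (reusing the crf object from above) prints the strongest transition and state-feature weights; on this toy corpus the numbers mean little, but on real data they reveal which tag sequences and word features the model relies on:

from collections import Counter

# transition_features_ maps (from_tag, to_tag) -> learned weight
print("Top transitions:")
for (tag_from, tag_to), w in Counter(crf.transition_features_).most_common(5):
    print(f"{tag_from:6} -> {tag_to:6} {w:.3f}")

# state_features_ maps (feature_name, tag) -> learned weight
print("Top state features:")
for (feat, tag), w in Counter(crf.state_features_).most_common(5):
    print(f"{feat:25} {tag:6} {w:.3f}")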

5.3 Transformer-Based NER with Hugging Face

Use a pretrained BERT‐style model for NER:

from transformers import pipeline

# 1. Explicitly specify model/tokenizer & aggregation strategy
ner = pipeline(
    task="ner",
    model="dbmdz/bert-large-cased-finetuned-conll03-english",
    tokenizer="dbmdz/bert-large-cased-finetuned-conll03-english",
    aggregation_strategy="simple",
    framework="pt",    # use PyTorch backend
    device=-1          # CPU
)

# 2. Run the pipeline
text = "Amazon founder Jeff Bezos visited Berlin University."
entities = ner(text)

# 3. Display results
for ent in entities:
    word  = ent['word']
    label = ent['entity_group']
    span  = f"{ent['start']}-{ent['end']}"
    print(f"{word:20} {label:6} {span}")

Output:

Amazon               ORG    0-6
Jeff Bezos           PER    15-25
Berlin University    ORG    34-51
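With aggregation_strategy="simple" each returned dict also carries a confidence score. A minimal sketch that keeps only high-confidence spans (the 0.90 threshold is an arbitrary value chosen for illustration):

# Filter aggregated entities by the pipeline's confidence score
confident = [e for e in entities if e['score'] >= 0.90]
for ent in confident:
    print(f"{ent['word']:20} {ent['entity_group']:6} {ent['score']:.3f}")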

5.4 Evaluation with seqeval

Compute precision/recall/F₁ for sequence labels:

# 1. Install seqeval (only needs to run once per kernel)
%pip install --quiet seqeval

# 2. Import
from seqeval.metrics import classification_report

# --- 
# At this point you should have:
#   y_test:  a list of sequences, e.g. [['B-PER','O','B-LOC'], ['O','B-ORG',...], ...]
#   y_pred:  a list (or numpy array) with the same shape, containing predicted tags.

# If y_test / y_pred are numpy arrays of shape (n_sents, max_len), the
# conversion below turns them into lists of lists of tag strings; if they
# are already Python lists of lists, it is a harmless no-op.
y_test_list = [list(seq) for seq in y_test]
y_pred_list = [list(seq) for seq in y_pred]

# 3. Print the seqeval classification report
print(classification_report(
    y_test_list,
    y_pred_list,
    zero_division=0   # report 0.0 for any undefined precision/recall instead of warning
))

Output:

A per-entity classification report with precision, recall, F1-score and support.
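As a self-contained sanity check, seqeval's entity-level metrics can also be called directly on hand-made tags (not taken from the models above), so the numbers are easy to verify by hand: the prediction below recovers the PER entity but misses the LOC entity, giving precision 1.0, recall 0.5 and F1 of roughly 0.67.

from seqeval.metrics import precision_score, recall_score, f1_score

y_true_toy = [['B-PER', 'O', 'B-LOC']]
y_pred_toy = [['B-PER', 'O', 'O']]      # the LOC entity is missed

print(precision_score(y_true_toy, y_pred_toy))  # 1.0  (1 predicted entity, 1 correct)
print(recall_score(y_true_toy, y_pred_toy))     # 0.5  (1 of 2 true entities found)
print(f1_score(y_true_toy, y_pred_toy))         # ~0.667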