
Module 3.1: Text Cleaning & Normalization

Before any modeling, raw text must be cleaned and normalized into a consistent form. This section covers:

  1. Unicode Normalization & Lowercasing
  2. Removing HTML, URLs & Emails
  3. Stripping Punctuation & Digits
  4. Collapsing Whitespace
  5. (Optional) Stop-word Removal
  6. Full Cleaning Pipeline

1. Unicode Normalization & Lowercasing

Normalize to Unicode NFC (or NFKC) so that visually identical characters have the same codepoints, then lowercase:

import unicodedata

def normalize_unicode(text: str) -> str:
    # NFC: canonical decomposition, followed by canonical composition
    return unicodedata.normalize('NFC', text)

def lowercase(text: str) -> str:
    return text.lower()

# Demo
raw = "Café — HELLO World! 𝒜𝓁𝓅𝒽𝒶"
clean = lowercase(normalize_unicode(raw))
print(clean)
# → "café — hello world! 𝒜𝓁𝓅𝒽𝒶"

2. Removing HTML, URLs & Emails

Strip out HTML tags, hyperlinks and email addresses:

import re

def remove_html(text: str) -> str:
    return re.sub(r'<[^>]+>', ' ', text)

def remove_urls_emails(text: str) -> str:
    # URLs
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text)
    # Emails
    text = re.sub(r'\S+@\S+\.\S+', ' ', text)
    return text

# Demo
raw = "<p>Visit https://example.com or mail [email protected]!</p>"
step = remove_urls_emails(remove_html(raw))
print(step)
# → " Visit   or mail  ! "

3. Stripping Punctuation & Digits

Replace punctuation with spaces, and optionally strip digits as well:

def remove_punct_digits(text: str, remove_digits: bool = True) -> str:
    # remove_digits=True keeps only letters and whitespace;
    # remove_digits=False keeps digits as well
    pattern = r'[^A-Za-z\s]' if remove_digits else r'[^A-Za-z0-9\s]'
    return re.sub(pattern, ' ', text)

# Demo
raw = "Call me at (555) 123-4567."
print(remove_punct_digits(raw, remove_digits=False))
# → "Call me at  555  123 4567 "

4. Collapsing Whitespace

Convert multiple spaces/tabs/newlines into a single space and trim:

def collapse_whitespace(text: str) -> str:
    return re.sub(r'\s+', ' ', text).strip()

# Demo
raw = "This   is \n\t an   example."
print(collapse_whitespace(raw))
# → "This is an example."

5. (Optional) Stop-word Removal

Filter out high-frequency words that carry little meaning:

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
STOP = set(stopwords.words('english'))

def remove_stopwords(tokens: list[str]) -> list[str]:
    return [t for t in tokens if t not in STOP]

# Demo
tokens = "this is an example showing the removal of stop words".split()
print(remove_stopwords(tokens))
# → ['example', 'showing', 'removal', 'stop', 'words']
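
NLTK's English stop-list also contains negations such as "not", "no" and "nor", which can matter for downstream tasks like sentiment analysis. A common adjustment, sketched here rather than prescribed (CUSTOM_STOP and the helper name are illustrative), is to keep selected words:

KEEP = {'not', 'no', 'nor'}
CUSTOM_STOP = STOP - KEEP

def remove_stopwords_custom(tokens: list[str]) -> list[str]:
    return [t for t in tokens if t not in CUSTOM_STOP]

print(remove_stopwords_custom("this is not an example".split()))
# → ['not', 'example']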

6. Full Cleaning Pipeline

Combine all steps into one function:

def clean_and_tokenize(text: str, remove_digits: bool = True, remove_sw: bool = True):
    # 1. Unicode & lowercase
    text = lowercase(normalize_unicode(text))
    # 2. HTML, URLs, emails
    text = remove_urls_emails(remove_html(text))
    # 3. Punctuation/digits
    text = remove_punct_digits(text, remove_digits)
    # 4. Whitespace
    text = collapse_whitespace(text)
    # 5. Tokenize
    tokens = text.split()
    # 6. Stop-words
    if remove_sw:
        tokens = remove_stopwords(tokens)
    return tokens

# Demo
raw = "<div>Hello World! Visit https://ex.com.</div>"
print(clean_and_tokenize(raw))
# → ['hello', 'world', 'visit']
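
A quick usage sketch, with two made-up documents, showing the pipeline applied to a small corpus:

docs = [
    "<p>First document: visit https://example.com</p>",
    "Second   document, sent from user@example.com",
]
print([clean_and_tokenize(d) for d in docs])
# → [['first', 'document', 'visit'], ['second', 'document', 'sent']]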

Continue to 3.2 Stemming & Lemmatization
