Module 3.1 Text Cleaning and Normalization
Before any modeling, raw text must be cleaned and normalized into a consistent form. This section covers:
- Unicode Normalization & Lowercasing
- Removing HTML, URLs & Emails
- Stripping Punctuation & Digits
- Collapsing Whitespace
- (Optional) Stop-word Removal
Normalize to Unicode NFC (or NFKC) so that visually identical characters have the same codepoints, then lowercase:
```python
import unicodedata

def normalize_unicode(text: str) -> str:
    # NFC: canonical decomposition, followed by canonical composition
    return unicodedata.normalize('NFC', text)

def lowercase(text: str) -> str:
    return text.lower()

# Demo
raw = "Café — HELLO World! 𝒜𝓁𝓅𝒽𝒶"
clean = lowercase(normalize_unicode(raw))
print(clean)
# → "café — hello world! 𝒜𝓁𝓅𝒽𝒶"
```
Strip out HTML tags, hyperlinks and email addresses:
```python
import re

def remove_html(text: str) -> str:
    return re.sub(r'<[^>]+>', ' ', text)

def remove_urls_emails(text: str) -> str:
    # URLs
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text)
    # Emails
    text = re.sub(r'\S+@\S+\.\S+', ' ', text)
    return text

# Demo
raw = "<p>Visit https://example.com or mail user@example.com!</p>"
step = remove_urls_emails(remove_html(raw))
print(step)
# → " Visit   or mail   "
# (the greedy email pattern also eats the trailing "!";
#  the leftover whitespace is collapsed in a later step)
```
Remove punctuation (every character that is not a letter, digit or space), and optionally the digits as well:
```python
def remove_punct_digits(text: str, remove_digits: bool = True) -> str:
    # Keep letters and whitespace; also keep digits when remove_digits is False
    pattern = r'[^A-Za-z\s]' if remove_digits else r'[^A-Za-z0-9\s]'
    return re.sub(pattern, ' ', text)

# Demo
raw = "Call me at (555) 123-4567."
print(remove_punct_digits(raw, remove_digits=False))
# → "Call me at  555  123 4567 "
print(remove_punct_digits(raw))
# → "Call me at" followed only by spaces (digits stripped too)
```
Convert multiple spaces/tabs/newlines into a single space and trim:
```python
def collapse_whitespace(text: str) -> str:
    return re.sub(r'\s+', ' ', text).strip()

# Demo
raw = "This is \n\t an example."
print(collapse_whitespace(raw))
# → "This is an example."
```
Filter out high-frequency words that carry little meaning:
```python
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
STOP = set(stopwords.words('english'))

def remove_stopwords(tokens: list[str]) -> list[str]:
    return [t for t in tokens if t not in STOP]

# Demo
tokens = "this is an example showing the removal of stop words".split()
print(remove_stopwords(tokens))
# → ['example', 'showing', 'removal', 'stop', 'words']
```
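NLTK's English list also contains negations ("not", "no", "nor") that can flip the meaning of a sentence, which matters for tasks such as sentiment analysis. One option, sketched below with a hypothetical `NEGATIONS` set, is to subtract them before filtering:

```python
NEGATIONS = {"not", "no", "nor"}     # stop words worth keeping for some tasks
STOP_KEEP_NEG = STOP - NEGATIONS

tokens = "this is not a good example".split()
print([t for t in tokens if t not in STOP_KEEP_NEG])
# → ['not', 'good', 'example']
```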
Combine all steps into one function:
```python
def clean_and_tokenize(text: str,
                       remove_digits: bool = True,
                       remove_sw: bool = True) -> list[str]:
    # 1. Unicode & lowercase
    text = lowercase(normalize_unicode(text))
    # 2. HTML, URLs, emails
    text = remove_urls_emails(remove_html(text))
    # 3. Punctuation/digits
    text = remove_punct_digits(text, remove_digits)
    # 4. Whitespace
    text = collapse_whitespace(text)
    # 5. Tokenize
    tokens = text.split()
    # 6. Stop-words
    if remove_sw:
        tokens = remove_stopwords(tokens)
    return tokens

# Demo
raw = "<div>Hello World! Visit https://ex.com.</div>"
print(clean_and_tokenize(raw))
# → ['hello', 'world', 'visit']
```
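To run the pipeline over a whole corpus, a plain list comprehension is enough; the two documents below are made-up examples:

```python
docs = [
    "<p>NEW offer!!! Visit https://deals.example.org now</p>",
    "Breaking   news:\tmarkets rallied on Friday.",
]
print([clean_and_tokenize(d) for d in docs])
# → [['new', 'offer', 'visit'], ['breaking', 'news', 'markets', 'rallied', 'friday']]
```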
Continue to 3.2 Stemming & Lemmatization