
Module 1.2: Tokenization – Regex & Rule-based Methods

Tokenization is the process of splitting raw text into meaningful units (“tokens”), such as words, punctuation, numbers, or special symbols. In this section, we’ll cover:

  • Simple regex-based tokenization
  • Enhanced regex for contractions & special tokens
  • NLTK’s word_tokenize
  • Custom rule-based tokenizers

1. Simple Regex Tokenizer

A simple regex tokenizer is a straightforward technique for splitting text into individual tokens (such as words or punctuation marks) using regular expressions (regex). It is a quick and efficient way to preprocess text for further analysis.

How It Works

  • Splitting Logic:
    The tokenizer uses a regular expression to define what counts as a “token.” Typically, it matches runs of word characters (letters, digits and underscores) and treats each remaining non-whitespace character (such as a punctuation mark) as a token of its own, discarding the whitespace in between.

Example Regex Pattern

A common pattern is \w+, which matches sequences of word characters, or \W+, which matches sequences of non-word characters. Splitting on \W+ discards the punctuation, whereas matching \w+ together with an alternative for single punctuation characters keeps words and punctuation as separate tokens.
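
A minimal sketch contrasting the two approaches (the sample text here is illustrative, not part of the demo script below):

import re

text = "Hello, world!"

# Splitting on runs of non-word characters discards the punctuation
# (and can leave an empty string at the edges):
print(re.split(r"\W+", text))            # ['Hello', 'world', '']

# Matching \w+ keeps only the words:
print(re.findall(r"\w+", text))          # ['Hello', 'world']

# Adding an alternative for single non-word, non-space characters
# keeps punctuation as separate tokens:
print(re.findall(r"\w+|[^\w\s]", text))  # ['Hello', ',', 'world', '!']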

# simple_regex_tokenizer_demo.py

import re

def simple_regex_tokenize(text):
    """
    Split text into tokens of word characters or standalone non-whitespace symbols.
    
    :param text: Input string to tokenize.
    :return: List of token strings.
    """
    # \w+ matches one or more word characters (letters, digits, underscore)
    # [^\w\s] matches any single character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

if __name__ == "__main__":
    # 1. Sample text
    text = "Hello, NLP world! Let's tokenize: words & punctuation."
    
    # 2. Tokenise
    tokens = simple_regex_tokenize(text)
    
    # 3. Display results
    print("Original text:")
    print(text)
    print("\nTokens:")
    print(tokens)

Output:

Original text:
Hello, NLP world! Let's tokenize: words & punctuation.

Tokens:
['Hello', ',', 'NLP', 'world', '!', 'Let', "'", 's', 'tokenize', ':', 'words', '&', 'punctuation', '.']

Note that this simple pattern splits the contraction "Let's" into three tokens ('Let', "'", 's'); the enhanced tokenizer in the next section keeps contractions intact.

2. Enhanced Regex for Contractions & URLs

This tokenizer keeps contractions (e.g. “don’t”) intact and preserves URLs, hashtags and @-mentions as single tokens:

# enhanced_regex_tokenizer_demo.py

import re

def enhanced_regex_tokenize(text):
    """
    Tokenize text into:
      - URLs (http/https)
      - Twitter‐style mentions (@user)
      - Hashtags (#tag)
      - Words with optional contractions (e.g. don't, it's)
      - Numbers (including decimals)
      - Any other single non-space character (punctuation, symbols)
    """
    pattern = r"""
      https?://\S+             # URLs
      |@[A-Za-z0-9_]+          # mentions
      |#[A-Za-z0-9_]+          # hashtags
      |[A-Za-z]+(?:'[A-Za-z]+)? # words with optional apostrophe
      |\d+(?:\.\d+)?           # integers or decimals
      |[^\s\w]                 # any other single non-space character
    """
    tokenizer = re.compile(pattern, re.VERBOSE)
    return tokenizer.findall(text)

if __name__ == "__main__":
    # Sample text
    text = "Check out https://example.com, it's awesome! #NLP @user"
    
    # Tokenise
    tokens = enhanced_regex_tokenize(text)
    
    # Display results
    print("Original text:")
    print(text)
    print("\nTokens:")
    print(tokens)

Output:

Original text:
Check out https://example.com, it's awesome! #NLP @user

Tokens:
['Check', 'out', 'https://example.com,', "it's", 'awesome', '!', '#NLP', '@user']
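
Because \S+ is greedy, the URL token above keeps the trailing comma. A possible refinement (illustrative only, not part of the demo above) is to match the URL non-greedily and stop before punctuation that is followed by whitespace or the end of the string:

import re

# Hypothetical tweak to the URL alternative: lazy quantifier plus a
# lookahead so trailing punctuation stays outside the URL token.
URL_PATTERN = r"https?://\S+?(?=[,.;:!?]*(?:\s|$))"

print(re.findall(URL_PATTERN, "Check out https://example.com, it's awesome!"))
# ['https://example.com']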

3. NLTK’s word_tokenize

Under the hood, NLTK’s word_tokenize first splits the text into sentences with the pre-trained “Punkt” model and then applies the Treebank word tokenizer, which handles many edge cases (abbreviations, contractions, punctuation):

import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')   # only needed once
text = "Dr. Smith isn't here. He'll come soon."
tokens = word_tokenize(text)
print(tokens)

Output:

['Dr.', 'Smith', 'is', "n't", 'here', '.', 'He', "'ll", 'come', 'soon', '.']

The abbreviation “Dr.” is kept as one token, and the contractions are split into their component pieces (is + n't, He + 'll).
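
Punkt itself is a sentence tokenizer: it learns which periods mark abbreviations rather than sentence boundaries. A minimal sketch of using it directly (assuming the same punkt download as above):

from nltk.tokenize import sent_tokenize

text = "Dr. Smith isn't here. He'll come soon."
print(sent_tokenize(text))
# Punkt recognises "Dr." as an abbreviation, so this should print two
# sentences rather than three:
# ["Dr. Smith isn't here.", "He'll come soon."]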

4. Custom Rule-Based Tokenizer

For domain-specific needs, you can build a cascade of rules. Example: merge multi-word keyphrases into single tokens, then apply the simple regex tokenizer from Section 1:

# custom_rule_based_tokenizer_demo.py

import re

# 1. Define your domain-specific multi-word keyphrases
KEYPHRASES = {
    "natural language processing",
    "machine learning"
}

def rule_based_tokenize(text):
    """
    Tokenize text by:
      1. Lowercasing and merging any defined multi-word keyphrases with underscores
      2. Splitting on word characters vs non-word/whitespace (simple regex)
      3. Restoring spaces in merged keyphrases
    """
    # 1. Lowercase for consistent keyphrase detection
    temp = text.lower()
    
    # 2. Merge each keyphrase into a single token using underscores
    for phrase in KEYPHRASES:
        merged = phrase.replace(" ", "_")
        temp = temp.replace(phrase, merged)
    
    # 3. Tokenise: sequences of \w+ or any non-word non-space character
    tokens = re.findall(r"\w+|[^\w\s]", temp)
    
    # 4. Restore spaces in keyphrases and return
    return [token.replace("_", " ") for token in tokens]

if __name__ == "__main__":
    # Sample text
    text = "Natural Language Processing and Machine Learning are fun!"
    
    # Tokenise
    tokens = rule_based_tokenize(text)
    
    # Display results
    print("Original text:")
    print(text)
    print("\nTokens:")
    print(tokens)

Output:

Original text:
Natural Language Processing and Machine Learning are fun!

Tokens:
['natural language processing', 'and', 'machine learning', 'are', 'fun', '!']
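
One limitation of this cascade is that it lowercases the whole text, so the original casing is lost. A possible variant (an illustrative sketch, not part of the handbook code) merges keyphrases case-insensitively and leaves the rest of the text untouched:

import re

KEYPHRASES = {
    "natural language processing",
    "machine learning"
}

def rule_based_tokenize_preserve_case(text):
    """Merge keyphrases case-insensitively without lowercasing the text."""
    for phrase in KEYPHRASES:
        # Build a case-insensitive pattern that allows any whitespace between the words
        pattern = re.compile(r"\s+".join(map(re.escape, phrase.split())), re.IGNORECASE)
        # Replace each match with an underscore-joined copy of itself
        text = pattern.sub(lambda m: "_".join(m.group(0).split()), text)
    # Tokenise as before, then restore the spaces inside merged keyphrases
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return [token.replace("_", " ") for token in tokens]

print(rule_based_tokenize_preserve_case(
    "Natural Language Processing and Machine Learning are fun!"
))
# ['Natural Language Processing', 'and', 'Machine Learning', 'are', 'fun', '!']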

Next: Continue to 1.3 Finite State Automata