Module 1.2: Tokenization – Regex & Rule-based Methods
Tokenization is the process of splitting raw text into meaningful units (“tokens”), such as words, punctuation, numbers, or special symbols. In this section, we’ll cover:
- Simple regex-based tokenization
- Enhanced regex for contractions & special tokens
- NLTK's `word_tokenize`
- Custom rule-based tokenizers
1. Simple Regex Tokenizer
A simple regex tokenizer is a straightforward NLP technique that splits text into individual tokens (such as words or punctuation marks) using regular expressions (regex). It is a quick, efficient way to preprocess text for further analysis.
How It Works
- Splitting logic: The tokenizer uses a regular expression to define what constitutes a “token.” Typically, it matches runs of word characters (letters, digits and underscores) and treats non-word characters such as punctuation as separate tokens, while whitespace is discarded.
Example Regex Pattern
A common pattern is `\w+`, which matches sequences of word characters, or `\W+`, which matches sequences of non-word characters. By splitting on these patterns, you can extract words and punctuation as separate tokens.
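A minimal sketch of the difference between the two patterns (assuming nothing beyond Python's built-in `re` module and an illustrative sample string):

```python
import re

sample = "Hello, world!"

# Keep runs of word characters; punctuation and whitespace are dropped
print(re.findall(r"\w+", sample))   # ['Hello', 'world']

# Split on runs of non-word characters; the trailing '!' leaves an empty string
print(re.split(r"\W+", sample))     # ['Hello', 'world', '']
```

The demo below combines both ideas into a single pattern so that punctuation is kept as separate tokens rather than discarded.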
```python
# simple_regex_tokenizer_demo.py
import re

def simple_regex_tokenize(text):
    """
    Split text into tokens of word characters or standalone non-whitespace symbols.

    :param text: Input string to tokenize.
    :return: List of token strings.
    """
    # \w+ matches one or more word characters (letters, digits, underscore)
    # [^\w\s] matches any single character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

if __name__ == "__main__":
    # 1. Sample text
    text = "Hello, NLP world! Let's tokenize: words & punctuation."

    # 2. Tokenise
    tokens = simple_regex_tokenize(text)

    # 3. Display results
    print("Original text:")
    print(text)
    print("\nTokens:")
    print(tokens)
```
Output:
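Running the script should print something close to the following (note how the apostrophe in “Let's” becomes a token of its own, which motivates the enhanced pattern in the next section):

```
Original text:
Hello, NLP world! Let's tokenize: words & punctuation.

Tokens:
['Hello', ',', 'NLP', 'world', '!', 'Let', "'", 's', 'tokenize', ':', 'words', '&', 'punctuation', '.']
```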
2. Enhanced Regex for Contractions & URLs
Handle contractions (e.g. “don't”) and preserve URLs, hashtags and @-mentions:
```python
# enhanced_regex_tokenizer_demo.py
import re

def enhanced_regex_tokenize(text):
    """
    Tokenize text into:
      - URLs (http/https)
      - Twitter-style mentions (@user)
      - Hashtags (#tag)
      - Words with optional contractions (e.g. don't, it's)
      - Numbers (including decimals)
      - Any other single non-space character (punctuation, symbols)
    """
    pattern = r"""
        https?://\S+               # URLs
        |@[A-Za-z0-9_]+            # mentions
        |\#[A-Za-z0-9_]+           # hashtags (escape '#' so re.VERBOSE does not treat it as a comment)
        |[A-Za-z]+(?:'[A-Za-z]+)?  # words with optional apostrophe
        |\d+(?:\.\d+)?             # integers or decimals
        |[^\s\w]                   # any other single non-space character
    """
    tokenizer = re.compile(pattern, re.VERBOSE)
    return tokenizer.findall(text)

if __name__ == "__main__":
    # Sample text
    text = "Check out https://example.com, it's awesome! #NLP @user"

    # Tokenise
    tokens = enhanced_regex_tokenize(text)

    # Display results
    print("Original text:")
    print(text)
    print("\nTokens:")
    print(tokens)
```
Output:
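With the sample sentence above, the script should produce output along these lines (the greedy `\S+` keeps the trailing comma attached to the URL, a trade-off of this simple URL rule):

```
Original text:
Check out https://example.com, it's awesome! #NLP @user

Tokens:
['Check', 'out', 'https://example.com,', "it's", 'awesome', '!', '#NLP', '@user']
```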
3. NLTK's word_tokenize
NLTK's `word_tokenize`, which relies on the pre-trained “Punkt” models, handles many edge cases (abbreviations, contractions) out of the box:
```python
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # only needed once

text = "Dr. Smith isn't here. He'll come soon."
tokens = word_tokenize(text)
print(tokens)
```
Output:
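For this sample sentence, `word_tokenize` typically returns something like the list below: the abbreviation “Dr.” stays intact, and the contractions are split into their clitic parts (is + n't, He + 'll):

```
['Dr.', 'Smith', 'is', "n't", 'here', '.', 'He', "'ll", 'come', 'soon', '.']
```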
4. Custom Rule-Based Tokenizer
For domain-specific needs, you can build rule cascades. Example: merge multi-word keyphrases into single tokens before applying the simple regex tokenizer from Section 1:
```python
# custom_rule_based_tokenizer_demo.py
import re

# 1. Define your domain-specific multi-word keyphrases
KEYPHRASES = {
    "natural language processing",
    "machine learning",
}

def rule_based_tokenize(text):
    """
    Tokenize text by:
      1. Lowercasing and merging any defined multi-word keyphrases with underscores
      2. Splitting on word characters vs non-word/whitespace (simple regex)
      3. Restoring spaces in merged keyphrases
    """
    # 1. Lowercase for consistent keyphrase detection
    temp = text.lower()

    # 2. Merge each keyphrase into a single token using underscores
    for phrase in KEYPHRASES:
        merged = phrase.replace(" ", "_")
        temp = temp.replace(phrase, merged)

    # 3. Tokenise: sequences of \w+ or any non-word non-space character
    tokens = re.findall(r"\w+|[^\w\s]", temp)

    # 4. Restore spaces in keyphrases and return
    return [token.replace("_", " ") for token in tokens]

if __name__ == "__main__":
    # Sample text
    text = "Natural Language Processing and Machine Learning are fun!"

    # Tokenise
    tokens = rule_based_tokenize(text)

    # Display results
    print("Original text:")
    print(text)
    print("\nTokens:")
    print(tokens)
```
Output:
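The script should print roughly the following; the keyphrase tokens come out lower-cased because step 1 lower-cases the whole string before matching:

```
Original text:
Natural Language Processing and Machine Learning are fun!

Tokens:
['natural language processing', 'and', 'machine learning', 'are', 'fun', '!']
```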
Next: Continue to 1.3 Finite State Automata