# Function: tokenize_text()

## Purpose

Converts a cleaned text string into individual word tokens using NLTK's word tokenization.
## Syntax

```python
tokenize_text(text)
```
## Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `text` | str | Required | Cleaned text string to be tokenized |
## Returns

- Type: `list`
- Content: list of individual word tokens extracted from the input text
## Required Import

```python
from nltk.tokenize import word_tokenize
```
## Functionality

- Uses NLTK's `word_tokenize()` function for intelligent text splitting (see the sketch below)
- Handles punctuation and contractions appropriately
- More sophisticated than the basic `.split()` method
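The wiki doesn't reproduce the function body; a minimal sketch, assuming `tokenize_text()` simply delegates to `word_tokenize()` as described above (the actual implementation in the repo may differ):

```python
from nltk.tokenize import word_tokenize

def tokenize_text(text):
    """Split a cleaned text string into individual word tokens."""
    return word_tokenize(text)
```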
## Advantages over Basic Splitting

- **Punctuation handling:** properly separates words from attached punctuation
- **Contraction processing:** handles contractions like "don't" → ["do", "n't"]
- **Language-aware:** uses linguistic rules for better word boundary detection
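To make the contrast concrete, here is a quick comparison on a hypothetical sentence; the `word_tokenize()` output is what NLTK's default Treebank-style tokenizer produces:

```python
from nltk.tokenize import word_tokenize

sentence = "Don't stop now, it's great!"

# Basic splitting leaves punctuation attached to words
print(sentence.split())
# ["Don't", 'stop', 'now,', "it's", 'great!']

# NLTK separates punctuation and splits contractions
print(word_tokenize(sentence))
# ['Do', "n't", 'stop', 'now', ',', 'it', "'s", 'great', '!']
```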
## Usage Examples

```python
# Basic tokenization
cleaned_text = "this product is amazing i love it"
tokens = tokenize_text(cleaned_text)
# Result: ["this", "product", "is", "amazing", "i", "love", "it"]

# With contractions (after cleaning preserves apostrophes)
text_with_contractions = "i don't think it's worth it"
tokens = tokenize_text(text_with_contractions)
# Result: ["i", "do", "n't", "think", "it", "'s", "worth", "it"]

# Apply to a DataFrame column
df['review_tokens'] = df['clean_review_body'].apply(tokenize_text)
```
## Prerequisites

- Requires the NLTK library to be installed
- Text should be pre-cleaned using the `clean_text()` function for optimal results
- Download the required NLTK data: `nltk.download('punkt')` (see the setup sketch below)
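A one-time setup sketch that avoids re-downloading when the tokenizer models are already present, assuming the `punkt` resource named above:

```python
import nltk

# Fetch the Punkt tokenizer models only if they aren't installed yet
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')
```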