[function] tokenize_text() - P3chys/textmining GitHub Wiki

Function: tokenize_text()

Purpose

Converts a cleaned text string into individual word tokens using NLTK's word tokenizer.

Syntax

tokenize_text(text)

Parameters

Parameter   Type   Default    Description
text        str    Required   Cleaned text string to be tokenized

Returns

  • Type: list
  • Content: List of individual word tokens extracted from the input text

Required Import

from nltk.tokenize import word_tokenize

Functionality

  • Uses NLTK's word_tokenize() function for intelligent text splitting
  • Handles punctuation and contractions appropriately
  • More sophisticated than the basic str.split() method (see the comparison below)
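
The wiki does not show the function body, but given the behavior described above it is most likely a thin wrapper around NLTK's tokenizer. A minimal sketch, assuming exactly that:

from nltk.tokenize import word_tokenize

def tokenize_text(text):
    # Delegate to NLTK's language-aware word tokenizer, which splits on
    # linguistic word boundaries rather than just whitespace
    return word_tokenize(text)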

Advantages over Basic Splitting

  • Punctuation handling: Properly separates words from attached punctuation
  • Contraction processing: Handles contractions like "don't" → ["do", "n't"]
  • Language-aware: Uses linguistic rules for better word boundary detection
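
To make these advantages concrete, compare plain str.split() with word_tokenize() on the same punctuated sentence:

from nltk.tokenize import word_tokenize

sentence = "Great value, don't hesitate!"

sentence.split()
# ['Great', 'value,', "don't", 'hesitate!']  (punctuation stays attached)

word_tokenize(sentence)
# ['Great', 'value', ',', 'do', "n't", 'hesitate', '!']  (punctuation and contraction separated)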

Usage Examples

# Basic tokenization
cleaned_text = "this product is amazing i love it"
tokens = tokenize_text(cleaned_text)
# Result: ["this", "product", "is", "amazing", "i", "love", "it"]

# With contractions (assuming the cleaning step preserves apostrophes)
text_with_contractions = "i don't think it's worth it"
tokens = tokenize_text(text_with_contractions)
# Result: ["i", "do", "n't", "think", "it", "'s", "worth", "it"]

# Apply to a pandas DataFrame column
df['review_tokens'] = df['clean_review_body'].apply(tokenize_text)

Prerequisites

  • Requires the NLTK library (pip install nltk)
  • Text should be pre-cleaned with the clean_text() function for optimal results
  • Download the required NLTK tokenizer data: nltk.download('punkt') (see the setup note below)
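
For reference, a one-time setup sketch covering both prerequisites; note that recent NLTK releases (3.9 and later) ship the tokenizer models under the name 'punkt_tab' rather than 'punkt':

# pip install nltk
import nltk

nltk.download('punkt')        # tokenizer models used by word_tokenize()
# nltk.download('punkt_tab')  # use this instead on NLTK 3.9+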