[function] extract_ngrams() - P3chys/textmining GitHub Wiki

Technical Documentation - N-grams Extraction Function

Function: extract_ngrams()

Purpose

Extracts n-grams (sequences of n consecutive tokens) from processed text tokens to create additional features for text analysis and pattern recognition.

Syntax

extract_ngrams(df, token_column='processed_text', n_values=[1, 2, 3])

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| df | pandas.DataFrame | Required | DataFrame containing processed text tokens |
| token_column | str | 'processed_text' | Column name containing the tokenized text (list of tokens) |
| n_values | list of int | [1, 2, 3] | N-gram sizes to generate (e.g., [1, 2, 3] creates unigrams, bigrams, and trigrams) |

Returns

  • Type: pandas.DataFrame
  • Content: Copy of input DataFrame with additional n-gram columns

Output Column Structure

For each n-value in n_values, the function creates two columns:

  1. {n}grams - Contains tuples of n consecutive tokens
  2. {n}grams_string - Contains space-separated string versions of the n-grams

Example columns created with default n_values=[1,2,3]:

  • 1grams, 1grams_string (unigrams)
  • 2grams, 2grams_string (bigrams)
  • 3grams, 3grams_string (trigrams)

Algorithm Details

  1. DataFrame Copying: Creates a copy of input DataFrame to avoid modifying original
  2. N-gram Generation: For each n-value:
    • Builds overlapping subsequences by zipping shifted slices: zip(*[tokens[i:] for i in range(n)])
    • Handles the edge case where the token list has fewer than n tokens (returns an empty list)
  3. String Conversion: Converts tuple n-grams to readable strings by joining tokens with spaces
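The three steps above can be sketched as follows. This is a minimal reconstruction from the documented behavior, not necessarily the repository's exact implementation; the inner helper name `ngrams` is illustrative:

```python
import pandas as pd

def extract_ngrams(df, token_column='processed_text', n_values=[1, 2, 3]):
    """Add {n}grams and {n}grams_string columns for each n in n_values."""
    result = df.copy()  # step 1: avoid mutating the caller's DataFrame

    def ngrams(tokens, n):
        # step 2: zip shifted slices; an empty or None token list, or
        # len(tokens) < n, naturally yields an empty list
        if not tokens:
            return []
        return list(zip(*[tokens[i:] for i in range(n)]))

    for n in n_values:
        result[f'{n}grams'] = result[token_column].apply(lambda t: ngrams(t, n))
        # step 3: join each tuple into a readable space-separated string
        result[f'{n}grams_string'] = result[f'{n}grams'].apply(
            lambda grams: [' '.join(g) for g in grams]
        )
    return result
```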

Edge Cases

  • Insufficient tokens: If token list has fewer than n tokens, returns empty list for that n-gram size
  • Empty token lists: Gracefully handles empty or None token lists
  • Single token: For n=1, creates unigrams from individual tokens
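The insufficient-token case falls out of the zip construction itself: one of the shifted slices is empty, so zip() produces nothing. A quick illustration:

```python
tokens = ['only', 'two']

# tokens[2:] is empty, so zip() stops immediately and no trigram is produced
trigrams = list(zip(*[tokens[i:] for i in range(3)]))
print(trigrams)  # []

# n=1 degenerates to one-element tuples, one per token
unigrams = list(zip(*[tokens[i:] for i in range(1)]))
print(unigrams)  # [('only',), ('two',)]
```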

Example Usage

# Example DataFrame with tokenized text
import pandas as pd

df = pd.DataFrame({
    'processed_text': [
        ['great', 'product', 'love', 'it'],
        ['not', 'good', 'quality'],
        ['amazing', 'value', 'for', 'money']
    ]
})

# Extract unigrams and bigrams
result = extract_ngrams(df, n_values=[1, 2])

# Resulting columns for the first row (['great', 'product', 'love', 'it']):
# - 1grams: [('great',), ('product',), ('love',), ('it',)]
# - 1grams_string: ['great', 'product', 'love', 'it']
# - 2grams: [('great', 'product'), ('product', 'love'), ('love', 'it')]
# - 2grams_string: ['great product', 'product love', 'love it']

Performance Considerations

  • Memory Usage: Creates multiple new columns, increasing DataFrame size
  • Time Complexity: O(m × n × t) where m=rows, n=max n-gram size, t=average tokens per row
  • Large Datasets: Consider processing in chunks for very large datasets
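One way to follow the chunking advice is a generic wrapper that applies the extraction to row slices; `process_in_chunks` is a hypothetical helper name, not part of the library:

```python
import pandas as pd

def process_in_chunks(df, func, chunk_size=50_000, **kwargs):
    # Apply func (e.g., extract_ngrams) to row slices and recombine.
    # For truly large data, write each piece to disk instead of concatenating.
    pieces = [func(df.iloc[start:start + chunk_size], **kwargs)
              for start in range(0, len(df), chunk_size)]
    return pd.concat(pieces)
```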

Integration Notes

  • Input Requirements: Expects token_column to contain lists/arrays of tokens
  • Preprocessing: Should be used after tokenization (e.g., with tokenize_text() function)
  • Downstream Analysis: Enables frequency analysis, pattern detection, and feature engineering
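As a sketch of the downstream frequency analysis mentioned above, the `{n}grams_string` columns can be flattened into a corpus-wide counter (the sample data here is illustrative, assuming a `2grams_string` column produced by extract_ngrams()):

```python
from collections import Counter
import pandas as pd

# Hypothetical output of extract_ngrams() restricted to the bigram column
df = pd.DataFrame({'2grams_string': [
    ['great product', 'product love'],
    ['great product', 'not good'],
]})

# Flatten all rows' bigram lists and count occurrences across the corpus
counts = Counter(g for row in df['2grams_string'] for g in row)
print(counts.most_common(1))  # [('great product', 2)]
```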