[function] extract_ngrams() - P3chys/textmining GitHub Wiki

Technical Documentation - N-grams Extraction Function

Function: extract_ngrams()

Purpose

Extracts n-grams (sequences of n consecutive tokens) from processed text tokens to create additional features for text analysis and pattern recognition.

Syntax

extract_ngrams(df, token_column='processed_text', n_values=[1, 2, 3])

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| df | pandas.DataFrame | Required | DataFrame containing processed text tokens |
| token_column | str | 'processed_text' | Column name containing the tokenized text (list of tokens) |
| n_values | list of int | [1, 2, 3] | N-gram sizes to generate (e.g., [1, 2, 3] creates unigrams, bigrams, and trigrams) |

Returns

  • Type: pandas.DataFrame
  • Content: Copy of input DataFrame with additional n-gram columns

Output Column Structure

For each n-value in n_values, the function creates two columns:

  1. {n}grams - Contains tuples of n consecutive tokens
  2. {n}grams_string - Contains space-separated string versions of the n-grams

Example columns created with default n_values=[1,2,3]:

  • 1grams, 1grams_string (unigrams)
  • 2grams, 2grams_string (bigrams)
  • 3grams, 3grams_string (trigrams)

Algorithm Details

  1. DataFrame Copying: Creates a copy of input DataFrame to avoid modifying original
  2. N-gram Generation: For each n-value:
    • Builds overlapping subsequences by zipping shifted slices: zip(*[tokens[i:] for i in range(n)])
    • Handles the edge case where the token list has fewer than n tokens (returns an empty list)
  3. String Conversion: Converts tuple n-grams to readable strings by joining tokens with spaces
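The three steps above can be sketched as follows. This is a minimal reconstruction from the documented behavior, not necessarily the repository's exact implementation; the inner helper name `ngrams` is illustrative:

```python
import pandas as pd

def extract_ngrams(df, token_column='processed_text', n_values=[1, 2, 3]):
    """Add {n}grams and {n}grams_string columns for each n in n_values."""
    result = df.copy()  # step 1: avoid mutating the caller's DataFrame

    def ngrams(tokens, n):
        # step 2: zip shifted slices; an empty or None token list, or
        # len(tokens) < n, naturally yields an empty list
        if not tokens:
            return []
        return list(zip(*[tokens[i:] for i in range(n)]))

    for n in n_values:
        result[f'{n}grams'] = result[token_column].apply(lambda t: ngrams(t, n))
        # step 3: join each tuple into a readable space-separated string
        result[f'{n}grams_string'] = result[f'{n}grams'].apply(
            lambda grams: [' '.join(g) for g in grams]
        )
    return result
```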

Edge Cases

  • Insufficient tokens: If token list has fewer than n tokens, returns empty list for that n-gram size
  • Empty token lists: Gracefully handles empty or None token lists
  • Single token: For n=1, creates unigrams from individual tokens
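The insufficient-token case falls out of the zip construction itself: one of the shifted slices is empty, so zip() produces nothing. A quick illustration:

```python
tokens = ['only', 'two']

# tokens[2:] is empty, so zip() stops immediately and no trigram is produced
trigrams = list(zip(*[tokens[i:] for i in range(3)]))
print(trigrams)  # []

# n=1 degenerates to one-element tuples, one per token
unigrams = list(zip(*[tokens[i:] for i in range(1)]))
print(unigrams)  # [('only',), ('two',)]
```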

Example Usage

# Example DataFrame with tokenized text
import pandas as pd

df = pd.DataFrame({
    'processed_text': [
        ['great', 'product', 'love', 'it'],
        ['not', 'good', 'quality'],
        ['amazing', 'value', 'for', 'money']
    ]
})

# Extract unigrams and bigrams
result = extract_ngrams(df, n_values=[1, 2])

# Resulting columns for the first row (['great', 'product', 'love', 'it']):
# - 1grams: [('great',), ('product',), ('love',), ('it',)]
# - 1grams_string: ['great', 'product', 'love', 'it']
# - 2grams: [('great', 'product'), ('product', 'love'), ('love', 'it')]
# - 2grams_string: ['great product', 'product love', 'love it']

Performance Considerations

  • Memory Usage: Creates multiple new columns, increasing DataFrame size
  • Time Complexity: O(m × n × t) where m=rows, n=max n-gram size, t=average tokens per row
  • Large Datasets: Consider processing in chunks for very large datasets
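One way to follow the chunking advice is a generic wrapper that applies the extraction to row slices; `process_in_chunks` is a hypothetical helper name, not part of the library:

```python
import pandas as pd

def process_in_chunks(df, func, chunk_size=50_000, **kwargs):
    # Apply func (e.g., extract_ngrams) to row slices and recombine.
    # For truly large data, write each piece to disk instead of concatenating.
    pieces = [func(df.iloc[start:start + chunk_size], **kwargs)
              for start in range(0, len(df), chunk_size)]
    return pd.concat(pieces)
```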

Integration Notes

  • Input Requirements: Expects token_column to contain lists/arrays of tokens
  • Preprocessing: Should be used after tokenization (e.g., with tokenize_text() function)
  • Downstream Analysis: Enables frequency analysis, pattern detection, and feature engineering
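As a sketch of the downstream frequency analysis mentioned above, the `{n}grams_string` columns can be flattened into a corpus-wide counter (the sample data here is illustrative, assuming a `2grams_string` column produced by extract_ngrams()):

```python
from collections import Counter
import pandas as pd

# Hypothetical output of extract_ngrams() restricted to the bigram column
df = pd.DataFrame({'2grams_string': [
    ['great product', 'product love'],
    ['great product', 'not good'],
]})

# Flatten all rows' bigram lists and count occurrences across the corpus
counts = Counter(g for row in df['2grams_string'] for g in row)
print(counts.most_common(1))  # [('great product', 2)]
```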