[function] extract_ngrams() - P3chys/textmining GitHub Wiki
Technical Documentation - N-grams Extraction Function
extract_ngrams()
Purpose
Extracts n-grams (sequences of n consecutive tokens) from processed text tokens to create additional features for text analysis and pattern recognition.
Syntax
extract_ngrams(df, token_column='processed_text', n_values=[1, 2, 3])
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `df` | pandas.DataFrame | Required | DataFrame containing processed text tokens |
| `token_column` | str | `'processed_text'` | Column name containing the tokenized text (list of tokens) |
| `n_values` | list of int | `[1, 2, 3]` | List of n-gram sizes to generate (e.g., `[1, 2, 3]` creates unigrams, bigrams, and trigrams) |
Returns
- Type: pandas.DataFrame
- Content: Copy of input DataFrame with additional n-gram columns
Output Column Structure
For each n-value in `n_values`, the function creates two columns:
- `{n}grams` - contains tuples of n consecutive tokens
- `{n}grams_string` - contains space-separated string versions of the n-grams

Example columns created with the default `n_values=[1, 2, 3]`:
- `1grams`, `1grams_string` (unigrams)
- `2grams`, `2grams_string` (bigrams)
- `3grams`, `3grams_string` (trigrams)
Algorithm Details
- DataFrame Copying: Creates a copy of the input DataFrame to avoid modifying the original
- N-gram Generation: For each n-value, uses `zip(*[tokens[i:] for i in range(n)])` to create overlapping subsequences, and handles the edge case where the token list has fewer than n tokens (returns an empty list)
- String Conversion: Converts tuple n-grams to readable strings by joining tokens with spaces
Edge Cases
- Insufficient tokens: If token list has fewer than n tokens, returns empty list for that n-gram size
- Empty token lists: Gracefully handles empty or None token lists
- Single token: For n=1, creates unigrams from individual tokens
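The first two edge cases can be illustrated with the same `zip`-based generation (a standalone snippet, not library code):

```python
n = 3
for tokens in (['only', 'two'], [], None):
    safe = tokens or []  # treat None as an empty token list
    # zip over offset slices naturally yields nothing when len(safe) < n
    ngrams = list(zip(*[safe[i:] for i in range(n)]))
    print(tokens, '->', ngrams)
# ['only', 'two'] -> []
# [] -> []
# None -> []
```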
Example Usage
```python
import pandas as pd

# Example DataFrame with tokenized text
df = pd.DataFrame({
    'processed_text': [
        ['great', 'product', 'love', 'it'],
        ['not', 'good', 'quality'],
        ['amazing', 'value', 'for', 'money']
    ]
})

# Extract unigrams and bigrams
result = extract_ngrams(df, n_values=[1, 2])

# For the first row, the result will have columns:
# - 1grams: [('great',), ('product',), ('love',), ('it',)]
# - 1grams_string: ['great', 'product', 'love', 'it']
# - 2grams: [('great', 'product'), ('product', 'love'), ('love', 'it')]
# - 2grams_string: ['great product', 'product love', 'love it']
```
Performance Considerations
- Memory Usage: Creates multiple new columns, increasing DataFrame size
- Time Complexity: O(m × n × t) where m=rows, n=max n-gram size, t=average tokens per row
- Large Datasets: Consider processing in chunks for very large datasets
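For very large datasets, chunked processing might look like this. `apply_in_chunks` is a hypothetical helper sketched here for illustration, not part of the library:

```python
import pandas as pd

def apply_in_chunks(df, func, chunk_size=10_000, **kwargs):
    """Apply `func` to row-wise chunks of `df` and concatenate the results."""
    pieces = (func(df.iloc[i:i + chunk_size], **kwargs)
              for i in range(0, len(df), chunk_size))
    return pd.concat(pieces)

# Usage (assuming extract_ngrams is in scope):
# result = apply_in_chunks(df, extract_ngrams, chunk_size=50_000, n_values=[1, 2])
```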
Integration Notes
- Input Requirements: Expects `token_column` to contain lists/arrays of tokens
- Preprocessing: Should be used after tokenization (e.g., with the `tokenize_text()` function)
- Downstream Analysis: Enables frequency analysis, pattern detection, and feature engineering
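As an illustration of the downstream-analysis point, corpus-wide bigram frequencies can be counted from the `2grams_string` column. The column values below are assumed example output, not real data:

```python
from collections import Counter
import pandas as pd

# Assumed output of extract_ngrams(df, n_values=[2]) on a tiny corpus
df = pd.DataFrame({'2grams_string': [['great product', 'product love'],
                                     ['great product']]})

# Flatten the per-row bigram lists and count occurrences across all rows
counts = Counter(bigram for row in df['2grams_string'] for bigram in row)
print(counts.most_common(1))  # [('great product', 2)]
```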