[function] create_bow_features() - P3chys/textmining GitHub Wiki
Overview
Comprehensive feature extraction pipeline for text mining, implementing multiple representation methods including Bag of Words, TF-IDF, word embeddings, and dimensionality reduction techniques.
1. Bag of Words and TF-IDF Functions
create_bow_features()
Function: Purpose
Creates Bag of Words (BoW) feature representation from text data.
Syntax
create_bow_features(df, text_column='processed_text_string', max_features=5000,
ngram_range=(1, 1), min_df=2, binary=False)
Parameters
Parameter | Type | Default | Description |
---|---|---|---|
df | pandas.DataFrame | Required | DataFrame with preprocessed text |
text_column | str | 'processed_text_string' | Column containing cleaned text |
max_features | int | 5000 | Maximum vocabulary size |
ngram_range | tuple | (1, 1) | N-gram range (min_n, max_n) |
min_df | int | 2 | Minimum document frequency |
binary | bool | False | Whether to use binary (presence/absence) features |
Returns
- Tuple: (feature_matrix, feature_names, vectorizer)
  - feature_matrix: Sparse matrix of shape (n_samples, n_features)
  - feature_names: Array of feature names (vocabulary)
  - vectorizer: Fitted CountVectorizer object