[function] create_bow_features() - P3chys/textmining GitHub Wiki

Overview

Feature extraction pipeline for text mining. It covers several ways of turning text into numeric features: Bag of Words, TF-IDF, and word embeddings, plus dimensionality reduction techniques applied to the resulting features.


1. Bag of Words and TF-IDF Functions

Function: create_bow_features()

Purpose

Creates Bag of Words (BoW) feature representation from text data.

Syntax

create_bow_features(df, text_column='processed_text_string', max_features=5000,
                   ngram_range=(1, 1), min_df=2, binary=False)

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| df | pandas.DataFrame | Required | DataFrame with preprocessed text |
| text_column | str | 'processed_text_string' | Column containing cleaned text |
| max_features | int | 5000 | Maximum vocabulary size |
| ngram_range | tuple | (1, 1) | N-gram range (min_n, max_n) |
| min_df | int | 2 | Minimum document frequency |
| binary | bool | False | Whether to use binary (presence/absence) features |

Returns

  • Tuple: (feature_matrix, feature_names, vectorizer)
  • feature_matrix: Sparse matrix of shape (n_samples, n_features)
  • feature_names: Array of feature names (vocabulary)
  • vectorizer: Fitted CountVectorizer object