[function] extract_features() - P3chys/textmining GitHub Wiki

Function: extract_features()

Purpose

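Main entry point for feature extraction. It runs every step of the pipeline on the input DataFrame and returns the resulting feature matrices together with the fitted models and vectorizers.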

Syntax

extract_features(df, output_prefix='amazon_reviews', save_files=True)

Pipeline Steps

  1. Bag of Words: Creates basic word frequency features
  2. TF-IDF: Creates term frequency features weighted by inverse document frequency
  3. LSA: Reduces the dimensionality of the TF-IDF matrix to produce latent semantic features
  4. Word2Vec: Trains word embeddings and creates document vectors from them
  5. Bigrams: Creates TF-IDF-weighted 2-gram features to capture two-word phrases
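
The scikit-learn based steps above (1, 2, 3 and 5) can be sketched roughly as follows. This is a minimal illustration and not the wiki's actual implementation: the helper name sketch_sparse_features, the toy corpus, the hyperparameters (such as n_components=2), and every metadata key other than 'tfidf_vectorizer' are assumptions made for the example. Word2Vec is left out because it depends on gensim.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


def sketch_sparse_features(texts, n_components=2):
    """Hypothetical helper: builds BoW, TF-IDF, LSA and bigram features."""
    # Step 1: Bag of Words, i.e. raw term counts per document
    bow_vectorizer = CountVectorizer()
    bow = bow_vectorizer.fit_transform(texts)

    # Step 2: TF-IDF weighted unigram features
    tfidf_vectorizer = TfidfVectorizer()
    tfidf = tfidf_vectorizer.fit_transform(texts)

    # Step 3: LSA, i.e. truncated SVD applied to the TF-IDF matrix
    svd = TruncatedSVD(n_components=n_components, random_state=0)
    lsa = svd.fit_transform(tfidf)

    # Step 5: TF-IDF computed over 2-grams only
    bigram_vectorizer = TfidfVectorizer(ngram_range=(2, 2))
    bigrams = bigram_vectorizer.fit_transform(texts)

    features = {"bow": bow, "tfidf": tfidf, "lsa": lsa, "bigrams": bigrams}
    # Only 'tfidf_vectorizer' is a key shown on this page; the rest are assumed.
    metadata = {
        "bow_vectorizer": bow_vectorizer,
        "tfidf_vectorizer": tfidf_vectorizer,
        "svd_model": svd,
        "bigram_vectorizer": bigram_vectorizer,
    }
    return features, metadata


toy_texts = [
    "great battery life",
    "battery died after a week",
    "great screen and great sound",
]
features, metadata = sketch_sparse_features(toy_texts)
for name, matrix in features.items():
    print(name, matrix.shape)
```

Note that the default scikit-learn tokenizer drops single-character tokens such as "a", so the vocabulary can be smaller than a naive word count suggests.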

Returns

  • Tuple: (features_dict, metadata_dict)
  • features_dict: Maps each feature type to its feature matrix
  • metadata_dict: Contains the fitted models and vectorizers (for example 'tfidf_vectorizer' and 'w2v_model') and the feature names

Feature Dictionary Keys

  • 'bow': Bag of Words features
  • 'tfidf': TF-IDF features
  • 'lsa': LSA-reduced features
  • 'word2vec': Word2Vec document vectors
  • 'bigrams': Bigram TF-IDF features
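
This page does not say how the 'word2vec' document vectors are derived from the word embeddings. One common approach is to average the vectors of the words in each document, sketched below. The function name average_document_vector and the toy two-dimensional vectors are hypothetical, and whether extract_features() uses averaging is an assumption.

```python
import numpy as np


def average_document_vector(tokens, word_vectors, vector_size):
    """Mean of the vectors of known tokens; a zero vector if none are known."""
    # word_vectors can be any mapping from token to vector, such as a dict
    known = [word_vectors[token] for token in tokens if token in word_vectors]
    if not known:
        return np.zeros(vector_size)
    return np.mean(known, axis=0)


# Toy embeddings standing in for a trained model's word vectors
toy_vectors = {
    "great": np.array([1.0, 0.0]),
    "battery": np.array([0.0, 1.0]),
}

doc_vec = average_document_vector(["great", "battery", "unknown"], toy_vectors, 2)
empty_vec = average_document_vector(["unknown"], toy_vectors, 2)
print(doc_vec, empty_vec)
```

Out-of-vocabulary tokens are skipped, and a document with no known tokens falls back to a zero vector so that every row keeps the same width.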

Dependencies

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler
from gensim.models import Word2Vec
import numpy as np
import pickle
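
The pickle import together with the save_files and output_prefix parameters suggests that results are written to disk under a common prefix. The sketch below shows one way that could work. The file naming scheme (prefix, underscore, name, ".pkl") and the helpers save_with_prefix and load_pickle are assumptions, not documented behavior.

```python
import os
import pickle
import tempfile


def save_with_prefix(obj, output_prefix, name, directory="."):
    """Hypothetical helper: pickles obj to <directory>/<prefix>_<name>.pkl."""
    path = os.path.join(directory, f"{output_prefix}_{name}.pkl")
    with open(path, "wb") as fh:
        pickle.dump(obj, fh)
    return path


def load_pickle(path):
    """Loads a previously pickled object. Only use on files you trust."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Round trip through a temporary directory so nothing is left behind
with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = save_with_prefix(
        {"lsa": [[0.1, 0.2]]}, "amazon_reviews", "features", tmp_dir
    )
    restored = load_pickle(saved_path)

print(os.path.basename(saved_path), restored)
```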

Usage Example

# Extract all features
features, metadata = extract_features(df_processed)

# Access specific feature types
bow_matrix = features['bow']
tfidf_matrix = features['tfidf']
word2vec_vectors = features['word2vec']

# Access models and vectorizers
tfidf_vectorizer = metadata['tfidf_vectorizer']
w2v_model = metadata['w2v_model']
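
A fitted vectorizer taken from the metadata can be reused to transform unseen text into the same feature space, without refitting. The sketch below fits a small TfidfVectorizer as a stand-in for metadata['tfidf_vectorizer']; the training and new texts are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Stand-in for metadata['tfidf_vectorizer'], fitted here on toy data
train_texts = ["great battery life", "battery died after a week"]
tfidf_vectorizer = TfidfVectorizer().fit(train_texts)

# transform() reuses the fitted vocabulary and IDF weights;
# words never seen during fitting are silently ignored
new_texts = ["great battery and unknownword"]
new_matrix = tfidf_vectorizer.transform(new_texts)

print(new_matrix.shape)
```

Calling transform() rather than fit_transform() keeps the columns of new_matrix aligned with the columns of the original TF-IDF matrix.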