[function] extract_features() - P3chys/textmining GitHub Wiki
Function: extract_features()
Purpose
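Main entry point for feature extraction. It runs every step of the pipeline on the input DataFrame and returns the resulting feature matrices together with the fitted models and vectorizers.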
Syntax
extract_features(df, output_prefix='amazon_reviews', save_files=True)
Pipeline Steps
- Bag of Words: Creates basic word frequency features
- TF-IDF: Creates weighted term frequency features
- LSA: Reduces TF-IDF dimensions for semantic features
- Word2Vec: Trains embeddings and creates document vectors
- Bigrams: Creates 2-gram TF-IDF features for phrase detection
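The vectorizer-based steps above can be sketched with scikit-learn as follows. This is a minimal illustration, not the repository's implementation: the toy corpus, the variable names, and `n_components=2` are assumptions chosen for the example, and the Word2Vec step is left out.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Toy corpus standing in for the preprocessed review text (hypothetical data).
docs = [
    "great phone battery lasts long",
    "battery died fast poor phone",
    "great camera great screen",
    "poor screen poor camera quality",
]

# Bag of Words: raw term counts per document.
bow_vectorizer = CountVectorizer()
bow = bow_vectorizer.fit_transform(docs)

# TF-IDF: term counts reweighted by inverse document frequency.
tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(docs)

# LSA: truncated SVD applied to the TF-IDF matrix.
# n_components=2 is an illustrative value only.
lsa_model = TruncatedSVD(n_components=2, random_state=0)
lsa = lsa_model.fit_transform(tfidf)

# Bigrams: TF-IDF restricted to 2-grams.
bigram_vectorizer = TfidfVectorizer(ngram_range=(2, 2))
bigrams = bigram_vectorizer.fit_transform(docs)

# Collect the matrices under the same keys the feature dictionary uses.
features = {"bow": bow, "tfidf": tfidf, "lsa": lsa, "bigrams": bigrams}
for name, matrix in features.items():
    print(name, matrix.shape)
```

Each matrix has one row per document. The Bag of Words, TF-IDF, and bigram matrices are sparse, while the LSA output is a dense NumPy array.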
Returns
- Tuple: (features_dict, metadata_dict)
- features_dict: Contains all feature matrices
- metadata_dict: Contains models, vectorizers, and feature names
Feature Dictionary Keys
- 'bow': Bag of Words features
- 'tfidf': TF-IDF features
- 'lsa': LSA-reduced features
- 'word2vec': Word2Vec document vectors
- 'bigrams': Bigram TF-IDF features
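The page does not say how the `'word2vec'` document vectors are built from word embeddings. One common approach, assumed here purely for illustration, is to average the vectors of the tokens in each document. The sketch below shows only that averaging step; the hand-written `embeddings` table and the `document_vector` helper are hypothetical stand-ins for a trained model's word vectors.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for a trained model's word vectors.
embeddings = {
    "great": np.array([1.0, 0.0, 0.0]),
    "phone": np.array([0.0, 1.0, 0.0]),
    "poor": np.array([-1.0, 0.0, 0.0]),
}


def document_vector(tokens, embeddings, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


# Tokenized toy documents; "unknownword" is deliberately out of vocabulary.
docs = [["great", "phone"], ["poor", "phone", "unknownword"], ["unknownword"]]

# Stack the per-document averages into one matrix with a row per document.
doc_vectors = np.vstack([document_vector(d, embeddings, dim=3) for d in docs])
print(doc_vectors.shape)
```

Out-of-vocabulary tokens are skipped, and a document with no known tokens maps to the zero vector, so every document still gets a row of the same width.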
Dependencies
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler
from gensim.models import Word2Vec
import numpy as np
import pickle
Usage Example
# Extract all features
features, metadata = extract_features(df_processed)
# Access specific feature types
bow_matrix = features['bow']
tfidf_matrix = features['tfidf']
word2vec_vectors = features['word2vec']
# Access models and vectorizers
tfidf_vectorizer = metadata['tfidf_vectorizer']
w2v_model = metadata['w2v_model']
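The `save_files=True` default and the `pickle` import suggest that results are written to disk under `output_prefix`, but the page does not document the file names or what is saved. The sketch below only illustrates a pickle round trip for a metadata-style dictionary; the `_metadata.pkl` suffix and the dictionary contents are assumptions.

```python
import os
import pickle
import tempfile

# Stand-in for the metadata dictionary; real entries would be fitted models and vectorizers.
metadata = {"feature_names": ["battery", "phone"], "n_components": 2}

output_prefix = "amazon_reviews"
out_dir = tempfile.mkdtemp()

# Hypothetical file name; the actual naming scheme is not documented here.
path = os.path.join(out_dir, f"{output_prefix}_metadata.pkl")

# Write the dictionary to disk.
with open(path, "wb") as fh:
    pickle.dump(metadata, fh)

# Read it back and confirm the round trip preserved the contents.
with open(path, "rb") as fh:
    restored = pickle.load(fh)

print(restored == metadata)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.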