[function] analyze_feature_importance() - P3chys/textmining GitHub Wiki

Function: analyze_feature_importance()

Purpose

Analyzes and displays the most important features from trained machine learning models, helping to understand which words or phrases most strongly influence sentiment classification.

Syntax

analyze_feature_importance(model, vectorizer, top_n=20)

Parameters

  • model (sklearn model, required): Trained machine learning model (linear or tree-based)
  • vectorizer (sklearn vectorizer, required): Fitted vectorizer (CountVectorizer or TfidfVectorizer)
  • top_n (int, default 20): Number of top features to display
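
A minimal usage sketch with hypothetical training data. Only the call signature comes from this page; the corpus, labels, and model choice below are illustrative, and the import of analyze_feature_importance from the project's own module is omitted because its path is not documented here.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus; replace with the project's review data.
texts = ["excellent product, works great", "terrible quality, very disappointing",
         "amazing value", "awful experience"]
labels = [1, 0, 1, 0]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

model = LogisticRegression()
model.fit(X, labels)

# analyze_feature_importance is the function documented on this page.
# Show the 10 strongest features instead of the default 20.
analyze_feature_importance(model, vectorizer, top_n=10)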

Supported Model Types

Linear Models (with coef_ attribute)

  • LogisticRegression
  • LinearSVC
  • SGDClassifier
  • RidgeClassifier (and other linear estimators exposing a coef_ attribute)

For Binary Classification:

  • Displays top positive and negative features
  • Positive coefficients indicate words associated with positive sentiment
  • Negative coefficients indicate words associated with negative sentiment
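
A sketch of how the binary case can be read off a fitted linear model (variable names are illustrative, not taken from the repository):

import numpy as np

top_n = 20
feature_names = vectorizer.get_feature_names_out()
coefs = model.coef_[0]  # binary classification: a single row of coefficients

order = np.argsort(coefs)  # ascending: most negative first, most positive last
top_positive = [(feature_names[i], coefs[i]) for i in order[-top_n:][::-1]]
top_negative = [(feature_names[i], coefs[i]) for i in order[:top_n]]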

For Multi-class Classification:

  • Calculates feature importance as the difference between the positive-class and negative-class coefficients
  • Handles multiple sentiment classes automatically
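
For multi-class models, coef_ holds one row per class. A sketch of the difference described above, assuming the fitted labels include "positive" and "negative" (the label names are an assumption):

import numpy as np

classes = list(model.classes_)
pos_row = model.coef_[classes.index("positive")]
neg_row = model.coef_[classes.index("negative")]

# High values pull toward the positive class, low values toward the negative class.
importance = pos_row - neg_row
order = np.argsort(importance)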

Tree-based Models (with feature_importances_ attribute)

  • RandomForestClassifier
  • GradientBoostingClassifier
  • XGBClassifier (from the xgboost package)
  • DecisionTreeClassifier

Feature Importance:

  • Shows features that contribute most to classification decisions
  • Based on information gain or impurity reduction
  • All values are non-negative, so there is no positive/negative directional interpretation
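
A sketch of the tree-based path, assuming a fitted model that exposes feature_importances_:

import numpy as np

top_n = 20
feature_names = vectorizer.get_feature_names_out()
importances = model.feature_importances_

# Largest importances first.
for i in np.argsort(importances)[::-1][:top_n]:
    print(f"{feature_names[i]}: {importances[i]:.4f}")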

Output Format

Linear Models Output:

Top Positive Features:
excellent: 0.8234
amazing: 0.7891
fantastic: 0.7456
...

Top Negative Features:
terrible: -0.9123
awful: -0.8567
disappointing: -0.8234
...

Tree-based Models Output:

Top Important Features:
great: 0.0456
good: 0.0389
bad: 0.0234
...

Algorithm Details

  1. Feature Name Extraction: Uses vectorizer.get_feature_names_out() to map indices to words
  2. Linear Model Processing:
    • For binary classification: Uses the single coefficient row (coef_[0]) directly
    • For multi-class: Calculates the difference between the positive-class and negative-class coefficient rows
    • Sorts by coefficient value to surface the strongest positive and negative features
  3. Tree Model Processing:
    • Uses the feature_importances_ attribute directly
    • Sorts by importance value in descending order
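
Putting these steps together, a hedged reconstruction of the function. This is a sketch based on the description above, not the repository's actual source; the "positive"/"negative" class labels in the multi-class branch are an assumption.

import numpy as np

def analyze_feature_importance(model, vectorizer, top_n=20):
    # Step 1: map column indices back to words/phrases.
    feature_names = vectorizer.get_feature_names_out()

    if hasattr(model, "coef_"):
        # Step 2: linear models.
        if model.coef_.shape[0] == 1:
            scores = model.coef_[0]  # binary classification
        else:
            # Multi-class: class label names are an assumption.
            classes = list(model.classes_)
            scores = (model.coef_[classes.index("positive")]
                      - model.coef_[classes.index("negative")])
        order = np.argsort(scores)
        print("Top Positive Features:")
        for i in order[-top_n:][::-1]:
            print(f"{feature_names[i]}: {scores[i]:.4f}")
        print("\nTop Negative Features:")
        for i in order[:top_n]:
            print(f"{feature_names[i]}: {scores[i]:.4f}")
    else:
        # Step 3: tree-based models.
        importances = model.feature_importances_
        print("Top Important Features:")
        for i in np.argsort(importances)[::-1][:top_n]:
            print(f"{feature_names[i]}: {importances[i]:.4f}")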

Error Handling

  • Assumes the model exposes either a coef_ or a feature_importances_ attribute
  • No explicit error handling for unsupported model types; a missing attribute typically surfaces as an AttributeError
  • Caller is responsible for ensuring the model and vectorizer were fitted on the same feature space
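
If the model family is not known in advance, the caller can guard the call. A small sketch (this check is not part of the documented function):

if hasattr(model, "coef_") or hasattr(model, "feature_importances_"):
    analyze_feature_importance(model, vectorizer)
else:
    print(f"Skipping feature analysis: {type(model).__name__} "
          "exposes neither coef_ nor feature_importances_")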