[function] analyze_feature_importance() - P3chys/textmining GitHub Wiki
Function: analyze_feature_importance()
Purpose
Analyzes and displays the most important features from trained machine learning models, helping to understand which words or phrases most strongly influence sentiment classification.
Syntax
analyze_feature_importance(model, vectorizer, top_n=20)
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | sklearn model | Required | Trained machine learning model (linear or tree-based) |
| vectorizer | sklearn vectorizer | Required | Fitted vectorizer (CountVectorizer or TfidfVectorizer) |
| top_n | int | 20 | Number of top features to display |
Supported Model Types
Linear Models (with coef_ attribute)
- LogisticRegression
- LinearSVC
- SGDClassifier
- RidgeClassifier and L1-penalized (Lasso-style) linear classifiers
For Binary Classification:
- Displays top positive and negative features
- Positive coefficients indicate words associated with positive sentiment
- Negative coefficients indicate words associated with negative sentiment
For Multi-class Classification:
- Calculates feature importance as difference between positive and negative class coefficients
- Handles multiple sentiment classes automatically
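The binary and multi-class handling described above can be sketched as follows. This is a minimal illustration, not the wiki's actual implementation: the helper name `rank_linear_features` is hypothetical, and the multi-class branch assumes that the first row of `coef_` belongs to the negative class and the last row to the positive class, which the source does not specify.

```python
import numpy as np

def rank_linear_features(coef, feature_names, top_n=20):
    """Return (top_positive, top_negative) lists of (name, weight) pairs.

    Hypothetical helper: `coef` is expected to look like a scikit-learn
    `coef_` array of shape (1, n_features) or (n_classes, n_features).
    """
    coef = np.atleast_2d(np.asarray(coef, dtype=float))
    if coef.shape[0] == 1:
        # Binary case: a single row of coefficients is used directly.
        weights = coef[0]
    else:
        # Multi-class case. ASSUMPTION: row 0 is the negative class and
        # the last row is the positive class.
        weights = coef[-1] - coef[0]

    order = np.argsort(weights)  # ascending: most negative first
    top_negative = [(str(feature_names[i]), float(weights[i])) for i in order[:top_n]]
    top_positive = [(str(feature_names[i]), float(weights[i])) for i in order[::-1][:top_n]]
    return top_positive, top_negative
```

With a fitted model, the inputs would typically be `model.coef_` and `vectorizer.get_feature_names_out()`.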
Tree-based Models (with feature_importances_ attribute)
- RandomForestClassifier
- GradientBoostingClassifier
- XGBClassifier (from the xgboost package)
- DecisionTreeClassifier
Feature Importance:
- Shows features that contribute most to classification decisions
- Based on information gain or impurity reduction
- All values are non-negative (no directional interpretation)
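For tree-based models, the ranking reduces to a descending sort of the importance values. The sketch below shows one way to do it; the helper name `rank_tree_features` is hypothetical and not taken from the source.

```python
import numpy as np

def rank_tree_features(importances, feature_names, top_n=20):
    """Return the top_n (name, importance) pairs, largest importance first.

    Hypothetical helper: `importances` is expected to look like a
    scikit-learn `feature_importances_` array of shape (n_features,).
    """
    importances = np.asarray(importances, dtype=float)
    # argsort is ascending, so reverse it before taking the first top_n.
    order = np.argsort(importances)[::-1][:top_n]
    return [(str(feature_names[i]), float(importances[i])) for i in order]
```

Because the values carry no sign, the result says how much a word matters to the model, not which sentiment it pushes toward.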
Output Format
Linear Models Output:
Top Positive Features:
excellent: 0.8234
amazing: 0.7891
fantastic: 0.7456
...
Top Negative Features:
terrible: -0.9123
awful: -0.8567
disappointing: -0.8234
...
Tree-based Models Output:
Top Important Features:
great: 0.0456
good: 0.0389
bad: 0.0234
...
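The listings above follow a simple `name: value` layout with four decimal places. A formatter along these lines would reproduce it; the function name `format_features` is an assumption made for illustration.

```python
def format_features(title, pairs):
    """Render (name, value) pairs as a titled block, one feature per line."""
    lines = [f"{title}:"]
    # Four decimal places, matching the sample output shown above.
    lines.extend(f"{name}: {value:.4f}" for name, value in pairs)
    return "\n".join(lines)
```

For example, `print(format_features("Top Negative Features", [("terrible", -0.9123)]))` prints the heading followed by `terrible: -0.9123`.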
Algorithm Details
- Feature Name Extraction: Uses vectorizer.get_feature_names_out() to map indices to words
- Linear Model Processing:
  - For binary: Uses coefficient array directly
  - For multi-class: Calculates difference between positive and negative class coefficients
  - Sorts by coefficient magnitude
- Tree Model Processing:
  - Uses feature_importances_ attribute directly
  - Sorts by importance value in descending order
Error Handling
- Function assumes model has either coef_ or feature_importances_ attribute
- No explicit error handling for unsupported model types
- Caller responsible for ensuring model-vectorizer compatibility
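Since the function itself performs no checks, a caller may want to validate inputs before calling it. The following is a sketch of such a guard, not part of the documented function: the name `check_inputs` and the specific exceptions raised are assumptions.

```python
import numpy as np

def check_inputs(model, vectorizer):
    """Validate a model/vectorizer pair and return "linear" or "tree".

    Hypothetical pre-check; raises TypeError for unsupported models and
    ValueError when the feature counts disagree.
    """
    if hasattr(model, "coef_"):
        kind, values = "linear", model.coef_
    elif hasattr(model, "feature_importances_"):
        kind, values = "tree", model.feature_importances_
    else:
        raise TypeError(
            f"{type(model).__name__} has neither coef_ nor feature_importances_"
        )

    # The last axis of either attribute is the number of features.
    n_model = np.shape(values)[-1]
    n_vocab = len(vectorizer.get_feature_names_out())
    if n_model != n_vocab:
        raise ValueError(
            f"model expects {n_model} features, vectorizer provides {n_vocab}"
        )
    return kind
```

Calling `check_inputs(model, vectorizer)` first turns a cryptic indexing error into a clear message when the model was trained with a different vectorizer.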