[function] analyze_feature_importance() - P3chys/textmining GitHub Wiki

Function: analyze_feature_importance()

Purpose

Analyzes and displays the most important features from trained machine learning models, helping to understand which words or phrases most strongly influence sentiment classification.

Syntax

analyze_feature_importance(model, vectorizer, top_n=20)

Parameters

  • model (sklearn model, required): Trained machine learning model (linear or tree-based)
  • vectorizer (sklearn vectorizer, required): Fitted vectorizer (CountVectorizer or TfidfVectorizer)
  • top_n (int, default 20): Number of top features to display
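
A minimal usage sketch with hypothetical training data. Only the call signature comes from this page; the corpus, labels, and model choice below are illustrative, and the import of analyze_feature_importance from the project's own module is omitted because its path is not documented here.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus; replace with the project's review data.
texts = ["excellent product, works great", "terrible quality, very disappointing",
         "amazing value", "awful experience"]
labels = [1, 0, 1, 0]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

model = LogisticRegression()
model.fit(X, labels)

# analyze_feature_importance is the function documented on this page.
# Show the 10 strongest features instead of the default 20.
analyze_feature_importance(model, vectorizer, top_n=10)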

Supported Model Types

Linear Models (with coef_ attribute)

  • LogisticRegression
  • LinearSVC
  • SGDClassifier
  • RidgeClassifier (and other linear estimators exposing a coef_ attribute)

For Binary Classification:

  • Displays top positive and negative features
  • Positive coefficients indicate words associated with positive sentiment
  • Negative coefficients indicate words associated with negative sentiment
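
A sketch of how the binary case can be read off a fitted linear model (variable names are illustrative, not taken from the repository):

import numpy as np

top_n = 20
feature_names = vectorizer.get_feature_names_out()
coefs = model.coef_[0]  # binary classification: a single row of coefficients

order = np.argsort(coefs)  # ascending: most negative first, most positive last
top_positive = [(feature_names[i], coefs[i]) for i in order[-top_n:][::-1]]
top_negative = [(feature_names[i], coefs[i]) for i in order[:top_n]]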

For Multi-class Classification:

  • Calculates feature importance as the difference between the positive-class and negative-class coefficients
  • Handles multiple sentiment classes automatically
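
For multi-class models, coef_ holds one row per class. A sketch of the difference described above, assuming the fitted labels include "positive" and "negative" (the label names are an assumption):

import numpy as np

classes = list(model.classes_)
pos_row = model.coef_[classes.index("positive")]
neg_row = model.coef_[classes.index("negative")]

# High values pull toward the positive class, low values toward the negative class.
importance = pos_row - neg_row
order = np.argsort(importance)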

Tree-based Models (with feature_importances_ attribute)

  • RandomForestClassifier
  • GradientBoostingClassifier
  • XGBClassifier (from the xgboost package)
  • DecisionTreeClassifier

Feature Importance:

  • Shows features that contribute most to classification decisions
  • Based on information gain or impurity reduction
  • All values are non-negative, so there is no positive/negative directional interpretation
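
A sketch of the tree-based path, assuming a fitted model that exposes feature_importances_:

import numpy as np

top_n = 20
feature_names = vectorizer.get_feature_names_out()
importances = model.feature_importances_

# Largest importances first.
for i in np.argsort(importances)[::-1][:top_n]:
    print(f"{feature_names[i]}: {importances[i]:.4f}")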

Output Format

Linear Models Output:

Top Positive Features:
excellent: 0.8234
amazing: 0.7891
fantastic: 0.7456
...

Top Negative Features:
terrible: -0.9123
awful: -0.8567
disappointing: -0.8234
...

Tree-based Models Output:

Top Important Features:
great: 0.0456
good: 0.0389
bad: 0.0234
...

Algorithm Details

  1. Feature Name Extraction: Uses vectorizer.get_feature_names_out() to map indices to words
  2. Linear Model Processing:
    • For binary classification: Uses the single coefficient row (coef_[0]) directly
    • For multi-class: Calculates the difference between the positive-class and negative-class coefficient rows
    • Sorts by coefficient value to surface the strongest positive and negative features
  3. Tree Model Processing:
    • Uses the feature_importances_ attribute directly
    • Sorts by importance value in descending order
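
Putting these steps together, a hedged reconstruction of the function. This is a sketch based on the description above, not the repository's actual source; the "positive"/"negative" class labels in the multi-class branch are an assumption.

import numpy as np

def analyze_feature_importance(model, vectorizer, top_n=20):
    # Step 1: map column indices back to words/phrases.
    feature_names = vectorizer.get_feature_names_out()

    if hasattr(model, "coef_"):
        # Step 2: linear models.
        if model.coef_.shape[0] == 1:
            scores = model.coef_[0]  # binary classification
        else:
            # Multi-class: class label names are an assumption.
            classes = list(model.classes_)
            scores = (model.coef_[classes.index("positive")]
                      - model.coef_[classes.index("negative")])
        order = np.argsort(scores)
        print("Top Positive Features:")
        for i in order[-top_n:][::-1]:
            print(f"{feature_names[i]}: {scores[i]:.4f}")
        print("\nTop Negative Features:")
        for i in order[:top_n]:
            print(f"{feature_names[i]}: {scores[i]:.4f}")
    else:
        # Step 3: tree-based models.
        importances = model.feature_importances_
        print("Top Important Features:")
        for i in np.argsort(importances)[::-1][:top_n]:
            print(f"{feature_names[i]}: {importances[i]:.4f}")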

Error Handling

  • Assumes the model exposes either a coef_ or a feature_importances_ attribute
  • No explicit error handling for unsupported model types; a missing attribute typically surfaces as an AttributeError
  • Caller is responsible for ensuring the model and vectorizer were fitted on the same feature space
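
If the model family is not known in advance, the caller can guard the call. A small sketch (this check is not part of the documented function):

if hasattr(model, "coef_") or hasattr(model, "feature_importances_"):
    analyze_feature_importance(model, vectorizer)
else:
    print(f"Skipping feature analysis: {type(model).__name__} "
          "exposes neither coef_ nor feature_importances_")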