Viewers comment model - Lin-Brain-Lab/video_sentiment GitHub Wiki

Viewers Comment Model

📌 Model Description

The Viewers Comment Model is a two-stage sentiment classifier designed to predict the overall sentiment of a YouTube video based on its top-liked viewer comments.

Architecture Overview:

  1. Stage 1 – Comment-level Sentiment Classifier:

    • Uses TF-IDF vectorization on comment text.
    • Trained using XGBoost on hand-labeled comments (multi-class: 0–4).
    • Balanced using SMOTE and RandomUnderSampler in an imbalanced-learn pipeline.
  2. Stage 2 – Aggregator Classifier:

    • Extracts the top 30 most-liked comments per video.
    • Predicts each comment's sentiment with the Stage 1 classifier.
    • Counts the number of each sentiment class (0–4).
    • Uses a Random Forest to predict overall sentiment based on this count vector.
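
The count vector that Stage 2 consumes can be sketched as follows. This is a minimal illustration rather than the repository's code: the function name `count_vector` and the sample predictions are hypothetical.

```python
from collections import Counter

NUM_CLASSES = 5  # sentiment classes 0-4


def count_vector(predicted_labels, num_classes=NUM_CLASSES):
    """Return [Count_0, ..., Count_4] for a list of per-comment class predictions."""
    counts = Counter(predicted_labels)
    # Classes that never occur still get an explicit 0 so the vector length is fixed.
    return [counts.get(label, 0) for label in range(num_classes)]


# Made-up Stage 1 predictions for the 30 most-liked comments of one video.
example_predictions = [2] * 14 + [1] * 9 + [0] * 5 + [4] * 2
print(count_vector(example_predictions))  # [5, 9, 14, 0, 2]
```

Keeping a fixed-length vector, including zeros for absent classes, means every video yields the same five features regardless of which sentiments appear.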

⚙️ How This Model Was Trained

Stage 1: XGBoost on Hand-Labeled Comments

  • Input data:
    allcomments_labled.csv: 8,000 comments taken from the top 50 comments on various YouTube movie clips, each hand-labeled with a class from 0 to 4.

    • Columns: text (comment), sentiment (label from 0–4)
  • Preprocessing:

    • TF-IDF vectorization with trigrams, min_df=5, max_df=0.7, 30,000 features
    • Class balancing with:
      • SMOTE for minority oversampling
      • RandomUnderSampler for majority class reduction
  • Model:
    XGBClassifier with scale_pos_weight set from computed class weights

  • Output:
    Multi-class sentiment prediction for each comment:

    • 0: Neutral
    • 1: Pleased
    • 2: Funny
    • 3: Fear
    • 4: Sad

Stage 2: Random Forest on Aggregated Sentiment Counts

  • Input data:
    trainingimproved.csv contains 100 data points generated with the Stage 1 model. Each row pairs a video's overall sentiment with the number of comments assigned to each emotion label among its top 30 most-liked comments.

    • Columns: Count_0 to Count_4 (number of each sentiment type from 30 comments)
    • Actual Sentiment: final label for each video
  • Model:
    RandomForestClassifier with class_weight='balanced'

  • Output:
    Final predicted sentiment of the video

  • Evaluation:
    Performance was measured manually on 120 different movie clips, of which 87% were predicted correctly (https://docs.google.com/spreadsheets/d/17vP7-mdsTEPYhxXXGlTWZKuioEFOPvSmzO1iR6PwhoA/edit?usp=sharing).
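
A minimal sketch of the Stage 2 setup, assuming scikit-learn and using made-up count vectors in place of trainingimproved.csv. The training rows, the `new_video` example, and `random_state=0` are illustrative assumptions; only `class_weight='balanced'` comes from the description above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Made-up count vectors [Count_0..Count_4]; each row sums to 30 comments.
X_train = np.array([
    [22, 3, 2, 2, 1], [20, 4, 3, 1, 2],   # mostly Neutral
    [3, 21, 4, 1, 1], [2, 23, 2, 2, 1],   # mostly Pleased
    [2, 4, 22, 1, 1], [1, 3, 24, 1, 1],   # mostly Funny
    [2, 1, 2, 23, 2], [3, 2, 1, 21, 3],   # mostly Fear
    [2, 2, 1, 3, 22], [1, 3, 2, 2, 22],   # mostly Sad
])
y_train = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4])  # "Actual Sentiment" labels

# class_weight="balanced" reweights classes inversely to their frequency.
model = RandomForestClassifier(class_weight="balanced", random_state=0)
model.fit(X_train, y_train)

# Count vector for a hypothetical new video (also sums to 30).
new_video = np.array([[3, 5, 19, 2, 1]])
predicted_class = int(model.predict(new_video)[0])
print(predicted_class)
```

With only five count features, the forest stays small and quick to retrain whenever the Stage 1 model changes and the counts are regenerated.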

🎥 Real-Time Prediction Flow

  1. Use the YouTube Data API to fetch top comments (via commentThreads).
  2. Keep the 30 comments with the highest like counts.
  3. Predict each comment's sentiment using the Stage 1 XGBoost model.
  4. Count the frequency of each class (0–4).
  5. Feed the counts to the Stage 2 Random Forest model.
  6. Interpret and return the final sentiment (e.g., "funny", "fear", "pleased").
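
The flow above can be sketched end to end as below. This is an illustration under stated assumptions: the items follow the shape of a YouTube Data API commentThreads response, while `top_comments`, `predict_video_sentiment`, `fake_item`, and the two stand-in classifiers are hypothetical placeholders for the trained Stage 1 and Stage 2 models.

```python
SENTIMENT_NAMES = {0: "Neutral", 1: "Pleased", 2: "Funny", 3: "Fear", 4: "Sad"}


def top_comments(items, limit=30):
    """Return the text of the `limit` most-liked top-level comments."""
    comments = [item["snippet"]["topLevelComment"]["snippet"] for item in items]
    comments.sort(key=lambda c: c.get("likeCount", 0), reverse=True)
    return [c["textDisplay"] for c in comments[:limit]]


def predict_video_sentiment(items, classify_comment, classify_counts):
    """Run steps 2-6: select comments, classify, count, aggregate, name."""
    texts = top_comments(items)
    labels = [classify_comment(text) for text in texts]
    counts = [labels.count(cls) for cls in range(5)]
    return SENTIMENT_NAMES[classify_counts(counts)]


def fake_item(text, likes):
    """Build a made-up item shaped like a commentThreads result."""
    snippet = {"textDisplay": text, "likeCount": likes}
    return {"snippet": {"topLevelComment": {"snippet": snippet}}}


# Stand-ins for the trained models (hypothetical, for illustration only).
def stub_stage1(text):
    return 2 if "lol" in text.lower() else 0


def stub_stage2(counts):
    return counts.index(max(counts))


items = [fake_item(f"lol {i}", likes=i) for i in range(30)]
items += [fake_item(f"ok {i}", likes=100 + i) for i in range(10)]

result = predict_video_sentiment(items, stub_stage1, stub_stage2)
print(result)  # Funny
```

Passing the two classifiers in as callables keeps the selection and counting logic testable without loading the trained models or calling the API.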

🔧 Sentiment Mapping

| Class | Meaning |
|-------|---------|
| 0     | Neutral |
| 1     | Pleased |
| 2     | Funny   |
| 3     | Fear    |
| 4     | Sad     |

📦 Example Output

🎬 Predicted sentiment for video 'Inside Out – Official Trailer': Funny