Viewers comment model - Lin-Brain-Lab/video_sentiment GitHub Wiki

Viewers Comment Model

📌 Model Description

The Viewers Comment Model is a two-stage sentiment classifier designed to predict the overall sentiment of a YouTube video based on its top-liked viewer comments.

Architecture Overview:

Stage 1 – Comment-level Sentiment Classifier:
- Uses TF-IDF vectorization on comment text.
- Trained using XGBoost on hand-labeled comments (multi-class: 0–4).
- Balanced using SMOTE and RandomUnderSampler in an imbalanced-learn pipeline.
Stage 2 – Aggregator Classifier:
- Extracts the top 30 most-liked comments per video.
- Predicts each comment’s sentiment from Stage 1.
- Counts the number of each sentiment class (0–4).
- Uses a Random Forest to predict overall sentiment based on this count vector.
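
The count vector that links the two stages can be illustrated with a minimal sketch. The function name `count_vector` and the example predictions are hypothetical and only show the shape of the Stage 2 input, not the repository's actual code.

```python
from collections import Counter

NUM_CLASSES = 5  # sentiment classes 0-4, as described on this page


def count_vector(predicted_labels, num_classes=NUM_CLASSES):
    """Return [n_0, ..., n_4]: how many comments fell into each class."""
    counts = Counter(predicted_labels)
    # Classes that never occur still get an explicit 0 so the vector length is fixed.
    return [counts.get(label, 0) for label in range(num_classes)]


# Hypothetical Stage 1 predictions for one video's 30 most-liked comments.
example_predictions = [2] * 14 + [1] * 9 + [0] * 5 + [4] * 2
print(count_vector(example_predictions))  # [5, 9, 14, 0, 2]
```

Keeping the vector at a fixed length of five means every video yields the same feature layout, even when a class is absent from its comments.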

⚙️ How This Model Was Trained

Stage 1: XGBoost on Hand-Labeled Comments

Input data:
allcomments_labled.csv: 8,000 comments taken from the top 50 comments of different movie clips on YouTube, each hand-labeled with a class from 0 to 4.
- Columns: text (comment), sentiment (label from 0–4)
Preprocessing:
- TF-IDF vectorization with trigrams, min_df=5, max_df=0.7, 30,000 features
- Class balancing with:
  - SMOTE for minority oversampling
  - RandomUnderSampler for majority class reduction
Model:
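XGBClassifier with scale_pos_weight set from computed class weights. Note that XGBoost documents scale_pos_weight for binary classification, so with the five classes used here per-class weights would normally be supplied as per-sample weights (sample_weight) at fit time.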
Output:
Multi-class sentiment prediction for each comment:
- 0: Neutral
- 1: Pleased
- 2: Funny
- 3: Fear
- 4: Sad

Stage 2: Random Forest on Aggregated Sentiment Counts

Input data:
trainingimproved.csv: 100 data points generated with the Stage 1 model. Each row pairs a video's sentiment label with the number of comments of each emotion class among that video's 30 most-liked comments.
- Columns: Count_0 to Count_4 (number of each sentiment type from 30 comments)
- Actual Sentiment: final label for each video
Model:
RandomForestClassifier with class_weight='balanced'
Output:
Final predicted sentiment of the video
Evaluation:
Performance was measured manually on 120 different movie clips, of which 87% were predicted correctly (results: https://docs.google.com/spreadsheets/d/17vP7-mdsTEPYhxXXGlTWZKuioEFOPvSmzO1iR6PwhoA/edit?usp=sharing).
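
The Stage 2 setup above (count features in, video label out) can be sketched as below. The ten training rows are invented for illustration and only mimic the Count_0 to Count_4 layout; they are not taken from trainingimproved.csv, and `n_estimators` and `random_state` are assumed values.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical rows: Count_0..Count_4 over 30 comments, one row per video.
X = np.array([
    [20, 5, 3, 1, 1],
    [18, 6, 4, 1, 1],
    [4, 20, 4, 1, 1],
    [5, 19, 3, 2, 1],
    [3, 5, 20, 1, 1],
    [4, 4, 19, 2, 1],
    [3, 3, 2, 20, 2],
    [4, 2, 3, 19, 2],
    [3, 3, 2, 2, 20],
    [4, 2, 2, 3, 19],
])
# Hypothetical "Actual Sentiment" labels for the rows above.
y = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4])

# class_weight="balanced" reweights classes inversely to their frequency.
model = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=0
)
model.fit(X, y)

# Count vector for a new video, in the same column order as the training data.
new_video_counts = np.array([[5, 9, 14, 0, 2]])
predicted = int(model.predict(new_video_counts)[0])
print(predicted)
```

Because the model sees only five counts per video, it can learn rules that a plain majority vote would miss, for example that a moderate number of fear comments outweighs a larger number of neutral ones.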

🎥 Real-Time Prediction Flow

Use the YouTube Data API to fetch top comments (via commentThreads).
Select the 30 most-liked comments.
Predict each comment's sentiment using the Stage 1 XGBoost model.
Count the frequency of each class (0–4).
Feed the counts to the Stage 2 Random Forest model.
Interpret and return the final sentiment (e.g., "funny", "fear", "pleased").
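
The steps above can be sketched end to end as follows. The function names, the stub models, and the fake response are hypothetical; the response layout follows the `commentThreads` resource (`items[].snippet.topLevelComment.snippet`), and the network call itself is left out so the sketch runs offline.

```python
LABELS = {0: "neutral", 1: "pleased", 2: "funny", 3: "fear", 4: "sad"}


def extract_comments(response):
    """Pull (text, like_count) pairs from a commentThreads.list response dict."""
    comments = []
    for item in response.get("items", []):
        top = item["snippet"]["topLevelComment"]["snippet"]
        comments.append((top["textDisplay"], int(top.get("likeCount", 0))))
    return comments


def top_liked(comments, limit=30):
    """Return the texts of the `limit` most-liked comments, most liked first."""
    ranked = sorted(comments, key=lambda c: c[1], reverse=True)
    return [text for text, _ in ranked[:limit]]


def predict_video_sentiment(response, predict_comments, predict_video):
    """Run the flow: top comments -> per-comment labels -> counts -> video label."""
    texts = top_liked(extract_comments(response))
    counts = [0] * len(LABELS)
    for label in predict_comments(texts):
        counts[int(label)] += 1
    return LABELS[int(predict_video(counts))]


# Stand-ins so the sketch runs without trained models or API access.
def stub_stage1(texts):
    return [0 if text == "ok" else 2 for text in texts]


def stub_stage2(counts):
    # Plain majority vote, NOT the Random Forest described on this page.
    return counts.index(max(counts))


def _thread(text, likes):
    return {"snippet": {"topLevelComment": {"snippet": {
        "textDisplay": text, "likeCount": likes}}}}


fake_response = {"items": [_thread("so funny", 12), _thread("lol", 40), _thread("ok", 3)]}
print(predict_video_sentiment(fake_response, stub_stage1, stub_stage2))  # funny
```

In real use, `predict_comments` would wrap the fitted Stage 1 model's `predict`, and `predict_video` would wrap the Random Forest's `predict` on `[counts]` and return its single element.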

🔧 Sentiment Mapping

| Class | Meaning |
|-------|---------|
| 0 | Neutral |
| 1 | Pleased |
| 2 | Funny |
| 3 | Fear |
| 4 | Sad |

📦 Example Output

🎬 Predicted sentiment for video 'Inside Out – Official Trailer': Funny