Viewers comment model - Lin-Brain-Lab/video_sentiment GitHub Wiki
Viewers Comment Model
Model Description
The Viewers Comment Model is a two-stage sentiment classifier designed to predict the overall sentiment of a YouTube video based on its top-liked viewer comments.
Architecture Overview:
- Stage 1: Comment-level Sentiment Classifier
  - Uses TF-IDF vectorization on comment text.
  - Trained using XGBoost on hand-labeled comments (multi-class: 0 to 4).
  - Balanced using SMOTE and RandomUnderSampler in an imbalanced-learn pipeline.
- Stage 2: Aggregator Classifier
  - Extracts the top 30 most-liked comments per video.
  - Predicts each comment's sentiment using the Stage 1 model.
  - Counts the number of comments in each sentiment class (0 to 4).
  - Uses a Random Forest to predict overall sentiment based on this count vector.
How This Model Was Trained
Stage 1: XGBoost on Hand-Labeled Comments
- Input data: `allcomments_labled.csv`
  - 8,000 comments, drawn from the top 50 comments on different movie clips on YouTube, each hand-labeled with a class from 0 to 4.
  - Columns: `text` (comment), `sentiment` (label from 0 to 4)
- Preprocessing:
  - TF-IDF vectorization with trigrams, min_df=5, max_df=0.7, 30,000 features
  - Class balancing with:
    - `SMOTE` for minority oversampling
    - `RandomUnderSampler` for majority class reduction
- Model: `XGBClassifier` with `scale_pos_weight` set from computed class weights
- Output: Multi-class sentiment prediction for each comment:
  - `0`: Neutral
  - `1`: Pleased
  - `2`: Funny
  - `3`: Fear
  - `4`: Sad
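The Stage 1 setup described above could be assembled roughly as follows. This is a minimal sketch, not the repository's actual training script: the names `TFIDF_PARAMS` and `build_stage1_pipeline` and the `random_state` value are hypothetical, `ngram_range=(1, 3)` is an assumed reading of "trigrams", the samplers use their default strategies, and the `scale_pos_weight` setting is left out because its value is not given here.

```python
# TF-IDF settings quoted from the preprocessing list above.
TFIDF_PARAMS = {
    "ngram_range": (1, 3),   # assumption: unigrams through trigrams
    "min_df": 5,
    "max_df": 0.7,
    "max_features": 30_000,
}


def build_stage1_pipeline(random_state=42):
    """Return an unfitted pipeline: TF-IDF -> SMOTE -> undersampling -> XGBoost."""
    # Third-party imports are local so this module can be imported
    # even where the packages are not installed.
    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline
    from imblearn.under_sampling import RandomUnderSampler
    from sklearn.feature_extraction.text import TfidfVectorizer
    from xgboost import XGBClassifier

    return Pipeline(steps=[
        ("tfidf", TfidfVectorizer(**TFIDF_PARAMS)),
        # Default sampling strategies; the real ratios are not documented here
        # and would need to be set explicitly.
        ("smote", SMOTE(random_state=random_state)),
        ("under", RandomUnderSampler(random_state=random_state)),
        ("xgb", XGBClassifier(objective="multi:softprob",
                              random_state=random_state)),
    ])
```

Assuming the CSV has been loaded into a DataFrame `df`, training would look like `build_stage1_pipeline().fit(df["text"], df["sentiment"])`. An imbalanced-learn pipeline applies the samplers only during `fit`, so predictions on new comments are not resampled.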
Stage 2: Random Forest on Aggregated Sentiment Counts
- Input data: `trainingimproved.csv`
  - Contains 100 data points generated with the Stage 1 model. Each row pairs a video's overall sentiment with counts of how many of its top 30 most-liked comments fall into each sentiment class.
  - Columns: `Count_0` to `Count_4` (number of comments of each sentiment type among the 30 comments), `Actual Sentiment` (final label for each video)
- Model: `RandomForestClassifier` with `class_weight='balanced'`
- Output: Final predicted sentiment of the video
- Evaluation: Performance was measured manually on 120 different movie clips, of which 87% were predicted correctly (https://docs.google.com/spreadsheets/d/17vP7-mdsTEPYhxXXGlTWZKuioEFOPvSmzO1iR6PwhoA/edit?usp=sharing).
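A minimal sketch of how the Stage 2 training data might be read and fitted, assuming the column names listed above. The function names `load_training_rows` and `train_aggregator` and the `random_state` value are hypothetical, and the label column is kept as raw text because its stored format is not specified here.

```python
import csv

COUNT_COLUMNS = [f"Count_{c}" for c in range(5)]  # Count_0 .. Count_4
LABEL_COLUMN = "Actual Sentiment"


def load_training_rows(csv_file):
    """Read count vectors and video labels from an open CSV file object."""
    features, labels = [], []
    for row in csv.DictReader(csv_file):
        features.append([int(row[col]) for col in COUNT_COLUMNS])
        labels.append(row[LABEL_COLUMN])  # kept as-is (string)
    return features, labels


def train_aggregator(features, labels, random_state=42):
    """Fit a class-weighted Random Forest on the count vectors."""
    # Local import so the loader above works without scikit-learn installed.
    from sklearn.ensemble import RandomForestClassifier

    model = RandomForestClassifier(class_weight="balanced",
                                   random_state=random_state)
    return model.fit(features, labels)
```

Typical use would be `with open("trainingimproved.csv", newline="") as f: X, y = load_training_rows(f)` followed by `train_aggregator(X, y)`.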
Real-Time Prediction Flow
- Use the YouTube API to fetch top comments (via `commentThreads`).
- Filter the top 30 comments by like count.
- Predict each comment's sentiment using the Stage 1 XGBoost model.
- Count the frequency of each class (0 to 4).
- Feed the counts to the Stage 2 Random Forest model.
- Interpret and return the final sentiment (e.g., `"funny"`, `"fear"`, `"pleased"`).
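The steps above can be sketched as one function. This is an illustration under stated assumptions, not the project's code: `top_liked_comments` and `predict_video_sentiment` are hypothetical names, the two models are passed in as plain callables, and `items` is assumed to be the `items` list of a YouTube Data API v3 `commentThreads.list` response requested with `part="snippet"`. The API call itself is omitted.

```python
from collections import Counter

# Class index -> label, following the sentiment classes used by both stages.
LABELS = {0: "neutral", 1: "pleased", 2: "funny", 3: "fear", 4: "sad"}


def top_liked_comments(items, n=30):
    """Return the text of the n most-liked top-level comments."""
    snippets = [item["snippet"]["topLevelComment"]["snippet"] for item in items]
    snippets.sort(key=lambda s: s.get("likeCount", 0), reverse=True)
    return [s["textDisplay"] for s in snippets[:n]]


def predict_video_sentiment(items, predict_comments, predict_video, n=30):
    """Run the two-stage flow and return the final sentiment label.

    predict_comments: callable, list of texts -> list of class ids (Stage 1)
    predict_video:    callable, count vector  -> class id          (Stage 2)
    """
    comments = top_liked_comments(items, n)
    comment_classes = predict_comments(comments)
    counts = Counter(int(c) for c in comment_classes)
    count_vector = [counts.get(c, 0) for c in range(len(LABELS))]
    return LABELS[int(predict_video(count_vector))]
```

With fitted scikit-learn style models, the callables could be wrapped as `stage1.predict` and `lambda v: stage2.predict([v])[0]`, assuming Stage 2 was trained on integer class labels.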
Sentiment Mapping
Class | Meaning |
---|---|
0 | Neutral |
1 | Pleased |
2 | Funny |
3 | Fear |
4 | Sad |
Example Output
🎬 Predicted sentiment for video 'Inside Out - Official Trailer': Funny