Transcript Model

📌 Model Description

We use an SBERT-powered One-vs-Rest sentiment classification model for sentence-level emotion recognition from dialogue transcripts. The approach combines Sentence-BERT semantic embeddings, SMOTE class balancing, XGBoost classifiers, and logit-based class bias adjustments to handle extreme class imbalance and maximize both overall and per-class performance.

Key Features:

  • Sentence-BERT (SBERT) for semantic embeddings
  • SMOTE oversampling for underrepresented emotion classes (1, 2, 3, 5, 6)
  • One-vs-Rest strategy using XGBoost with calibrated probabilities
  • Logit scaling to adjust confidence in rare classes at prediction time (these pieces are sketched together after this list)
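A rough sketch of how these pieces fit together. The SBERT checkpoint name and the per-class bias factors below are illustrative assumptions (the wiki does not document them), and the additive log-space bias is one common way to implement the logit adjustment:

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sentence_transformers import SentenceTransformer
from sklearn.calibration import CalibratedClassifierCV
from sklearn.multiclass import OneVsRestClassifier
from xgboost import XGBClassifier

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint

# Illustrative log-space boosts for the rare classes (1, 2, 3, 5, 6);
# the actual factors used by the project are not documented here.
CLASS_BIAS = np.log(np.array([1.0, 1.5, 1.5, 1.5, 1.0, 1.5, 1.5]))

def train(sentences, labels):
    """Embed with SBERT, rebalance with SMOTE, fit calibrated OvR XGBoost."""
    X = encoder.encode(sentences)  # (n_sentences, 384) dense embeddings
    X_bal, y_bal = SMOTE(random_state=42).fit_resample(X, labels)
    clf = OneVsRestClassifier(
        CalibratedClassifierCV(XGBClassifier(eval_metric="logloss"), cv=3)
    )
    clf.fit(X_bal, y_bal)
    return clf

def predict(clf, sentences):
    """Shift logits toward the rare classes before taking the argmax."""
    proba = clf.predict_proba(encoder.encode(sentences))
    logits = np.log(proba + 1e-12) + CLASS_BIAS  # bias rare classes upward
    return logits.argmax(axis=1)
```

Adding a constant in log space is equivalent to multiplying the calibrated probabilities by per-class weights, so rare-class predictions can be boosted at inference time without retraining.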

⚙️ How This Model Was Trained

Input Data

  • dd_train_best.csv and dd_test_best.csv, extracted from the DailyDialog dataset (https://www.kaggle.com/datasets/thedevastator/dailydialog-unlock-the-conversation-potential-in).
  • The dataset contains 43,000 sentences for training and 4,300 for testing.
  • Dialogues in which every turn is labeled "0" (Neutral) were removed to reduce class imbalance.
  • Columns used (loaded as sketched below):
    • "dialog" – List of dialogue turns per scene (as stringified list)
    • "emotion" – List of emotion labels (0–6) matching the dialogue turns

Preprocessing Steps

  • Removed rows whose emotion labels were all zero
  • Cleaned punctuation and whitespace
  • Annotated sentence changes within each dialogue
  • Flattened each dialogue into sentence-level rows (sketched below)
  • Applied SMOTE to the minority classes before training
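The flattening and cleaning steps might look like this sketch (the helper name and whitespace regex are illustrative):

```python
import re
import pandas as pd

def flatten_scenes(df):
    """Expand scene-level rows into one row per sentence with its label."""
    rows = []
    for turns, labels in zip(df["dialog"], df["emotion"]):
        if all(label == 0 for label in labels):
            continue  # drop scenes whose turns are all labeled Neutral (0)
        for text, label in zip(turns, labels):
            text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
            rows.append({"sentence": text, "emotion": label})
    return pd.DataFrame(rows)
```

SMOTE itself runs on the SBERT embeddings (as in the pipeline sketch above) rather than on raw text, since it interpolates between numeric feature vectors.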

Output Data

  • Emotion prediction for each sentence (mapping also given as code below):
    • 0: Neutral
    • 1: Joy
    • 2: Sadness
    • 3: Anger
    • 4: Fear
    • 5: Disgust
    • 6: Surprise
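The same mapping as a Python dictionary, convenient for decoding predictions:

```python
ID2EMOTION = {
    0: "Neutral",
    1: "Joy",
    2: "Sadness",
    3: "Anger",
    4: "Fear",
    5: "Disgust",
    6: "Surprise",
}
```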

📈 Performance

Final evaluation on the test set (4,300 samples):

Class         Precision  Recall  F1-score  Support
0 (Neutral)        0.79    0.94      0.86     2981
1 (Joy)            0.69    0.11      0.18      104
2 (Sadness)        0.78    0.16      0.26       45
3 (Anger)          1.00    0.38      0.56       13
4 (Fear)           0.74    0.53      0.62      956
5 (Disgust)        0.75    0.06      0.11       99
6 (Surprise)       0.71    0.12      0.20      102

Summary:

  • 🎯 Overall accuracy: 78% on the 4,300-sentence test set
  • 📊 Macro F1-score: 0.40
  • 📊 Weighted F1-score: 0.75 (see the evaluation sketch below)
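Figures like these come straight out of scikit-learn's classification_report; a minimal sketch, with y_true and y_pred standing in for the test labels and the model's predictions:

```python
from sklearn.metrics import classification_report

def evaluate(y_true, y_pred):
    # Prints per-class precision/recall/F1-score and support, plus the
    # overall accuracy and macro / weighted F1 averages quoted above.
    print(classification_report(y_true, y_pred, digits=2))
```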

The model substantially improves minority-class performance while preserving high accuracy on the majority class (0: Neutral). We acknowledge the still-poor F1-scores on the rare classes and are experimenting with other sampling techniques to fine-tune the model further. We are also preparing movie transcripts as test inputs so we can manually assess the model's accuracy.