Transcript Model

📌 Model Description

We use an SBERT-powered One-vs-Rest sentiment classification model for sentence-level emotion recognition from dialogue transcripts. The approach combines Sentence-BERT semantic embeddings, SMOTE class balancing, XGBoost classifiers, and logit-based class bias adjustments to handle extreme class imbalance and maximize both overall and per-class performance.

Key Features:

  • Sentence-BERT (SBERT) for semantic embeddings
  • SMOTE oversampling for underrepresented emotion classes (1, 2, 3, 5, 6)
  • One-vs-Rest strategy using XGBoost with calibrated probabilities
  • Logit scaling to adjust confidence in rare classes at prediction time (these pieces are sketched together after this list)
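A rough sketch of how these pieces fit together. The SBERT checkpoint name and the per-class bias factors below are illustrative assumptions (the wiki does not document them), and the additive log-space bias is one common way to implement the logit adjustment:

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sentence_transformers import SentenceTransformer
from sklearn.calibration import CalibratedClassifierCV
from sklearn.multiclass import OneVsRestClassifier
from xgboost import XGBClassifier

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint

# Illustrative log-space boosts for the rare classes (1, 2, 3, 5, 6);
# the actual factors used by the project are not documented here.
CLASS_BIAS = np.log(np.array([1.0, 1.5, 1.5, 1.5, 1.0, 1.5, 1.5]))

def train(sentences, labels):
    """Embed with SBERT, rebalance with SMOTE, fit calibrated OvR XGBoost."""
    X = encoder.encode(sentences)  # (n_sentences, 384) dense embeddings
    X_bal, y_bal = SMOTE(random_state=42).fit_resample(X, labels)
    clf = OneVsRestClassifier(
        CalibratedClassifierCV(XGBClassifier(eval_metric="logloss"), cv=3)
    )
    clf.fit(X_bal, y_bal)
    return clf

def predict(clf, sentences):
    """Shift logits toward the rare classes before taking the argmax."""
    proba = clf.predict_proba(encoder.encode(sentences))
    logits = np.log(proba + 1e-12) + CLASS_BIAS  # bias rare classes upward
    return logits.argmax(axis=1)
```

Adding a constant in log space is equivalent to multiplying the calibrated probabilities by per-class weights, so rare-class predictions can be boosted at inference time without retraining.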

⚙️ How This Model Was Trained

Input Data

  • dd_train_best.csv and dd_test_best.csv, extracted from the DailyDialog dataset (https://www.kaggle.com/datasets/thedevastator/dailydialog-unlock-the-conversation-potential-in).
  • The dataset contains 43,000 sentences for training and 4,300 for testing.
  • Dialogues in which every turn is labeled "0" (Neutral) were removed to reduce class imbalance.
  • Columns used (loaded as sketched below):
    • "dialog" – List of dialogue turns per scene (as stringified list)
    • "emotion" – List of emotion labels (0–6) matching the dialogue turns

Preprocessing Steps

  • Removed rows whose emotion labels were all zero
  • Cleaned punctuation and whitespace
  • Annotated sentence changes within each dialogue
  • Flattened each dialogue into sentence-level rows (sketched below)
  • Applied SMOTE to the minority classes before training
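The flattening and cleaning steps might look like this sketch (the helper name and whitespace regex are illustrative):

```python
import re
import pandas as pd

def flatten_scenes(df):
    """Expand scene-level rows into one row per sentence with its label."""
    rows = []
    for turns, labels in zip(df["dialog"], df["emotion"]):
        if all(label == 0 for label in labels):
            continue  # drop scenes whose turns are all labeled Neutral (0)
        for text, label in zip(turns, labels):
            text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
            rows.append({"sentence": text, "emotion": label})
    return pd.DataFrame(rows)
```

SMOTE itself runs on the SBERT embeddings (as in the pipeline sketch above) rather than on raw text, since it interpolates between numeric feature vectors.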

Output Data

  • Emotion prediction for each sentence (mapping also given as code below):
    • 0: Neutral
    • 1: Joy
    • 2: Sadness
    • 3: Anger
    • 4: Fear
    • 5: Disgust
    • 6: Surprise
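The same mapping as a Python dictionary, convenient for decoding predictions:

```python
ID2EMOTION = {
    0: "Neutral",
    1: "Joy",
    2: "Sadness",
    3: "Anger",
    4: "Fear",
    5: "Disgust",
    6: "Surprise",
}
```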

📈 Performance

Final evaluation on the test set (4,300 samples):

Class         Precision  Recall  F1-score  Support
0 (Neutral)        0.79    0.94      0.86     2981
1 (Joy)            0.69    0.11      0.18      104
2 (Sadness)        0.78    0.16      0.26       45
3 (Anger)          1.00    0.38      0.56       13
4 (Fear)           0.74    0.53      0.62      956
5 (Disgust)        0.75    0.06      0.11       99
6 (Surprise)       0.71    0.12      0.20      102

Summary:

  • 🎯 Overall accuracy: 78% on the 4,300-sentence test set
  • 📊 Macro F1-score: 0.40
  • 📊 Weighted F1-score: 0.75 (see the evaluation sketch below)
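Figures like these come straight out of scikit-learn's classification_report; a minimal sketch, with y_true and y_pred standing in for the test labels and the model's predictions:

```python
from sklearn.metrics import classification_report

def evaluate(y_true, y_pred):
    # Prints per-class precision/recall/F1-score and support, plus the
    # overall accuracy and macro / weighted F1 averages quoted above.
    print(classification_report(y_true, y_pred, digits=2))
```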

The model substantially improves minority-class performance while preserving high accuracy on the majority class (0: Neutral). We acknowledge the still-poor F1-scores on the rare classes and are experimenting with other sampling techniques to fine-tune the model further. We are also preparing movie transcripts as test inputs so we can manually assess the model's accuracy.