Soundtrack model

Predicts continuous valence and arousal scores with Random Forests, using audio features extracted from audio files.


Model description

We train two separate models, one for valence (positivity) and one for arousal (energy), on audio features extracted from the DEAM dataset.

How this model was trained

Input data: the DEAM dataset. The input data consists of V/A (Valence-Arousal) pairs annotated for audio clips at 0.5-second intervals, starting 15 seconds into each music file and continuing until the 45-second mark, for a total of 60 V/A pairs per MP3 file across ~1800 MP3 files. Using the Python library librosa, music features were extracted from each original MP3 file and assigned to the matching 0.5-second interval, for a total of 17 features per interval. Finally, the random forest models were trained on 108,000 (60 × 1800) data points, each consisting of one interval's music features and its V/A pair.
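
The exact 17 features are not enumerated on this page, so the sketch below uses a representative librosa set (13 MFCCs plus spectral centroid, spectral rolloff, RMS energy, and zero-crossing rate, giving 17 values per window). The function name extract_features and the windowing details are illustrative assumptions, not the project's actual code.

```python
import numpy as np
import librosa

def extract_features(path, start=15.0, end=45.0, step=0.5):
    """Return one feature row per 0.5-second window between 15 s and 45 s.

    Assumes the clip is at least 45 seconds long; the 17-feature layout
    here is a stand-in for the features actually used in training.
    """
    y, sr = librosa.load(path, sr=22050)
    rows = []
    for t in np.arange(start, end, step):
        seg = y[int(t * sr):int((t + step) * sr)]
        # 13 MFCCs averaged over the window.
        mfcc = librosa.feature.mfcc(y=seg, sr=sr, n_mfcc=13).mean(axis=1)
        # Four additional spectral/temporal summaries.
        extras = [
            librosa.feature.spectral_centroid(y=seg, sr=sr).mean(),
            librosa.feature.spectral_rolloff(y=seg, sr=sr).mean(),
            librosa.feature.rms(y=seg).mean(),
            librosa.feature.zero_crossing_rate(seg).mean(),
        ]
        rows.append(np.concatenate([mfcc, extras]))
    return np.vstack(rows)  # shape: (60, 17) for the 30-second span
```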

Currently, the model outputs a single V/A pair for each time interval of uploaded music features. We are working on streamlining this process for ease of use.
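
As a reference, here is a minimal sketch of the two-model setup and the per-interval prediction step. The placeholder arrays stand in for the real feature matrix (108,000 × 17) and the matching DEAM annotations, and the hyperparameters are illustrative, not the ones behind the reported benchmarks.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Placeholder training data: in practice, X is the stacked librosa features
# for all ~1800 clips and y_valence / y_arousal are the DEAM annotations.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 17))
y_valence = rng.uniform(-1, 1, size=1000)
y_arousal = rng.uniform(-1, 1, size=1000)

# Two independent regressors, one per target dimension.
valence_model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y_valence)
arousal_model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y_arousal)

# One V/A pair per 0.5-second interval of a new clip.
new_clip = rng.normal(size=(60, 17))  # e.g. the output of extract_features()
va_pairs = np.column_stack([valence_model.predict(new_clip),
                            arousal_model.predict(new_clip)])
print(va_pairs.shape)  # (60, 2)
```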

Current Performance

Valence and arousal are predicted on a scale of -1 to 1. Further optimization, testing, and statistical analysis will be performed.

Current benchmarks

  • Valence: MSE 0.0275, R² 0.4996
  • Arousal: MSE 0.0282, R² 0.6524
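
These are standard regression metrics; assuming predictions and annotations for a held-out split, they can be computed along these lines (the arrays here are placeholders, not real evaluation data):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical held-out split: y_true holds per-interval annotations and
# y_pred the corresponding model predictions for one target (e.g. valence).
y_true = np.array([0.12, -0.30, 0.45, 0.08])
y_pred = np.array([0.10, -0.25, 0.40, 0.00])
print(f"MSE: {mean_squared_error(y_true, y_pred):.4f}")
print(f"R^2: {r2_score(y_true, y_pred):.4f}")
```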