Configurations

Config files in MMSA-FET are JSON files containing up to 3 sections: audio, video and text. You can extract multimodal features as well as unimodal ones; just remove any section you don't need.

An overall template is sketched below. The rest of this page introduces the detailed configuration options for each tool.
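
For example, a full multimodal config combining one tool per modality has the following shape (the tool choices here are only placeholders; each tool's options are detailed in the sections below):

{
  "audio": {
    "tool": "librosa"   // audio tool config, see section 1
  },
  "video": {
    "tool": "openface"  // video tool config, see section 2
  },
  "text": {
    "model": "bert"     // text tool config, see section 3 (note the key is "model" here)
  }
}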

1. Audio Tools

1.1 Librosa

{
  "audio": {
    "tool": "librosa",
    "sample_rate": null,        // null means auto detect
    "args": {
      "mfcc": {
        "n_mfcc": 20,
        "htk": true
      },
      "rms": {},                // remove this line if you don't need rms feature
      "zero_crossing_rate": {}, // remove this line if you don't need zero_crossing_rate feature
      "spectral_rolloff": {},   // remove this line if you don't need spectral_rolloff feature
      "spectral_centroid": {}   // remove this line if you don't need spectral_centroid feature

      // add more features here as needed.
      // supported features are listed on this page: https://librosa.org/doc/latest/feature.html
    }
  }
}
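
For instance, to add chroma features on top of the defaults above, an entry like the following should work, assuming each key under args names a librosa.feature function and its value is passed on as keyword arguments (n_chroma is chroma_stft's own parameter):

{
  "audio": {
    "tool": "librosa",
    "sample_rate": null,
    "args": {
      "mfcc": { "n_mfcc": 20, "htk": true },
      "chroma_stft": { "n_chroma": 12 } // assumed to map to librosa.feature.chroma_stft
    }
  }
}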

1.2 openSMILE

{
  "audio": {
    "tool": "opensmile",
    "sample_rate": 16000,             // opensmile uses 16000 bitrate
    "args": {
      "feature_set": "eGeMAPS",       // opensmile feature sets: https://audeering.github.io/opensmile-python/api-smile.html#featureset
      "feature_level": "Functionals", // opensmile config: https://audeering.github.io/opensmile-python/api-smile.html#featurelevel
      "start": null,                  // passed to opensmile.process_signal: 
      "end": null                     // https://audeering.github.io/opensmile-python/api-smile.html#opensmile.Smile.process_signal
    }
  }
}
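
For frame-level features rather than a single vector of functionals per file, a config along these lines should work, assuming MMSA-FET passes these names straight through to opensmile's FeatureSet and FeatureLevel (so any member name from the linked docs would be valid):

{
  "audio": {
    "tool": "opensmile",
    "sample_rate": 16000,
    "args": {
      "feature_set": "ComParE_2016",          // a larger low-level feature set
      "feature_level": "LowLevelDescriptors", // one feature vector per frame instead of per file
      "start": null,
      "end": null
    }
  }
}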

1.3 Wav2vec2

{
  "audio": {
    "tool": "wav2vec",
    "sample_rate": 16000,                       // better use the same sample rate as the pretrained model
    "pretrained": "facebook/wav2vec2-base-960h" // pretrained model name passed to huggingface transformers
  }
}
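
Other wav2vec2 checkpoints from the huggingface hub should plug in the same way, assuming MMSA-FET only loads the base model for feature extraction. For instance, a multilingual setup might look like:

{
  "audio": {
    "tool": "wav2vec",
    "sample_rate": 16000,
    "pretrained": "facebook/wav2vec2-large-xlsr-53" // multilingual checkpoint pretrained on 53 languages
  }
}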

2. Video Tools

2.1 OpenFace

{
  "video": {
    "tool": "openface",      // features are discribed here: https://github.com/TadasBaltrusaitis/OpenFace/wiki/Output-Format
    "fps": 25,               // how many images to generate in one second of the video
    "multiFace": {           // TalkNet ASD configs
      "enable": false,       // enable ASD
      "facedetScale": 0.25,
      "minTrack": 10,
      "numFailedDet": 10,
      "minFaceSize": 1,
      "cropScale": 0.4
    },
    "average_over": 1,       // average results of n frames
    "args": {
      "hogalign": false,     // generate HOG binary file (cannot be used by MMSA)
      "simalign": false,     // generate images of aligned faces (for visualization)
      "nobadaligned": false, // don't generate images for faces that are badly aligned, will save disk space
      "landmark_2D": true,   // facial landmarks in 2D
      "landmark_3D": false,  // facial landmarks in 3D
      "pdmparams": false,    // rigid face shape (location, scale and rotation) and non-rigid face shape (deformation due to expression and identity)
      "head_pose": true,     // head pose
      "action_units": true,  // action units: https://github.com/TadasBaltrusaitis/OpenFace/wiki/Action-Units
      "gaze": true,          // gaze related features
      "tracked": false       // output video with detected landmarks
    }
  }
}

2.2 Mediapipe

{
  "video": {
    "tool": "mediapipe",
    "fps": 25,                                 // how many images to generate in one second of the video
    "multiFace": {                             // TalkNet ASD configs
      "enable": false,                         // enable ASD
      "facedetScale": 0.25,
      "minTrack": 10,
      "numFailedDet": 10,
      "minFaceSize": 1,
      "cropScale": 0.4
    },
    "args": {
      "face_mesh": {                           // face_mesh mode, only facial landmarks
        "refine_landmarks": true,              // more landmarks around eyes
        "min_detection_confidence": 0.35,      // range between [0.0, 1.0]
        "min_tracking_confidence": 0.5         // range between [0.0, 1.0]
      },
      "holistic": {
        "model_complexity": 1,                 // 0, 1 or 2. The higher, the more accurate & slower
        "smooth_landmarks": true,              // filter landmarks across different input images to reduce jitter
        "enable_segmentation": true,           // visualization, mask background
        "smooth_segmentation": true,           // visualization, reduce jitter
        "min_detection_confidence": 0.5,       // range between [0.0, 1.0]
        "min_tracking_confidence": 0.5         // range between [0.0, 1.0], if lower than this, will invoke face detection
      },
      "visualize": false,                      // output visulized landmarks, ignored when extracting datasets
      "visualize_dir": "~/.MMSA-FET/visualize" // image output directory
    }
  }
}

2.3 TalkNet

TalkNet has no standalone section; its options live in the multiFace block inside each video tool's config, as shown in the examples above and the snippet below.
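
For example, to run ASD on a multi-person video, flip enable to true inside the video section and keep the other keys at their defaults from above:

{
  "video": {
    "tool": "openface",    // or "mediapipe"
    "fps": 25,
    "multiFace": {
      "enable": true,      // turn on active speaker detection
      "facedetScale": 0.25,
      "minTrack": 10,
      "numFailedDet": 10,
      "minFaceSize": 1,
      "cropScale": 0.4
    }
    // plus the "args" block of the chosen tool, as shown in 2.1 / 2.2
  }
}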

3. Text Tools

3.1 BERT

{
  "text": {
    "model": "bert",
    "device": "cpu",                   // due to unresolved issue of pytorch dataloader, gpu is supported only when num_workers is set to 0
    "pretrained": "bert-base-uncased", // pretrained model name passed to huggingface transformers
    "args": {}
  }
}
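
Since the pretrained name is passed straight to huggingface transformers, other BERT checkpoints should work the same way. For example, a Chinese model running on GPU might be configured like this (untested sketch; note the num_workers caveat above):

{
  "text": {
    "model": "bert",
    "device": "cuda",                  // requires num_workers set to 0, see the note above
    "pretrained": "bert-base-chinese", // any BERT checkpoint on the huggingface hub
    "args": {}
  }
}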

3.2 XLNet