Turn management based on VAP

Turn management module based on the voice activity projection (VAP) model.

This module considers signals from both the user and the agent.

Currently, there are two modes:

  • VAP based on audio
  • VAP based on audio and face image embedding
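
As background, a VAP model predicts, frame by frame, how likely each participant is to be speaking in the near future; the turn manager then compares the probability mass assigned to the user and to the agent to decide whether the agent should take or hold the turn. The sketch below only illustrates this decision logic; the function name, thresholds, and labels are assumptions for illustration, not the module's actual API.

```python
# Minimal sketch of a VAP-style hold/shift decision (illustrative only).
# The real TurnManagementContinuous module wraps a trained VAP network;
# here we assume it has already produced per-speaker "near-future speech"
# probabilities for the current prediction window.

def decide_turn(p_future_user: float, p_future_agent: float,
                threshold: float = 0.6) -> str:
    """Return who is predicted to hold the floor in the near future."""
    if p_future_agent >= threshold:
        return "agent-take-turn"   # agent should start (or keep) speaking
    if p_future_user >= threshold:
        return "user-keeps-turn"   # agent should keep listening
    return "hold"                  # ambiguous: wait for more evidence

# Example: the model is fairly sure the user has finished speaking.
print(decide_turn(p_future_user=0.2, p_future_agent=0.75))  # agent-take-turn
```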

Installation

Please follow the common installation instructions.

Download models

  1. Download and uncompress the dlib face detection models from here and here, then place them into bin/Common/Data/TurnManagement/dlib_models
  2. Download all VAP models in model/VAP from here and place them into bin/Common/Data/TurnManagement/models/
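
Before launching Greta, it can be useful to confirm that the downloaded files ended up in the expected folders. The check below is a simple sketch; the two dlib filenames are placeholders (commonly distributed dlib model files are assumed), so adjust them to whatever you actually downloaded.

```python
# Sanity check: are the downloaded model files where the module expects them?
# Paths follow the wiki; the dlib filenames below are assumed placeholders.
from pathlib import Path

DATA_DIR = Path("bin/Common/Data/TurnManagement")

expected = [
    DATA_DIR / "dlib_models" / "shape_predictor_68_face_landmarks.dat",  # assumed name
    DATA_DIR / "dlib_models" / "mmod_human_face_detector.dat",           # assumed name
    DATA_DIR / "models",  # directory holding the VAP checkpoints
]

for path in expected:
    status = "ok" if path.exists() else "MISSING"
    print(f"{status:7s} {path}")
```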

Usage

  1. In Modular.jar, add the following modules:
  • Feedbacks module from [Add -> Feedbacks -> Feedbacks]
  • Capture Controller module from [Add -> Player -> Capture Controller]
  • DeepGramContinuous module from [Add -> Dialogue -> DeepASR -> DeepGramContinuous]
  • A variant of the incremental LLM module (Mistral Incremental, MI Counselor Incremental) from [Add -> Dialogue -> LLM -> ...]
  • Microphone module from [Add -> Input -> Microphone]
  • TurnManagementContinuous module from [Add -> Input -> Dialogue -> TurnManagementContinuous]
  2. Create the following connections in Modular.jar:
  • Capture Controller -> Ogre Player
  • Behavior Realizer -> Feedbacks
  • Feedbacks -> TurnManagementContinuous
  • Feedbacks -> DeepGramContinuous
  • DeepASR -> TurnManagementContinuous
  • TurnManagementContinuous -> BehaviorPlanner
  • TurnManagementContinuous -> Incremental LLM
  3. In the Incremental LLM module, select the "Online" model and click the "enable" checkbox
  4. In the DeepGramContinuous module, click the "enable" checkbox, then click the "Listen" button
  5. In the Capture Controller module, click the "fixed index" checkbox, select the "real-time" checkbox, and click the "Video" button
  6. In the TurnManagementContinuous module, select the model to run (e.g. audio only, audio/faceEmbed) and click the "activate" checkbox
  7. Wait until you observe "[TurnManagement Greta] Generator started" in the NetBeans output log.
  • If you select the audio/faceEmbed model, you can also observe clipped faces of the user and the agent for inspection (a sketch of this face-clipping step follows this list).
  8. Now you can start the conversation (the initial turn is always yours).
  • Since the VAP model is a neural network, its predictions are not always perfect. You may sometimes need to repeat a word.
  • The VAP model based on audio and face image embedding requires more time to start up, so allow it some extra time.
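
The face clips mentioned under step 7 are produced inside the module itself. The snippet below is only a hedged sketch of what such a face-clipping step typically looks like with dlib and OpenCV; the module's actual code, detector choice, and crop size are not documented here and are assumed.

```python
# Illustrative face-clipping step with dlib + OpenCV (not the module's code).
# Assumes a webcam at index 0; the crop is what a face-embedding encoder
# (e.g. a FormerDFER-style model) would consume.
import cv2
import dlib

detector = dlib.get_frontal_face_detector()  # HOG-based frontal face detector
cap = cv2.VideoCapture(0)

ok, frame = cap.read()
if ok:
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for rect in detector(gray, 1):  # upsample once to catch small faces
        x1, y1 = max(rect.left(), 0), max(rect.top(), 0)
        x2 = min(rect.right(), frame.shape[1])
        y2 = min(rect.bottom(), frame.shape[0])
        face = cv2.resize(frame[y1:y2, x1:x2], (112, 112))  # assumed input size
        cv2.imwrite("clipped_face.png", face)

cap.release()
```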

Note

License for pretrained models

  • The pre-trained CPC model located at encoders/cpc/60k_epoch4-d0f474de.pt comes from the original CPC project; please follow its specific license. Refer to the original repository (https://github.com/facebookresearch/CPC_audio) for more details.
  • The pre-trained FormerDFER model located at encoders/FormerDFER/DFER_encoder_weight_only.pt is a simplified version of the original pre-trained model (model_set_1.pt with the temporal transformer and linear layer removed) from the original Former-DFER project; please follow its specific license. Refer to the original repository (https://github.com/zengqunzhao/Former-DFER) for more details.
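
If you want to confirm what a downloaded checkpoint actually contains (for example, that the simplified FormerDFER file no longer includes the temporal transformer), it can be inspected with plain PyTorch as below. This is a generic inspection sketch, not part of the module; the relative paths come from the bullets above and the working directory is assumed to be the models folder.

```python
# Inspect the pre-trained encoder checkpoints with plain PyTorch.
# Assumed working directory: bin/Common/Data/TurnManagement/models/
import torch

for path in ("encoders/cpc/60k_epoch4-d0f474de.pt",
             "encoders/FormerDFER/DFER_encoder_weight_only.pt"):
    ckpt = torch.load(path, map_location="cpu")
    # Checkpoints may store a bare state_dict or wrap it in a dict.
    state = ckpt["state_dict"] if isinstance(ckpt, dict) and "state_dict" in ckpt else ckpt
    if isinstance(state, dict):
        print(path, "->", len(state), "tensors")
        for key in list(state)[:5]:  # show a few parameter names
            print("   ", key)
    else:
        print(path, "->", type(state).__name__)
```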

Reference

  • Erik Ekstedt, Gabriel Skantze. 2022. “Voice Activity Projection: Self-Supervised Learning of Turn-Taking Events.” In Interspeech 2022, 5190–94. ISCA: ISCA.
  • Koji Inoue, Bing’er Jiang, Erik Ekstedt, Tatsuya Kawahara, and Gabriel Skantze. 2024. “Real-Time and Continuous Turn-Taking Prediction Using Voice Activity Projection.” arXiv [Cs.CL]. arXiv. http://arxiv.org/abs/2401.04868.
  • Takeshi Saga and Catherine Pelachaud. 2025. “Voice Activity Projection Model with Multimodal Encoders.” arXiv [Cs.CL]. arXiv. https://arxiv.org/abs/2506.03980.

Screenshot

If the following images are difficult to view, please try opening them in a new tab from the right-click menu.

Screenshot 2025-04-04 180110