# Turn management based on VAP
A turn-management module based on the voice activity projection (VAP) model.
This module considers signals from both the user and the agent.
Currently, there are two modes (see the conceptual sketch after this list):
- VAP based on audio
- VAP based on audio and face image embedding
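The VAP model continuously predicts how likely each speaker is to be voice-active in the near future (see the references at the bottom of this page). Below is a minimal, illustrative Python sketch of how such per-frame probabilities could be turned into hold/shift decisions; the class, the threshold, and the smoothing window are hypothetical and do not correspond to the actual module code.

```python
# Illustrative only: turning VAP-style per-frame probabilities into turn decisions.
# The threshold and smoothing values are hypothetical, not the module's settings.
from collections import deque

class TurnDecider:
    def __init__(self, threshold=0.5, smooth_frames=10):
        self.threshold = threshold             # above -> user is expected to keep the turn
        self.history = deque(maxlen=smooth_frames)

    def update(self, p_user_speaks_next: float) -> str:
        """p_user_speaks_next: probability (0.0-1.0) that the *user* is
        voice-active in the upcoming window, emitted once per frame."""
        self.history.append(p_user_speaks_next)
        avg = sum(self.history) / len(self.history)
        if avg > self.threshold:
            return "USER_TURN"     # agent should keep listening
        return "AGENT_TURN"        # agent may take the turn and start speaking

decider = TurnDecider()
for p in [0.9, 0.85, 0.6, 0.3, 0.2, 0.1]:   # fake per-frame probabilities
    print(decider.update(p))
```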
## Installation
Please follow the common installation instructions first.
### Download models
- Download and uncompress dlib face detection models from here and here, then place them into bin/Common/Data/TurnManagement/dlib_models
- Download all VAP models in model/VAP from here and place them into bin/Common/Data/TurnManagement/models/ (the expected folder layout is sketched below)
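After these downloads, the data folder should look roughly like this (the exact file names depend on what the downloads contain):

```
bin/Common/Data/TurnManagement/
├── dlib_models/          <- uncompressed dlib face detection models
└── models/               <- VAP models copied from model/VAP
```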
## Usage
- In Modular.jar, add the following modules:
- Feedbacks module from [Add -> Feedbacks -> Feedbacks]
- Capture Controller module from [Add -> Player -> Capture Controller]
- DeepGramContinuous module from [Add -> Dialogue -> DeepASR -> DeepGramContinuous]
- A variant of the incremental LLM module (e.g. Mistral Incremental or MI Counselor Incremental) from [Add -> Dialogue -> LLM -> ...]
- Microphone module from [Add -> Input -> Microphone]
- TurnManagementContinuous module from [Add -> Input -> Dialogue -> TurnManagementContinuous]
- Create the following connections in Modular.jar (summarized in the diagram at the end of this section):
- Capture Controller -> Ogre Player
- Behavior Realizer -> Feedbacks
- Feedbacks -> TurnManagementContinuous
- Feedbacks -> DeepGramContinuous
- DeepASR -> TurnManagementContinuous
- TurnManagementContinuous -> BehaviorPlanner
- TurnManagementContinuous -> Incremental LLM
- In the Incremental LLM module, select the "Online" model and check the "enable" checkbox
- In the DeepGramContinuous module, check the "enable" checkbox and click the "Listen" button
- In the Capture Controller module, check the "fixed index" checkbox, check the "real-time" checkbox, then click the "Video" button
- In the TurnManagementContinuous module, select the model to run (e.g. audio only, audio/faceEmbed) and check the "activate" checkbox
- Wait until you observe "[TurnManagement Greta] Generator started" in the output log in NetBeans.
- If you select the audio/faceEmbed model, you can also observe the cropped faces of the user and the agent for inspection.
- Now you can start the conversation (the initial turn is always yours).
- Since the model is a VAP-based neural network, its predictions are not always perfect; you may occasionally need to repeat yourself.
- The VAP model based on audio and face image embedding takes longer to start up, so please be patient.
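For reference, the connections listed above form the following data flow (only the edges from the list are shown):

```
Capture Controller ──> Ogre Player

Behavior Realizer ──> Feedbacks ──┬──> TurnManagementContinuous
                                  └──> DeepGramContinuous (DeepASR) ──> TurnManagementContinuous

TurnManagementContinuous ──┬──> BehaviorPlanner
                           └──> Incremental LLM
```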
## Note
- For more details, please check the following sources:
- Microphone java project at auxiliary/Microphone
- Python source code at bin/Common/Data/microphone
- Loading agent speech on time: https://github.com/isir/greta/wiki/Technical-showcase#load-audio-using-feedback
## License for pretrained models
- The pre-trained CPC model, located at encoders/cpc/60k_epoch4-d0f474de.pt, comes from the original CPC project; please follow its specific license. Refer to the original repository (https://github.com/facebookresearch/CPC_audio) for more details.
- The pre-trained FormerDFER model, located at encoders/FormerDFER/DFER_encoder_weight_only.pt, is a simplified version of the original pre-trained model (model_set_1.pt) from the Former-DFER project, with the temporal transformer and the final linear layer removed (a sketch of this kind of conversion is shown below). Please follow its specific license. Refer to the original repository (https://github.com/zengqunzhao/Former-DFER) for more details.
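A weight-only simplification like the one described above can be produced roughly as follows; the state-dict key prefixes ("temporal_transformer.", "fc.") are assumptions for illustration and would need to be checked against the actual model_set_1.pt checkpoint.

```python
# Hypothetical sketch: derive a weight-only checkpoint from model_set_1.pt
# by dropping the temporal transformer and the final linear layer.
# The key prefixes below are assumptions, not the checkpoint's actual names.
import torch

ckpt = torch.load("model_set_1.pt", map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt)   # some checkpoints nest the weights

DROP_PREFIXES = ("temporal_transformer.", "fc.")
kept = {k: v for k, v in state_dict.items()
        if not k.startswith(DROP_PREFIXES)}

torch.save(kept, "DFER_encoder_weight_only.pt")
print(f"kept {len(kept)} / {len(state_dict)} tensors")
```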
## References
- Erik Ekstedt and Gabriel Skantze. 2022. “Voice Activity Projection: Self-Supervised Learning of Turn-Taking Events.” In Interspeech 2022, 5190–5194. ISCA.
- Koji Inoue, Bing’er Jiang, Erik Ekstedt, Tatsuya Kawahara, and Gabriel Skantze. 2024. “Real-Time and Continuous Turn-Taking Prediction Using Voice Activity Projection.” arXiv:2401.04868. http://arxiv.org/abs/2401.04868.
- Takeshi Saga and Catherine Pelachaud. 2025. “Voice Activity Projection Model with Multimodal Encoders.” arXiv:2506.03980. https://arxiv.org/abs/2506.03980.
## Screenshot
If the following images are difficult to view, please try opening them in a new tab from the right-click menu.