Turn management based on VAP

Turn management module based on the voice activity projection (VAP) model.

This module considers signals from both the user and the agent.

Currently, there are two modes:

  • VAP based on audio
  • VAP based on audio and face image embedding
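
As background, a VAP model predicts, frame by frame, how likely each participant is to be speaking in the near future; the turn manager then compares the probability mass assigned to the user and to the agent to decide whether the agent should take or hold the turn. The sketch below only illustrates this decision logic; the function name, thresholds, and labels are assumptions for illustration, not the module's actual API.

```python
# Minimal sketch of a VAP-style hold/shift decision (illustrative only).
# The real TurnManagementContinuous module wraps a trained VAP network;
# here we assume it has already produced per-speaker "near-future speech"
# probabilities for the current prediction window.

def decide_turn(p_future_user: float, p_future_agent: float,
                threshold: float = 0.6) -> str:
    """Return who is predicted to hold the floor in the near future."""
    if p_future_agent >= threshold:
        return "agent-take-turn"   # agent should start (or keep) speaking
    if p_future_user >= threshold:
        return "user-keeps-turn"   # agent should keep listening
    return "hold"                  # ambiguous: wait for more evidence

# Example: the model is fairly sure the user has finished speaking.
print(decide_turn(p_future_user=0.2, p_future_agent=0.75))  # agent-take-turn
```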

Installation

Please follow the common installation instructions.

Download models

  1. Download and uncompress the dlib face detection models from here and here, then place them into bin/Common/Data/TurnManagement/dlib_models
  2. Download all VAP models in model/VAP from here and place them into bin/Common/Data/TurnManagement/models/
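
Before launching Greta, it can be useful to confirm that the downloaded files ended up in the expected folders. The check below is a simple sketch; the two dlib filenames are placeholders (commonly distributed dlib model files are assumed), so adjust them to whatever you actually downloaded.

```python
# Sanity check: are the downloaded model files where the module expects them?
# Paths follow the wiki; the dlib filenames below are assumed placeholders.
from pathlib import Path

DATA_DIR = Path("bin/Common/Data/TurnManagement")

expected = [
    DATA_DIR / "dlib_models" / "shape_predictor_68_face_landmarks.dat",  # assumed name
    DATA_DIR / "dlib_models" / "mmod_human_face_detector.dat",           # assumed name
    DATA_DIR / "models",  # directory holding the VAP checkpoints
]

for path in expected:
    status = "ok" if path.exists() else "MISSING"
    print(f"{status:7s} {path}")
```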

Usage

  1. In Modular.jar, add the following modules:
  • Feedbacks module from [Add -> Feedbacks -> Feedbacks]
  • Capture Controller module from [Add -> Player -> Capture Controller]
  • DeepGramContinuous module from [Add -> Dialogue -> DeepASR -> DeepGramContinuous]
  • A variant of the incremental LLM module (Mistral Incremental, MI Counselor Incremental) from [Add -> Dialogue -> LLM -> ...]
  • Microphone module from [Add -> Input -> Microphone]
  • TurnManagementContinuous module from [Add -> Input -> Dialogue -> TurnManagementContinuous]
  2. Create the following connections in Modular.jar:
  • Capture Controller -> Ogre Player
  • Behavior Realizer -> Feedbacks
  • Feedbacks -> TurnManagementContinuous
  • Feedbacks -> DeepGramContinuous
  • DeepASR -> TurnManagementContinuous
  • TurnManagementContinuous -> BehaviorPlanner
  • TurnManagementContinuous -> Incremental LLM
  3. In the Incremental LLM module, select the "Online" model and click the "enable" checkbox
  4. In the DeepGramContinuous module, click the "enable" checkbox, then click the "Listen" button
  5. In the Capture Controller module, click the "fixed index" checkbox, select the "real-time" checkbox, and click the "Video" button
  6. In the TurnManagementContinuous module, select the model to run (e.g. audio only, audio/faceEmbed) and click the "activate" checkbox
  7. Wait until you observe "[TurnManagement Greta] Generator started" in the NetBeans output log.
  • If you select the audio/faceEmbed model, you can also observe clipped faces of the user and the agent for inspection (a sketch of this face-clipping step follows this list).
  8. Now you can start the conversation (the initial turn is always yours).
  • Since the VAP model is a neural network, its predictions are not always perfect. You may sometimes need to repeat a word.
  • The VAP model based on audio and face image embedding requires more time to start up, so allow it some extra time.
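
The face clips mentioned under step 7 are produced inside the module itself. The snippet below is only a hedged sketch of what such a face-clipping step typically looks like with dlib and OpenCV; the module's actual code, detector choice, and crop size are not documented here and are assumed.

```python
# Illustrative face-clipping step with dlib + OpenCV (not the module's code).
# Assumes a webcam at index 0; the crop is what a face-embedding encoder
# (e.g. a FormerDFER-style model) would consume.
import cv2
import dlib

detector = dlib.get_frontal_face_detector()  # HOG-based frontal face detector
cap = cv2.VideoCapture(0)

ok, frame = cap.read()
if ok:
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for rect in detector(gray, 1):  # upsample once to catch small faces
        x1, y1 = max(rect.left(), 0), max(rect.top(), 0)
        x2 = min(rect.right(), frame.shape[1])
        y2 = min(rect.bottom(), frame.shape[0])
        face = cv2.resize(frame[y1:y2, x1:x2], (112, 112))  # assumed input size
        cv2.imwrite("clipped_face.png", face)

cap.release()
```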

Note

License for pretrained models

  • The pre-trained CPC model located at encoders/cpc/60k_epoch4-d0f474de.pt comes from the original CPC project; please follow its specific license. Refer to the original repository (https://github.com/facebookresearch/CPC_audio) for more details.
  • The pre-trained FormerDFER model located at encoders/FormerDFER/DFER_encoder_weight_only.pt is a simplified version of the original pre-trained model (model_set_1.pt with the temporal transformer and linear layer removed) from the original Former-DFER project; please follow its specific license. Refer to the original repository (https://github.com/zengqunzhao/Former-DFER) for more details.
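
If you want to confirm what a downloaded checkpoint actually contains (for example, that the simplified FormerDFER file no longer includes the temporal transformer), it can be inspected with plain PyTorch as below. This is a generic inspection sketch, not part of the module; the relative paths come from the bullets above and the working directory is assumed to be the models folder.

```python
# Inspect the pre-trained encoder checkpoints with plain PyTorch.
# Assumed working directory: bin/Common/Data/TurnManagement/models/
import torch

for path in ("encoders/cpc/60k_epoch4-d0f474de.pt",
             "encoders/FormerDFER/DFER_encoder_weight_only.pt"):
    ckpt = torch.load(path, map_location="cpu")
    # Checkpoints may store a bare state_dict or wrap it in a dict.
    state = ckpt["state_dict"] if isinstance(ckpt, dict) and "state_dict" in ckpt else ckpt
    if isinstance(state, dict):
        print(path, "->", len(state), "tensors")
        for key in list(state)[:5]:  # show a few parameter names
            print("   ", key)
    else:
        print(path, "->", type(state).__name__)
```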

Reference

  • Erik Ekstedt, Gabriel Skantze. 2022. “Voice Activity Projection: Self-Supervised Learning of Turn-Taking Events.” In Interspeech 2022, 5190–94. ISCA: ISCA.
  • Koji Inoue, Bing’er Jiang, Erik Ekstedt, Tatsuya Kawahara, and Gabriel Skantze. 2024. “Real-Time and Continuous Turn-Taking Prediction Using Voice Activity Projection.” arXiv [Cs.CL]. arXiv. http://arxiv.org/abs/2401.04868.
  • Takeshi Saga and Catherine Pelachaud. 2025. “Voice Activity Projection Model with Multimodal Encoders.” arXiv [Cs.CL]. arXiv. https://arxiv.org/abs/2506.03980.

Screenshot

If the following images are difficult to view, please try opening them in a new tab from the right-click menu.

Screenshot 2025-04-04 180110