# Technical showcase
## List of contents
- Processing flow and synchronization in Greta
- How to capture agent speech audio on time
- How to integrate neural networks working in real time
## Processing flow and synchronization in Greta
### Processing flow
Basically, you can connect modules as you like, as long as the input signal types are compatible and the connection is defined in `bin/Modular.xml`.

For the process from `SpeechKeyframe` to `PhonemSequence` and `Audio` (the text-to-speech process) inside the `BehaviorRealizer`, refer to the TTS section below.
| Step | Module name | Input | Output | Module example |
|---|---|---|---|---|
| 1 | IntentionEmitter | FML | Intention | FML File Reader, Mistral |
| 2 | BehaviorPlanner | Intention | Signal | Behavior Planner |
| 3 | BehaviorRealizer | BML, Signal | Keyframe | Behavior Realizer |
| - | (4-1, 4-2, and 4-3 run in parallel) | - | - | - |
| 4-1 | AnimationKeyframePerformer | Gesture/Head/Shoulder/TorsoKeyframe | BAPFrame | |
| 4-2 | LipModel | PhonemSequence (nearly equal to phoneme keyframe) | FAPFrame | |
| 4-3-1 | FaceKeyframePerformer | AUKeyFrame | AUAPFrame | |
| 4-3-2 | SimpleAUPerformer | AUAPFrame | FAPFrame | |
| - | - | - | - | - |
| 5 | FAPFrame/BAPFrame/Audio Performer | FAPFrame, BAPFrame, Audio | MPEG4 Animatable | |
### TTS
Since the timing information in the Greta platform is driven by TTS, the TTS process has to be completed before the other modalities are realized. Based on the timing information of the `PhonemSequence` and `Audio` objects, Greta manages further synchronization with the other modalities.
- Inside the Behavior Realizer, `temporizer.temporize()` is called with the list of signals, including `SpeechSignal`, which is an extension of the `Speech` class.
- Inside `temporizer.temporize()`, `SpeechSignal.schedule()` is called.
- Inside `SpeechSignal.schedule()`, the TTS engine is called (via the `characterManager`) to convert the `SpeechSignal` into a `PhonemSequence` and an `Audio` object.
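The call order above can be pictured with the short sketch below. This is only an illustration of the ordering; the real implementation is Java code inside the Behavior Realizer, and the names here are simplified stand-ins rather than the actual Greta API.

```python
# Illustrative sketch of the TTS-first scheduling order described above.
# The real implementation is Java inside the Behavior Realizer; the names
# below are simplified stand-ins, not the actual Greta classes.

class SpeechSignal:
    """Stand-in for Greta's SpeechSignal (an extension of Speech)."""

    def schedule(self):
        # In Greta, this is where the TTS engine is called (via the
        # characterManager) and the resulting PhonemSequence and Audio,
        # with their timing information, are attached to the signal.
        print("TTS called: PhonemSequence + Audio produced")


def temporize(signals):
    """Stand-in for temporizer.temporize(): speech is scheduled first so
    that its timing can drive the other modalities."""
    for signal in signals:
        if isinstance(signal, SpeechSignal):
            signal.schedule()
    # ...timing of the remaining (non-speech) signals is resolved here...


if __name__ == "__main__":
    temporize([SpeechSignal()])
```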
### Synchronization
- Once `temporizer.temporize()` is called inside the Behavior Realizer, the missing timing information for the list of signals is filled in and checked for consistency.
- The signals are converted to keyframes by keyframe generators.
- Those keyframes are re-ordered by their time markers, and `KeyframePerformer.performKeyframes()` is called for each modality.
- While `KeyframePerformer.performKeyframes()` is being processed, the keyframe information (id, absolute start time, absolute end time) is also sent to `CallbackSender`, which controls callback emission timing (e.g., callbacks telling when the agent's speech starts and ends).
- Inside the `KeyframePerformer`, frames are interpolated and FAPFrames/BAPFrames are generated, then sent to the animation players (e.g., OgrePlayer).
## How to capture agent speech audio on time
Since Greta does not have a real-time audio interface, we need to rely on a few workarounds.
### Method 1: load audio using feedback
This section explains an implementation example from the VAP-based turn management, where the audio is loaded using the callbacks generated by the Feedbacks module. Since the turn-management module uses a Python-based VAP model, the core part of the implementation is in Python; the Java parts basically serve as interfaces for Python. Please check the `turnManagement` module in the Greta Java project and `bin/Common/Data/TurnManagement/turnManager_vap_audio_faceEmbed_refactored.py` for the details.
- When the agent starts speaking, a "start" callback is sent.
- Once your module receives the "start" callback, you can load the audio wav file from `bin\output.wav`.
- In `turnManager_vap_audio_faceEmbed_refactored.py`, the `agentAudio` class in `func_vap.py` handles this part.
- When the "start" callback is received, the `agent_speaking_state` multiprocess-shared object becomes `True`, indicating that the agent is speaking.
- `agentAudio` loads an audio chunk whenever its `get()` method is called, until the "end" callback is received, which is sent by the Feedbacks module.
- When the "end" callback is received, the `agent_speaking_state` multiprocess-shared object becomes `False`.
- If the agent is not speaking, `agentAudio` extends the audio chunk with a blank sequence of zero values.
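A minimal sketch of this chunk-loading logic is shown below. It is a simplified illustration of the behaviour described above, assuming a mono wav file; the class and attribute names are not copied from `func_vap.py`, so refer to that script for the real implementation.

```python
# Simplified sketch of the agent-audio loading logic described above.
# Names (AgentAudio, agent_speaking_state, chunk_size) are illustrative;
# see func_vap.py in bin/Common/Data/TurnManagement for the real code.
import wave
import multiprocessing as mp

import numpy as np


class AgentAudio:
    def __init__(self, wav_path="bin/output.wav", chunk_size=1600):
        self.wav_path = wav_path
        self.chunk_size = chunk_size        # frames per chunk (mono assumed)
        self.wav = None                     # opened on the "start" callback
        # Shared flag: True after the "start" callback, False after "end"
        self.agent_speaking_state = mp.Value("b", False)

    def on_start_callback(self):
        self.agent_speaking_state.value = True
        self.wav = wave.open(self.wav_path, "rb")

    def on_end_callback(self):
        self.agent_speaking_state.value = False
        if self.wav is not None:
            self.wav.close()
            self.wav = None

    def get(self):
        """Return the next chunk of agent audio as int16 samples."""
        if self.agent_speaking_state.value and self.wav is not None:
            raw = self.wav.readframes(self.chunk_size)
            chunk = np.frombuffer(raw, dtype=np.int16)
            if len(chunk) < self.chunk_size:   # end of file: pad with zeros
                chunk = np.pad(chunk, (0, self.chunk_size - len(chunk)))
            return chunk
        # Agent is not speaking: return a blank (all-zero) chunk
        return np.zeros(self.chunk_size, dtype=np.int16)
```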
### Method 2: receive the audio signal through CaptureController
Although I have not verified it yet, I saw a possible candidate implementation that might be able to send the audio signal to other modules via CaptureController. You might be able to use it as an alternative.
Note, however, that if you want to use it, you might need to implement the server-client communication yourself.
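If you do implement it yourself, the receiving side could look roughly like the following minimal sketch; the host, port, and raw-bytes framing are assumptions, not anything provided by CaptureController.

```python
# Minimal TCP client sketch for receiving an audio stream from a hypothetical
# sender. The endpoint and the raw-bytes framing are assumptions; you would
# have to implement the matching sender on the Greta side yourself.
import socket

HOST, PORT = "localhost", 9000   # hypothetical endpoint
CHUNK_BYTES = 3200               # e.g., 100 ms of 16 kHz mono int16 audio


def receive_audio():
    with socket.create_connection((HOST, PORT)) as sock:
        while True:
            data = sock.recv(CHUNK_BYTES)
            if not data:
                break            # sender closed the connection
            yield data           # hand the raw bytes to your model / buffer


if __name__ == "__main__":
    for chunk in receive_audio():
        print(f"received {len(chunk)} bytes of audio")
```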
## How to integrate neural networks working in real time
### MODIFF
Case 1: MODIFF is a diffusion-based facial expression generation model. It takes the interlocutor's facial action units from OpenFace and the agent's own action units as input, and outputs the agent's facial action units for the next time step. It uses the previous output of the model itself as one of its inputs.
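The feedback loop itself can be sketched as follows; the model call, the number of action units, the frame rate, and the helper functions are placeholders for illustration, not the actual MODIFF interface.

```python
# Illustrative real-time loop for an autoregressive model such as MODIFF:
# the previous agent output is fed back as part of the next input.
# `model`, `read_openface_aus`, `send_to_agent`, and the constants below
# are placeholders, not the actual MODIFF code.
import time

import numpy as np

NUM_AUS = 17           # assumed number of action units per frame
FRAME_PERIOD = 1 / 25  # assumed 25 fps


def run_realtime(model, read_openface_aus, send_to_agent):
    prev_agent_aus = np.zeros(NUM_AUS)        # initial agent state
    while True:
        user_aus = read_openface_aus()        # interlocutor AUs from OpenFace
        # The model conditions on the user's AUs and its own previous output.
        agent_aus = model(user_aus, prev_agent_aus)
        send_to_agent(agent_aus)              # drive the agent's face
        prev_agent_aus = agent_aus            # feedback for the next step
        time.sleep(FRAME_PERIOD)
```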
### Turn management based on VAP
Case 2: The VAP model predicts upcoming turn-taking behaviors (turn shift, backchannel, etc.) from the preceding audio and facial images of the user and the agent.
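A similar sliding-window loop can be used for a VAP-style predictor; again, the window length, the feature extraction, and the model interface below are assumptions for illustration only.

```python
# Illustrative real-time loop for a VAP-style turn-taking predictor:
# keep a sliding window of recent user/agent audio and face features and
# ask the model for upcoming turn-taking behaviour. All interfaces here
# are placeholders, not the actual VAP implementation.
from collections import deque

import numpy as np

WINDOW_STEPS = 50   # assumed context length (e.g., 5 s at 10 Hz)


def run_vap(model, get_audio_step, get_face_step, on_prediction):
    audio_ctx = deque(maxlen=WINDOW_STEPS)   # (user, agent) audio features
    face_ctx = deque(maxlen=WINDOW_STEPS)    # (user, agent) face features
    while True:
        audio_ctx.append(get_audio_step())
        face_ctx.append(get_face_step())
        if len(audio_ctx) == WINDOW_STEPS:
            # e.g., probabilities for turn shift, backchannel, hold, ...
            prediction = model(np.array(audio_ctx), np.array(face_ctx))
            on_prediction(prediction)
```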