Technical showcase


Processing flow and synchronization in Greta

Processing flow

Broadly speaking, you can connect modules however you like, as long as the input and output signal types are compatible and the connection is defined in bin/Modular.xml.

For the conversion from SpeechKeyframe to PhonemSequence and Audio (the text-to-speech process) inside the BehaviorRealizer, see the TTS section below.

| Step | Module name | Input | Output | Module example |
| --- | --- | --- | --- | --- |
| 1 | IntentionEmitter | FML | Intention | FML File Reader, Mistral |
| 2 | BehaviorPlanner | Intention | Signal | Behavior Planner |
| 3 | BehaviorRealizer | BML, Signal | Keyframe | Behavior Realizer |
| 4-1 | AnimationKeyframePerformer | Gesture/Head/Shoulder/TorsoKeyframe | BAPFrame | - |
| 4-2 | LipModel | PhonemSequence (nearly equal to a phoneme keyframe) | FAPFrame | - |
| 4-3-1 | FaceKeyframePerformer | AUKeyFrame | AUAPFrame | - |
| 4-3-2 | SimpleAUPerformer | AUAPFrame | FAPFrame | - |
| 5 | FAPFrame/BAPFrame/Audio Performer | FAPFrame, BAPFrame, Audio | - | MPEG4 Animatable |

Steps 4-1, 4-2, and 4-3 run in parallel.

TTS

Since the timing information in the Greta platform is driven by TTS, the TTS process has to be completed before the other modalities are realized. Based on the timing information of the PhonemSequence and Audio objects, Greta manages further synchronization with the other modalities; a small sketch after the steps below illustrates this.

  1. Inside the Behavior Realizer, temporizer.temporize() is called with the list of signals, including SpeechSignal, which is an extension of the Speech class.
  2. Inside temporizer.temporize(), SpeechSignal.schedule() is called.
  3. Inside SpeechSignal.schedule(), the TTS engine is called (via the characterManager) to convert the SpeechSignal into a PhonemSequence and an Audio object.
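
To make the role of this timing information concrete, here is a minimal Python sketch (Greta itself is implemented in Java, so this is purely illustrative) of how absolute start and end times can be derived from a phoneme sequence once TTS has produced it. The Phoneme structure and the durations are hypothetical.

```python
# Minimal illustrative sketch (not Greta's Java implementation): once TTS returns
# a phoneme sequence with durations, every other modality can be anchored to the
# resulting absolute times.

from dataclasses import dataclass

@dataclass
class Phoneme:               # hypothetical structure, for illustration only
    symbol: str
    duration: float          # seconds

def absolute_phoneme_times(phonemes, speech_start):
    """Return (symbol, start, end) triples in absolute time."""
    t = speech_start
    timed = []
    for p in phonemes:
        timed.append((p.symbol, t, t + p.duration))
        t += p.duration
    return timed

# Example: a short utterance scheduled to start at t = 2.0 s.
seq = [Phoneme("h", 0.08), Phoneme("e", 0.12), Phoneme("l", 0.10), Phoneme("o", 0.20)]
for symbol, start, end in absolute_phoneme_times(seq, speech_start=2.0):
    print(f"{symbol}: {start:.2f} -> {end:.2f} s")
```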

Synchronization

  1. Once temporizer.temporize() is called inside the Behavior Realizer, missing timing information is filled in and its consistency is checked for the list of signals.
  2. The signals are converted to keyframes by keyframe generators.
  3. The keyframes are re-ordered according to their time markers.
  4. KeyframePerformer.performKeyframes() is called for each modality.
  5. While KeyframePerformer.performKeyframes() is being processed, the keyframe information (id, absolute start time, absolute end time) is also sent to the CallbackSender, which controls callback emission timing (e.g., callbacks signalling the start and end of the agent's speech). A schematic sketch of steps 3 to 5 follows this list.
  6. Inside the KeyframePerformer, frame interpolation is performed and FAPFrames/BAPFrames are generated, then sent to the animation players (e.g., OgrePlayer).
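
Steps 3 to 5 can be sketched schematically as follows; the data layout and names below are illustrative only and do not reflect Greta's actual Java classes.

```python
# Illustrative sketch of steps 3-5 (not Greta's Java API): keyframes are sorted
# by their time markers, then each modality performer is fed its keyframes while
# (id, start, end) triples are forwarded to a callback sender.

from collections import defaultdict

keyframes = [  # hypothetical keyframes: (modality, id, start, end) in seconds
    ("gesture", "g1", 1.2, 1.8),
    ("speech",  "s1", 0.0, 2.5),
    ("face",    "f1", 0.4, 0.9),
]

# Step 3: re-order by start time.
keyframes.sort(key=lambda kf: kf[2])

# Steps 4-5: dispatch per modality and notify a callback sender.
per_modality = defaultdict(list)
for modality, kf_id, start, end in keyframes:
    per_modality[modality].append((kf_id, start, end))
    print(f"callback sender notified: {kf_id} starts at {start}s, ends at {end}s")

for modality, kfs in per_modality.items():
    print(f"performKeyframes({modality}): {kfs}")
```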

How to capture agent speech audio on time

Since Greta does not provide a real-time audio interface, a few workarounds are needed.

Method 1: load audio using feedback

This section explains an implementation example from the VAP-based turn management, where the audio is loaded using the callbacks generated by the Feedbacks module. Since the turn-management module uses a Python-based VAP model, the core of the implementation is in Python; the Java parts essentially act as interfaces for Python. See the turnManagement module in the Greta Java project and bin/Common/Data/TurnManagement/turnManager_vap_audio_faceEmbed_refactored.py for details. A simplified sketch follows the list of steps below.

  1. When the agent starts speaking, a "start" callback is sent.
  2. Once your module receives the "start" callback, you can load the audio wav file in bin\output.wav.
  3. In turnManager_vap_audio_faceEmbed_refactored.py, the agentAudio class in func_vap.py handles this part.
  4. When the "start" callback is received, the multiprocess-shared object "agent_speaking_state" becomes True, indicating that the agent is speaking.
  5. agentAudio loads an audio chunk whenever its get() method is called, until the "end" callback (sent by the Feedbacks module) is received.
  6. When the "end" callback is received, the multiprocess-shared object "agent_speaking_state" becomes False.
  7. While the agent is not speaking, agentAudio extends the audio chunk with a blank, zero-valued sequence.
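
The sketch below illustrates this pattern. It is not the actual agentAudio class from func_vap.py; the chunk size, the use of Python's wave module, and the way the speaking flag is stored are assumptions made for illustration.

```python
# Simplified sketch of the Method 1 pattern (not the actual agentAudio class in
# func_vap.py): after the "start" callback, audio is read chunk by chunk from
# bin/output.wav; while the agent is silent, zero-valued chunks are returned.

import wave

class AgentAudioSketch:
    def __init__(self, wav_path="bin/output.wav", chunk_frames=800):
        self.wav_path = wav_path
        self.chunk_frames = chunk_frames      # hypothetical chunk size
        self.speaking = False                 # stands in for agent_speaking_state
        self._wav = None

    def on_callback(self, name):
        if name == "start":                   # agent started speaking
            self.speaking = True
            self._wav = wave.open(self.wav_path, "rb")
        elif name == "end":                   # agent finished speaking
            self.speaking = False
            if self._wav is not None:
                self._wav.close()
                self._wav = None

    def get(self):
        """Return the next chunk of agent audio, or silence when not speaking."""
        if self.speaking and self._wav is not None:
            data = self._wav.readframes(self.chunk_frames)
            if data:
                return data
        # Not speaking (or file exhausted): pad with a zero-valued chunk.
        return b"\x00" * (self.chunk_frames * 2)   # assumes 16-bit mono audio
```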

Method 2: receive the audio signal through CaptureController

Although it has not been verified yet, CaptureController contains a candidate implementation that might be able to send the audio signal to other modules. You might be able to use it as an alternative.

Note, however, that if you want to use it, you may need to implement the server-client communication yourself.
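
If you go this route, the transport layer is up to you. Below is a minimal receiver sketch, assuming raw audio bytes arrive over a plain TCP connection; the port, the framing, and the assumption that CaptureController can push raw audio this way are all hypothetical and need to be checked against the Greta source.

```python
# Minimal receiver sketch for Method 2 (hypothetical port and framing;
# CaptureController's actual output format must be checked in the Greta source).

import socket

HOST, PORT = "127.0.0.1", 9000   # assumed address, not defined by Greta

def receive_audio_chunks():
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind((HOST, PORT))
        server.listen(1)
        conn, _ = server.accept()
        with conn:
            while True:
                chunk = conn.recv(4096)      # raw audio bytes; framing is up to you
                if not chunk:
                    break
                yield chunk

if __name__ == "__main__":
    for chunk in receive_audio_chunks():
        print(f"received {len(chunk)} bytes of audio")
```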

How to integrate neural networks running in real time

Case 1: MODIFF

MODIFF is a diffusion-based facial expression generation model. It takes the interlocutor's facial action units (from OpenFace) and the agent's own facial action units as input, and outputs the agent's facial action units for the next time step. The model's previous output is fed back as one of its inputs.
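
This autoregressive loop can be sketched as follows; predict_next_agent_aus is a placeholder, not the actual MODIFF interface, and the number of action units is an assumption.

```python
# Sketch of the autoregressive loop described above (placeholder model call,
# not the actual MODIFF interface; 17 AUs is an assumption based on OpenFace).

import numpy as np

NUM_AUS = 17

def predict_next_agent_aus(user_aus, prev_agent_aus):
    """Placeholder for the diffusion model: returns the agent's AUs for the next step."""
    return 0.5 * (user_aus + prev_agent_aus)   # dummy computation for illustration

agent_aus = np.zeros(NUM_AUS)                  # initial agent state
for step in range(100):                        # real-time loop, one iteration per frame
    user_aus = np.random.rand(NUM_AUS)         # stand-in for OpenFace output
    agent_aus = predict_next_agent_aus(user_aus, agent_aus)  # feed previous output back in
```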

Case 2: Turn management based on VAP

The VAP model predicts upcoming turn-taking behaviors (turn shift, backchannel, etc.) from the recent audio and facial images of the user and the agent.
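
Schematically, the loop keeps rolling windows of recent audio and face features for both participants and queries the model at every step. The sketch below uses a dummy predictor and assumed window and feature sizes; the real model and its I/O live in bin/Common/Data/TurnManagement/turnManager_vap_audio_faceEmbed_refactored.py.

```python
# Schematic sketch of the VAP-based loop (dummy predictor, assumed window sizes;
# the real model and I/O live in turnManager_vap_audio_faceEmbed_refactored.py).

from collections import deque
import numpy as np

AUDIO_WINDOW = 160   # assumed number of recent audio chunks kept per speaker
FACE_WINDOW = 30     # assumed number of recent face embeddings kept per speaker

user_audio = deque(maxlen=AUDIO_WINDOW)
agent_audio = deque(maxlen=AUDIO_WINDOW)
user_face = deque(maxlen=FACE_WINDOW)
agent_face = deque(maxlen=FACE_WINDOW)

def predict_turn_event(ua, aa, uf, af):
    """Placeholder for the VAP model: returns 'hold', 'shift', or 'backchannel'."""
    return "hold"   # dummy output for illustration

# One iteration of the real-time loop: push the newest observations, then predict.
user_audio.append(np.zeros(800))
agent_audio.append(np.zeros(800))
user_face.append(np.zeros(128))
agent_face.append(np.zeros(128))
event = predict_turn_event(user_audio, agent_audio, user_face, agent_face)
print("predicted turn-taking event:", event)
```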