# Speech
## Current state overview

At the moment, the Speech stack is composed of two main components:
- Speech To Text
- Text To Speech
## Speech To Text Stack

It consists of five components, each of which is a ROS node communicating through topics.
- **AudioCapturer**

  `devices/AudioCapturer` [python]: A node that captures the audio using PyAudio and publishes it to the topic `rawAudioChunk`.
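  Below is a minimal sketch of such a capture loop, assuming a `std_msgs/UInt8MultiArray` message, 16-bit mono audio, and a 16 kHz sample rate; the actual node may use a custom message type and different parameters.

  ```python
  #!/usr/bin/env python
  # Hypothetical minimal capture node: reads PCM chunks with PyAudio
  # and publishes them on rawAudioChunk.
  import pyaudio
  import rospy
  from std_msgs.msg import UInt8MultiArray

  CHUNK = 1024   # frames per buffer (assumption)
  RATE = 16000   # sample rate in Hz (assumption)

  def main():
      rospy.init_node("AudioCapturer")
      pub = rospy.Publisher("rawAudioChunk", UInt8MultiArray, queue_size=10)
      pa = pyaudio.PyAudio()
      stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                       input=True, frames_per_buffer=CHUNK)
      while not rospy.is_shutdown():
          data = stream.read(CHUNK, exception_on_overflow=False)
          pub.publish(UInt8MultiArray(data=data))
      stream.stop_stream()
      stream.close()
      pa.terminate()

  if __name__ == "__main__":
      main()
  ```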
- **GetUsefulAudio**

  There are two options to get the useful audio:

  - `devices/InputAudio` [c++]: A node that takes the chunks of audio and, using RNNoise, checks for voice, cuts the audio, removes the noise, and publishes it to the topic `UsefulAudio`. The RNNoise approach fails after running for a while; very long silences affect it.
  - `devices/UsefulAudio` [python]: A node that takes the chunks of audio and, using webrtcvad, checks for voice, cuts the audio, and publishes it to the topic `UsefulAudio`. The webrtcvad approach was created as an alternative: it does not remove silence, but it extracts the pieces of audio in which someone speaks very reliably, and it performs very well.
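  The sketch below shows how such a webrtcvad voice check on a single raw PCM chunk could look; the aggressiveness level and frame length are assumptions, not necessarily the repo's exact values.

  ```python
  # Hypothetical voice-activity check for one chunk of 16-bit 16 kHz mono PCM.
  import webrtcvad

  vad = webrtcvad.Vad(2)   # aggressiveness 0-3 (assumption)
  RATE = 16000
  FRAME_MS = 30            # webrtcvad accepts 10, 20, or 30 ms frames
  FRAME_BYTES = int(RATE * FRAME_MS / 1000) * 2   # bytes per frame (2 bytes/sample)

  def contains_voice(chunk):
      """Return True if any complete frame in the chunk contains speech."""
      frames = [chunk[i:i + FRAME_BYTES]
                for i in range(0, len(chunk) - FRAME_BYTES + 1, FRAME_BYTES)]
      return any(vad.is_speech(frame, RATE) for frame in frames)
  ```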
- **Engine Selector**

  `action_selectors/hear` [python]: This node receives the STT requests. It checks whether there is an internet connection to decide between the online and offline engines; this can be overridden with the `FORCE_ENGINE` parameter.

  - Online engine: implemented in the AzureSpeechToText node. For it, this node resamples the audio from `UsefulAudio` to 16 kHz and publishes it to a new topic called `UsefulAudioAzure` to relay it to that node.
  - Offline engine: implemented in the DeepSpeech node. For it, this node redirects the audio from `UsefulAudio` to a new topic called `UsefulAudioDeepSpeech` to relay it to that node.
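  A sketch of that selection logic, assuming a simple socket-based connectivity check run inside the initialized node; the accepted `FORCE_ENGINE` values shown here are illustrative.

  ```python
  # Hypothetical engine selection: pick the output topic based on
  # connectivity, unless FORCE_ENGINE overrides the decision.
  import socket
  import rospy

  def is_online(host="8.8.8.8", port=53, timeout=2):
      """Cheap connectivity probe against a public DNS server."""
      try:
          socket.create_connection((host, port), timeout=timeout).close()
          return True
      except OSError:
          return False

  def select_output_topic():
      forced = rospy.get_param("~FORCE_ENGINE", "none")  # illustrative values
      if forced == "online":
          return "UsefulAudioAzure"
      if forced == "offline":
          return "UsefulAudioDeepSpeech"
      return "UsefulAudioAzure" if is_online() else "UsefulAudioDeepSpeech"
  ```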
- **Azure Engine**

  `action_selectors/AzureSpeechToText` [c++]: A node that takes the audio published on the topic `UsefulAudioAzure`, sends it to the Azure Speech To Text API, receives the transcribed text, and publishes it to the topic `RawInput`.
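  The node itself is written in C++; for illustration only, an equivalent one-shot recognition with Azure's Python Speech SDK looks roughly like this (key, region, and file name are placeholders).

  ```python
  # Illustrative one-shot recognition against the Azure Speech service.
  import azure.cognitiveservices.speech as speechsdk

  speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")
  audio_config = speechsdk.audio.AudioConfig(filename="useful_audio.wav")
  recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config,
                                          audio_config=audio_config)
  result = recognizer.recognize_once()   # blocks until one utterance is recognized
  print(result.text)
  ```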
- **DeepSpeech2 Engine**

  `action_selectors/DeepSpeech` [python]: A node that takes the audio published on the topic `UsefulAudioDeepSpeech`, runs it through DeepSpeech2 to convert it to text, and publishes the result to the topic `RawInput`.
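  DeepSpeech2 implementations expose different APIs; purely as a stand-in illustration, Mozilla's `deepspeech` package performs a comparable offline transcription as follows (model and audio paths are placeholders).

  ```python
  # Stand-in offline STT example (NOT necessarily the DeepSpeech2
  # implementation used by this repo); paths are placeholders.
  import numpy as np
  from deepspeech import Model

  model = Model("deepspeech.pbmm")              # hypothetical model path
  with open("useful_audio.raw", "rb") as f:     # 16-bit 16 kHz mono PCM assumed
      audio = np.frombuffer(f.read(), dtype=np.int16)
  print(model.stt(audio))                       # transcription as a string
  ```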
## Text To Speech Stack

It consists of one component, a ROS node with topics.
- **Say**

  `devices/say` [python]: A node that says through the speakers whatever is published on the `robot_text` topic. It uses the topic `inputAudioActive` to notify other nodes that the robot is talking. It uses the Google gTTS engine as the online alternative or pyttsx3 as the offline alternative.
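  A minimal sketch of that online/offline split; the playback command for the gTTS output is an assumption (any MP3 player works).

  ```python
  # Hypothetical online (gTTS) vs. offline (pyttsx3) synthesis fallback.
  import os
  import tempfile

  def say(text, online=True):
      if online:
          from gtts import gTTS
          path = os.path.join(tempfile.gettempdir(), "say.mp3")
          gTTS(text=text).save(path)
          os.system("mpg321 " + path)   # hypothetical playback command
      else:
          import pyttsx3
          engine = pyttsx3.init()
          engine.say(text)
          engine.runAndWait()

  say("I am ready to help.", online=False)
  ```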
## Launch File

```bash
roslaunch src/action_selectors/launch/conversation_speech.launch
```
## Miscellaneous

- Retrain LM: To reduce and adapt the LM to our case, kenlm is used. With kenlm's `lmplz`, `filter`, and `build_binary`, a "fine-tuning" is done to generate a new adapted LM with specific phrases of the competition (a sketch of this pipeline follows after this list). Check it here.
- Others: An internal dataset has been created using a website, to fine-tune the speech model.
## Installation Requirements

Check this wiki page.
## Documents

- A review of the speech-related technologies we have used, and currently use, is available here.
## Conversation

Inside the RASA folder:

1. Set up the environment:

   ```bash
   virtualenv -p python3 venv
   sudo make install
   ```

2. Train and create the model:

   ```bash
   sudo make rasa.train
   ```

3. Start the action server and interact through the shell:

   ```bash
   sudo make rasa.develop.run
   ```

4. Start the action server and interact through the API:

   ```bash
   sudo make rasa.run
   ```