
MMBOX - Audio System

1. Introduction

The Multimodal Box (MMBOX) system is divided into three main parts: Audio, System and Video. This documentation covers the first part, Audio. At the time of writing, the whole system is still in an MVP phase. This document is not intended as a manual for setting up and using the system, but rather aims to give a good understanding of how the system is designed and implemented.

2. Implementation

2.1 Overview

The Audio system’s main task is to use speaker diarization in order to know who said what and when. Viewed as a black box, it takes audio as input and produces a Pandas DataFrame as output. The diagram below shows a simple overview of the Audio System as a black box.

Black box of audio system

This is a very simplified model of the Audio System. To understand it more fully, we can divide the system into three main parts: Client, Server and Speaker Diarization. The system is built modularly, meaning each of the three parts can be swapped out and replaced, as long as the replacement follows the communication protocol between the modules.

2.2 Modules - Communication

The modules communicate via different protocols. The Client sends data to the Server with ZMQ, an open source framework for asynchronous messaging. The Server processes the data from the Client, using the Speaker Diarization module to retrieve transcribed words and speaker tags from the audio file, and packages the retrieved data into Pandas DataFrames. The Server then uses a library to send the DataFrames to an InfluxDB database.

As you can see, each part can be swapped out or changed by adjusting only a small amount of code. Currently, the Speaker Diarization is performed through a Google Cloud API, but all we need from the API is the transcription of who said what. To swap it out, we would remove the code that calls the API and retrieves the data (packaged in Python objects), point it at whatever Speaker Diarization tool/framework we want, and adjust it to package the data into DataFrames. We could also swap out the Client, because all that connects the Client with the Server is the ZMQ protocol: we could switch protocol (all we need is an audio file), or add another Client that uses the same protocol to connect and send data to the Server.
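To make the module boundaries concrete, the sketch below shows one way the swappable pieces could be expressed as plain Python functions. The function names and signatures are illustrative assumptions, not the actual code in the repository.

```python
import pandas as pd

# Hypothetical interface sketch: each module boundary is a function that can be
# reimplemented independently, as long as the inputs/outputs stay the same.

def receive_recording() -> dict:
    """Client -> Server boundary: returns a message (hashmap) containing the
    .wav bytes plus metadata such as channels, sample rate and people."""
    raise NotImplementedError  # e.g. a ZMQ PULL socket, or any other transport

def diarize(wav_bytes: bytes, channels: int, sample_rate: int, people: int) -> pd.DataFrame:
    """Speaker Diarization boundary: any backend (Google Cloud API, pyannote, ...)
    that returns 'who said what' as a DataFrame can be plugged in here."""
    raise NotImplementedError

def store(df: pd.DataFrame) -> None:
    """Server -> storage boundary: writes the DataFrame to InfluxDB (or any other sink)."""
    raise NotImplementedError
```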

Image of modules of audio system

2.3 Modules - Client

The Client is simply a Python script designed to run on a Raspberry Pi connected to a ReSpeaker 4-Mic Array. The script is started from the terminal; once it is running, the user presses ‘s’ on the keyboard to start recording. The script is designed to record until an interruption occurs (a minimal sketch of this loop follows the list below). Right now there are two types of interruptions:

  • The user presses ‘q’ on the keyboard.
  • 59 seconds have passed. This limit is currently in place because the Google Cloud API used for Speaker Diarization only transcribes audio files of up to 60 seconds.
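A minimal sketch of such a recording loop is shown below, assuming PyAudio for capture. The chunk size, the Enter-to-stop key handling and the file name are simplified stand-ins for what app.py actually does.

```python
import threading
import time
import wave

import pyaudio

CHANNELS = 4          # ReSpeaker 4-Mic Array
SAMPLE_RATE = 16000   # Hz, assumed value
CHUNK = 1024
MAX_SECONDS = 59      # stay under the Google Cloud API's 60-second limit

def record(path: str = "recording.wav") -> str:
    stop = threading.Event()

    # Simplified stand-in for the 'q' keypress: pressing Enter stops the recording.
    threading.Thread(target=lambda: (input("Press Enter to stop...\n"), stop.set()),
                     daemon=True).start()

    pa = pyaudio.PyAudio()
    stream = pa.open(format=pyaudio.paInt16, channels=CHANNELS,
                     rate=SAMPLE_RATE, input=True, frames_per_buffer=CHUNK)
    frames, start = [], time.time()
    while not stop.is_set() and time.time() - start < MAX_SECONDS:
        frames.append(stream.read(CHUNK))
    stream.stop_stream()
    stream.close()
    pa.terminate()

    # Write the captured frames to a .wav file
    with wave.open(path, "wb") as wf:
        wf.setnchannels(CHANNELS)
        wf.setsampwidth(pyaudio.get_sample_size(pyaudio.paInt16))
        wf.setframerate(SAMPLE_RATE)
        wf.writeframes(b"".join(frames))
    return path
```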

When the audio is recorded (as a .wav file), it is packaged into a hashmap together with other values and sent through the ZMQ protocol to the server. These other values (besides the .wav file) are the following:

  • channels (the number of microphone channels in the recording)
  • sample rate (the sample rate of the audio file, in Hz)
  • people (the number of people speaking in the audio file). Note: this is not yet implemented and has to be fixed; at the moment this information is not passed from client to server and is hardcoded on the server side (set to 4 people).

All these values can be seen and configured at the beginning of the app.py file. Right now the values are hardcoded on the client side (instead of the server side) for higher modularity. This could be improved by creating an interface for providing this information before every recording. Also note that this information is specific to the current implementation of Speaker Diarization, because the Google Cloud API requires it in the API calls.
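As an illustration, the message could look roughly like the dictionary below. The key names are assumptions and may differ from the ones actually used in app.py.

```python
# Hypothetical message layout (key names are illustrative, not taken from app.py)
with open("recording.wav", "rb") as f:
    wav_bytes = f.read()

message = {
    "audio": wav_bytes,     # raw .wav file contents
    "channels": 4,          # number of microphone channels
    "sample_rate": 16000,   # Hz
    "people": 4,            # number of speakers (currently hardcoded on the server)
}
```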

The Client consists of two python files:

  • app.py - handles recording the audio as well as the start (‘s’) and quit (‘q’) keyboard input.
  • client.py - sends the data through the ZMQ protocol to the server.
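A minimal sketch of the sending side, assuming a pyzmq PUSH socket and a message dictionary like the hypothetical one above (the endpoint address and port are placeholders):

```python
import zmq

# Placeholder message; in practice this comes from app.py (see the sketch above)
message = {"audio": b"", "channels": 4, "sample_rate": 16000, "people": 4}

context = zmq.Context()
sender = context.socket(zmq.PUSH)
sender.connect("tcp://<server-ip>:5555")  # placeholder address and port

# send_pyobj pickles the Python dict (the "hashmap") and pushes it to the server
sender.send_pyobj(message)
```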

2.4 Modules - Server

The Server module currently consists of Python scripts running on a Linux VM in the cloud. The server is divided into two main files:

  • diarization_service.py
  • zmq_server.py

There are also two more files:

  • testclient.py (a test client for the ZMQ server; it sends an audio file, but has not been updated to use the hashmap with all the values in its message)
  • config.py (hidden via .gitignore, containing the credentials for InfluxDB)

The diarization_service.py contains the code for talking to the Google Cloud API (Speaker Diarization) and for mapping the resulting values into a Pandas DataFrame.

The zmq_server.py is the code that waits for incoming requests over the ZMQ protocol and then delegates the data to diarization_service.py.

The Server and Client talk via the ZMQ PUSH/PULL pattern (see the ZeroMQ documentation for more details on this pattern).
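The receiving side could look roughly like the sketch below, assuming pyzmq and hypothetical helper names in diarization_service.py (the port is a placeholder):

```python
import zmq

import diarization_service  # hypothetical import of the real service code

context = zmq.Context()
receiver = context.socket(zmq.PULL)
receiver.bind("tcp://*:5555")  # placeholder port; must be open for incoming TCP

while True:
    msg = receiver.recv_pyobj()  # the hashmap sent by the client
    df = diarization_service.diarize(       # hypothetical function name
        msg["audio"], msg["channels"], msg["sample_rate"], msg["people"]
    )
    diarization_service.write_to_influx(df)  # hypothetical function name
```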

The Server is started by simply running zmq_server.py in a terminal (for instance). Remember to also configure the VM's network/firewall rules to allow incoming TCP connections on the port specified for ZMQ.

The credentials used for InfluxDB are currently kept in a config.py file as corresponding variables (see image below). This could of course be changed, as long as you provide the parameters for the InfluxDB connection in diarization_service.py.

config file
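As an illustration, config.py and the write call could look roughly like the following, assuming the legacy influxdb Python client's DataFrameClient. The variable names and values are placeholders, since the real config.py is not in the repository.

```python
# config.py (placeholder values; the real file is kept out of git via .gitignore)
INFLUX_HOST = "localhost"
INFLUX_PORT = 8086
INFLUX_USER = "user"
INFLUX_PASSWORD = "password"
INFLUX_DATABASE = "mmbox"
```

```python
# diarization_service.py (sketch): writing a DataFrame to InfluxDB
import pandas as pd
from influxdb import DataFrameClient

import config

# Placeholder DataFrame; write_points requires a DatetimeIndex
df = pd.DataFrame({"speaker_tag": [1], "word_count": [12]},
                  index=pd.DatetimeIndex([pd.Timestamp.now(tz="UTC")]))

client = DataFrameClient(config.INFLUX_HOST, config.INFLUX_PORT,
                         config.INFLUX_USER, config.INFLUX_PASSWORD,
                         config.INFLUX_DATABASE)
client.write_points(df, "speaker_diarization")  # placeholder measurement name
```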

2.5 Modules - Speaker Diarization

The Speaker Diarization module currently consists of a Google Cloud API ( https://cloud.google.com/speech-to-text/docs/multiple-voices ). The API is comprehensively documented, so there is no need to dig too deep into it in this document. As previously mentioned, there is a set of parameters that can be adjusted. The ones that vary most in our case are:

  • channels
  • number of people
  • sample rate

Another interesting parameter is the “model”. By default it is unset, which means the standard model is used. We have currently changed it to “video” (https://cloud.google.com/speech-to-text/docs/transcription-model), which in our case boosted the results a lot. The video model can recognize even lower-quality audio (and is somewhat pricier than the standard model). This is also the model we have done most of our tests with, since we were all working remotely due to the corona pandemic and had to record audio from one microphone while talking over a video call.
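A sketch of how these parameters could be passed to the API using the google-cloud-speech Python client is shown below; the concrete values and file name are placeholders.

```python
from google.cloud import speech

client = speech.SpeechClient()

diarization_config = speech.SpeakerDiarizationConfig(
    enable_speaker_diarization=True,
    min_speaker_count=2,   # placeholder
    max_speaker_count=4,   # currently hardcoded to 4 people on the server
)

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,      # sample rate from the client
    audio_channel_count=4,        # channels from the client
    language_code="en-US",
    diarization_config=diarization_config,
    model="video",                # the model that boosted our results
)

with open("recording.wav", "rb") as f:
    audio = speech.RecognitionAudio(content=f.read())

response = client.recognize(config=config, audio=audio)
```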

The API returns the results in a result object. We take the word list from it and iterate through it to calculate the number of words each speaker has said, as well as the list of all words associated with that speaker.
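Continuing the sketch above, the per-speaker aggregation into a DataFrame could look roughly like this (the column names are assumptions, not necessarily those used in diarization_service.py):

```python
import pandas as pd

# With diarization enabled, the words of the final result carry speaker tags.
words_info = response.results[-1].alternatives[0].words

speakers = {}
for word_info in words_info:
    entry = speakers.setdefault(word_info.speaker_tag, {"words": [], "word_count": 0})
    entry["words"].append(word_info.word)
    entry["word_count"] += 1

df = pd.DataFrame(
    [{"speaker_tag": tag,
      "word_count": data["word_count"],
      "words": " ".join(data["words"])}
     for tag, data in speakers.items()]
)
```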

All the code for using the Speaker Diarization lives in the Server code, in diarization_service.py.

3. Improvements

3.1 Speaker Diarization Framework

First and foremost, the current framework for speaker diarization is not optimal, for a couple of reasons:

  • It is expensive.
  • We don't own the data; we just send an audio file to Google, which in return gives us transcribed text. This goes against the principle of user and user-data integrity stated in the project.
  • We can't control the model. We have no idea how the model works; we only see the front side of it through its API, so we can't improve it or tweak it much to fit our needs.

The only reason we chose the Google Cloud API for Speaker Diarization was to have a finished MVP. We were initially working with pyannote (an open source Speaker Diarization framework), but it would have taken too much time to get it working properly. Hence, the current speaker diarization is only a proof of concept.

3.2 Real Time Data Streaming

As it is for now, the recordings are manual and bounded by a fixed timeframe. This also means the timestamps in InfluxDB are not optimal and do not match the other sensor data for the same timeframe. The timestamps are assigned at the point where the data is sent to InfluxDB, meaning that if everything arrives in a single batch, all the words get the same timestamp, even though the actual speech times could differ by up to 58 seconds.