
Linguistics Processing With Engage

Something fairly new (since around October 2022 or so) on our journey of innovation has been the development of a new (micro?) service named Engage Linguistics Service (or ELS for short). And it's pretty darn cool!

Let's get to it...

First off, ELS is a headless "service" application just like Engage Bridge Service (EBS) and Engage Activity Recorder (EAR). This means it runs somewhere on the network and performs operations on your Engage groups. (Hopefully you know what those are!) Similar to EBS, ELS kinda "patches" groups together but with a difference. Rather than just essentially forwarding traffic between groups in a session like EBS does, ELS carries out a variety of transformative operations on the content of the traffic it is processing. These transformations include converting spoken voice into text (transcription), converting text into spoken voice (synthesis), and converting text from one language to another (translation). This has all kinds of cool applications in areas such as subtitles and transcriptions, covert communications, and the like. But, for the first go-around of the code, we decided to tackle the most difficult of all: translating spoken languages across groups.

Great, so what does THAT mean!? Well, it means you can set up ELS to take, say, a group where everyone is speaking English, convert that to, say, French, and send the French version of what was said to another group where French-speakers are. (Of course that works in reverse too - someone speaks French on the French group and the English people hear that in English.) Or you could translate between German and Polish; or between Spanish, Latvian, Italian, Ukrainian, Korean, and Japanese. So we're not just talking about one language to another language; we're talking multiple languages translated simultaneously between each other. Pretty slick huh!?

Because we're just using regular Engage groups, your users can use whatever device they have to speak and listen. Engage (and ELS) will take care of the rest. So now you can have German-speaking folks on desktop computers speaking to Dutch users on Android devices and English-speaking people on two-way radios and Farsi users on Apple iOS devices.

How's It Work

Above we said that ELS behaves a lot like EBS in that it kinda "patches" groups together. In fact, it's a helluva lot like EBS - purposefully so - especially in terms of how it's configured. In EBS, you define bridges and the groups within those bridges. In ELS you define sessions and the groups within those sessions. (We worked pretty hard to be as consistent as we could with this stuff to flatten the learning curve. Hopefully you'll think we did a good job.)

Architecture

In an Engage-based system you've got a lot of ways to architect the layout of your service components, what platforms they're running on, what networks they're working over, and where these things are located. To keep things as simple as possible, though, here's a diagram showing a fairly typical architecture:

[architecture diagram]

What we have here is a single machine/VM running Ubuntu 22.04 housing ELS, a proxy for ELS to speak to a back-end, and a Rallypoint (RP). Then, we have Engage-based client applications connecting to that infrastructure in two ways: either directly to the RP where ELS lives, or to other RPs which are in turn connected (peered) to ELS' RP. It doesn't really matter how these clients are connecting into the RP-based infrastructure - i.e. what their ingress method is. All that's really important is that they can get connected and that they register with the RPs with the correct group (channel) identifiers - see below.

Configuration

SUPER IMPORTANT - PLEASE READ !!!!

ELS and its proxy (described below) are a little different from the other things we make in that they do not currently reload their configurations at runtime. There are a number of reasons for this - not the least of which is that this stuff is really, really, REALLY complicated, and having the code process configuration changes at runtime is a very risky endeavor. So, if you make changes to your ELS (or proxy) configurations, those services MUST be restarted.

Alright, let's take a look at some examples of how to configure ELS.

First off, ELS requires a JSON file that provides its core configuration. We won't show the whole thing here - it's pretty much a template. But we will talk about a couple of settings unique to ELS.

{
   .
   .
   "lingoConfigurationFileName":"/etc/engagelingod/lingo.json",
   .
   .
   "proxy":
   {
        "address":"127.0.0.1",
        "port":5555
   },
   .
   .
}

-lingoConfigurationFileName

In this JSON file we tell ELS what sessions we have and what groups are in those sessions. For example: let's imagine we want to translate between an English channel (we'll give it an ID of diners-english) and a French channel (waiters-french). (Yup, you guessed it - the imaginary situation is a bunch of English-speaking people dining in a French restaurant!)

{
    "voiceToVoiceSessions": [
        {
            "id": "bistro-le-souffle",
            "groups": [
                "waiters-french",
                "diners-english"
            ]
        }
    ],

    "groups": [        
        {
            "id": "waiters-french",
            "type": 1,
            "languageCode": "fr-FR",
            "txAudio": {
                "encoder": 25
            },
            "rallypoints": [
                {
                    "id": "local-rp",
                    "host": {
                        "address": "127.0.0.1",
                        "port": 7443
                    }
                }
            ]
        },
        {
            "id": "diners-english",
            "type": 1,
            "languageCode": "en-US",
            "txAudio": {
                "encoder": 25
            },
            "rallypoints": [
                {
                    "id": "local-rp",
                    "host": {
                        "address": "127.0.0.1",
                        "port": 7443
                    }
                }
            ]
        }
    ]
}

Notice how we have a voice-to-voice session with an ID of bistro-le-souffle. In that session, we have the IDs of the two groups we want ELS to translate between - diners-english and waiters-french. Then, in the groups section we tell ELS about those groups. In this case they're both using the Opus codec (encoder 25) and both use the same Rallypoint. That is pretty much standard Engage stuff.

The important field in those group definitions, though, is languageCode. This tells ELS what language is being spoken on that group - en-US denotes English as spoken in the United States, while fr-FR denotes French in the dialect spoken in France itself. [There are TONS of languages and dialects. We'll get into that later.]

The group definitions really are just standard Engage group definitions. So whatever settings your groups have, put 'em here. That includes things like crypto passwords if your groups are encrypted, specialized codecs, multicast addressing instead of Rallypoint links, and so on. ELS uses our standard Engage Engine so all group configuration items you use elsewhere are valid here as well.

ELS does have other types of sessions as well, as we alluded to earlier. We're not going to talk about those right now as we're not yet satisfied with our work on them. So we'll just be using voice-to-voice sessions for now.

Easy huh!?

OK, now that you've fired up ELS with this configuration, you're going to need to set up some Engage clients (or third-party systems like radios) to talk on those groups. In your Engage system you'll have a group with an ID of diners-english which you'll provide to your English people and another with an ID of waiters-french which you'd provide to your French folks. Assuming those groups connect to the same Rallypoint that ELS is connected to (or they're meshed together in some way), you'll have near real-time translations between those groups. [We say near real-time because it's not technically possible to have actual real-time translation. There's going to be a second or two of delay between when you say something in one language and it comes out translated into another.]

Alright, now let's say some German folks walk in the door of our lovely restaurant. We'd want to add them too - no? Here's what that'll look like (we'll also compress the JSON a little for readability):

{
    "voiceToVoiceSessions": [
        {
            "id": "bistro-le-souffle",
            "groups": ["waiters-french", "diners-english", "diners-german"]
        }
    ],

    "groups": [        
        {
            "id": "waiters-french",
            "type": 1,
            "languageCode": "fr-FR",
            "txAudio": {"encoder": 25},
            "rallypoints": [{"id": "local-rp","host": {"address": "127.0.0.1","port": 7443}}]
        },
        {
            "id": "diners-english",
            "type": 1,
            "languageCode": "en-US",
            "txAudio": {"encoder": 25},
            "rallypoints": [{"id": "local-rp","host": {"address": "127.0.0.1","port": 7443}}]
        },
        {
            "id": "diners-german",
            "type": 1,
            "languageCode": "de-DE",
            "txAudio": {"encoder": 25},
            "rallypoints": [{"id": "local-rp","host": {"address": "127.0.0.1","port": 7443}}]
        }
    ]
}

Cool, let's add a new place of business right next door to our fancy French restaurant. Let's imagine it's a Spanish Bar named "Las Taberna". This place will cater to the unique tastes of Italians, Estonians, Germans, and Zulus.

{
    "voiceToVoiceSessions": [
        {
            "id": "bistro-le-souffle",
            "groups": ["waiters-french", "diners-english", "diners-german"]
        },
        {
            "id": "las-taberna",
            "groups": ["bartender-spanish", "patrons-italian", 
                       "patrons-estonian", "patrons-german", 
                       "patrons-zulu"]
        }
    ],

    "groups": [        
        {
            "id": "waiters-french",
            "type": 1,
            "languageCode": "fr-FR",
            "txAudio": {"encoder": 25},
            "rallypoints": [{"id": "local-rp","host": {"address": "127.0.0.1","port": 7443}}]
        },
        {
            "id": "diners-english",
            "type": 1,
            "languageCode": "en-US",
            "txAudio": {"encoder": 25},
            "rallypoints": [{"id": "local-rp","host": {"address": "127.0.0.1","port": 7443}}]
        },
        {
            "id": "diners-german",
            "type": 1,
            "languageCode": "de-DE",
            "txAudio": {"encoder": 25},
            "rallypoints": [{"id": "local-rp","host": {"address": "127.0.0.1","port": 7443}}]
        },


        {
            "id": "bartender-spanish",
            "type": 1,
            "languageCode": "es-ES",
            "txAudio": {"encoder": 25},
            "rallypoints": [{"id": "local-rp","host": {"address": "127.0.0.1","port": 7443}}]
        },
        {
            "id": "patrons-italian",
            "type": 1,
            "languageCode": "it-IT",
            "txAudio": {"encoder": 25},
            "rallypoints": [{"id": "local-rp","host": {"address": "127.0.0.1","port": 7443}}]
        },
        {
            "id": "patrons-estonian",
            "type": 1,
            "languageCode": "et-EE",
            "txAudio": {"encoder": 25},
            "rallypoints": [{"id": "local-rp","host": {"address": "127.0.0.1","port": 7443}}]
        },
        {
            "id": "patrons-german",
            "type": 1,
            "languageCode": "de-DE",
            "txAudio": {"encoder": 25},
            "rallypoints": [{"id": "local-rp","host": {"address": "127.0.0.1","port": 7443}}]
        },
        {
            "id": "patrons-zulu",
            "type": 1,
            "languageCode": "zu-ZA",
            "txAudio": {"encoder": 25},
            "rallypoints": [{"id": "local-rp","host": {"address": "127.0.0.1","port": 7443}}]
        }
    ]
}

Notice how we have a different German group for our bar vs the German group for the restaurant. Even though both groups speak German, they have nothing to do with each other and, besides, they're in different places. So it wouldn't make sense to have the same group in two different sessions (or places of business if you like).

As you can see, this stuff is pretty simple. Just make sure you have the language code correct (CASE SENSITIVE) and that you're not trying to put the same group into multiple sessions; and you'll be golden.
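By the way, if you'd like to catch those mistakes before you restart ELS, here's a little sanity-check sketch. It's purely illustrative - it doesn't ship with ELS - and the field names (voiceToVoiceSessions, groups, id, languageCode) are simply the ones used in the examples above. It warns about groups referenced by a session but never defined, groups missing a languageCode, and groups that appear in more than one session.

#
# lingo.json sanity checker - illustrative sketch only, not part of ELS.
# Field names (voiceToVoiceSessions, groups, id, languageCode) are taken
# from the examples above.
#
import json
import sys

def checkLingo(fn):
    with open(fn) as f:
        cfg = json.load(f)

    definedGroups = {g['id']: g for g in cfg.get('groups', [])}
    sessionForGroup = {}
    warnings = []

    # Make sure every group referenced by a session is defined, and that no
    # group appears in more than one session
    for session in cfg.get('voiceToVoiceSessions', []):
        for gid in session.get('groups', []):
            if gid not in definedGroups:
                warnings.append("session '%s' references undefined group '%s'" % (session['id'], gid))
            if gid in sessionForGroup:
                warnings.append("group '%s' appears in sessions '%s' and '%s'" % (gid, sessionForGroup[gid], session['id']))
            sessionForGroup[gid] = session['id']

    # Make sure every group has a languageCode
    for gid, g in definedGroups.items():
        if not g.get('languageCode'):
            warnings.append("group '%s' has no languageCode" % gid)

    return warnings

if __name__ == '__main__':
    for w in checkLingo(sys.argv[1]):
        print('WARNING: ' + w)

Point it at your lingo.json before restarting - for example python3 checklingo.py /etc/engagelingod/lingo.json (the script name here is just ours). It can't verify that your language codes are valid or correctly cased, though - that's still on you.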

-proxy

Now, we here at RTS are hard-working, independent people of action. And we like to think we can do everything ourselves rather than offload hairy, scary stuff to other people.

But it turns out that there is some programming that others can just do better than us. (Hard to believe huh!?) Language stuff is one of those disciplines where we've decided to lean on others for some of the work. And that's where powerhouses like Microsoft, Amazon, Google, and the like come into play. Those folks have tons of super-smart people, billions of R&D dollars, and almost unlimited computing resources available to them. And they've all invested a great deal into language processing, machine-learning, AI, and the like. So, we figured we'd hand off some of the heavy lifting to them. But ... we also wanted to keep our code as "clean" as possible while being super flexible at the same time.

So, what we did was split off some of the logic of ELS into external applications that talk to these backend systems ("clouds" if you will) to take advantage of the cool stuff those folks have built. We did this by creating a proxy. It's really just a program that runs alongside ELS and carries out cloud-specific operations using the native interfaces of the particular backend - while keeping those varying implementations outside of the core of ELS.

This is where the proxy settings in the configuration come into play. You need only supply the address of the machine where the proxy is residing and the TCP port it's listening on. This is almost always the same machine as ELS, so we specify 127.0.0.1 as the address and 5555 as the port (that's the proxy's default listening port).

The Proxy

We have a few proxies in the works for the above-mentioned providers - and a couple of others we're not going to talk about right now. The one that's furthest along at the time of this writing - i.e. production-ready - is our proxy for Microsoft's Azure Cognitive Services. We've called it elsproxyazured.

Microsoft Azure Cognitive Services Proxy

You install the Azure proxy just like any of our other Linux software - using an installation package that you download from our artifacts repository or grab from your favorite vendor. Once installed, the proxy runs as a Linux daemon (hence the d at the end of its name) and waits for ELS to talk to it via TCP (remember from above it listens on port 5555 by default).

And just like everything else we make, you need to configure it with a configuration file. In the case of the Azure proxy, the configuration file is /etc/elsproxyazured/elsproxyazured_conf.json. Here's what it generally looks like:

{
    "settings": {
        "general":{
            "maxRunSecs":86400
        },
        "networking": {
            "listenPort": 5555
        },
        "metrics": {
            "enabled": false,
            "intervalSecs": 30,
            "directory": "/var/log/engagelingod/metrics",
            "maxFileAgeDays": 30
        },
        "azure": {
            "authentication": {
                "speech":{
                    "region": "",
                    "key": "",
                    "hosts": {
                        "stt":"",
                        "tts":""
                    },
                    "endpoints": {
                        "stt": "",
                        "tts": ""
                    }
                },
                "translator":{
                    "region": "",
                    "key": "",
                    "baseUrl": ""
                }
            },
            "transcription": {
                "silenceAppendMs": 50
            },
            "translation": {
                "requestTimeoutMs": 5000
            }
        },
        "watchdog": {
            "enabled": true,
            "intervalMs": 2000,
            "hangDetectionMs": 10000,
            "abortOnHang": true,
            "slowExecutionThresholdMs": 500
        }
    }
}
  • general.maxRunSecs : Tells the proxy how long it should run for (seconds) before shutting down and being restarted by the Linux systemd manager. The default of 86400 is 24 hours. Now, you don't really need to have the proxy restart but, because we're using binary, 3rd-party code in the proxy, we figured it's best if it restarts every now and again to get the cobwebs out. (Basically, we don't know exactly what's happening inside an external party's binary code so we get a little paranoid about things like resource leaks, runaway loops, and other nasties we don't know about.)

  • networking.listenPort : The TCP port the proxy must listen on for connections from ELS.

  • metrics : This section deals with capturing usage metrics for Azure Cognitive Services. Metrics are written to rotating text files in CSV format. See below for details of the CSV data structure.

  • metrics.enabled : Set this to true if you want to capture metrics.

  • metrics.intervalSecs : The number of seconds between records being written to the active metrics file. The default is 30. This is essentially the time window in which the metrics are captured. In other words, metrics are captured for the time period specified by this value and, if any are available when that time period ends, a record is written. If there was no activity, no record is written. A smaller time window here will give you higher-fidelity detail if you're using the CSV records to analyze the information over time - say, drawing a graph showing usage at different times of the day.

  • metrics.directory : The path to the directory where metrics files are to be written.

  • metrics.maxFileAgeDays : The number of days to keep metrics files before they are automatically removed.

  • azure.authentication : For Azure Cognitive Services, we use the Speech and Translator services. So, you're going to need authentication information for these two services. Please check with your Microsoft representative on how to obtain these - as well as the URL required for the Azure Cognitive Services Translation Service.

  • azure.authentication.speech : This section is authentication information for Azure Cognitive Speech Services. You'll often need the region and key fields but, depending on your Azure setup, you may also need settings for hosts or endpoints to connect to. This stuff varies depending on your Azure environment so, again, please get with your Microsoft representative for further information. If your Azure environment requires that you connect to particular host machines, you'll need to enter that information accordingly. The stt host is for Speech-To-Text (also known as Automatic Speech Recognition) while tts is for Text-To-Speech (also known as Speech Synthesis). On the other hand, if your environment has custom endpoints, then you'll need to fill in the corresponding endpoint values for stt and tts in the endpoints section. These two sections are generally needed if you have a custom deployment of Azure Cognitive Services such as a sovereign cloud.

  • azure.authentication.translator : Here you'll need to provide information related to Azure Cognitive Services' translation feature. Note that what is meant by translation is translating text to text. It does not mean translation of voice - it's only for text. However, for ELS to perform voice-to-voice translation, you still need this section filled out as the process of translating voice to voice includes text translation.

  • watchdog.xxxx : Leave this stuff alone unless instructed otherwise by support personnel.

Variations On A (Authentication) Theme

What we described above concerning the authentication section covers typical use-cases for authenticating with Azure Cognitive Services. But ... things can be somewhat different depending on your particular setup. Let's talk about that a little ...

Azure Cognitive Services - at least the ones we're concerned with here - are deployed in 3 ways that we know of. These are public cloud, sovereign cloud, and disconnected.

Public Cloud

In this scenario your proxy connects to the public Azure cloud - the simplest scenario of the three. Here you typically need only provide values for the region and key items under azure.authentication.speech and azure.authentication.translator - i.e. just plug in the values received from your Azure contact for region and key and you're on your way. [But don't forget the baseUrl value for the translator piece.]

Here's an example for connectivity to a public cloud (not using real values of course):

.
.
"authentication": {
    "speech": {
        "region": "westus66",
        "key": "4570857cae764cbfba4b9cc5fe9ea412",
    },
    "translator": {
        "region": "westus66",
        "key": "725a7fca3dd84a4da32c152a34737024",
        "baseUrl": "https://api.cognitive.microsofttranslator.com/translate?api-version=3.0"
    }
}
.
.
Sovereign Cloud

When you run a sovereign (aka "private") cloud, your organization is hosting Cognitive Services itself or you have a private cloud setup within Azure accessible only to your organization. In this setup your authentication for speech and translator functionality varies a little and typically involves particular values for hosts or endpoints in your speech authentication. You may or may not need values for region and/or key. All that stuff is highly dependent on how your sovereign cloud is set up by your organization and/or Azure.

.
.
"authentication": {
  "speech": {
    "region": "usgovsomeplace",
    "key": "e010910040054e31afa62dce0150afb4",
    "endpoints": {
      "stt": "wss://myspecialservice.cognitiveservices.azure.us/stt/speech/recognition/conversation/cognitiveservices/v1",
      "tts": "https://myspecialservice.cognitiveservices.azure.us/tts/cognitiveservices/v1"
    }
  },
  "translator": {
    "region": "usgovsomeplace",
    "key": "982a516e39c94ab598322fd3dcdad360",
    "baseUrl": "https://mytranslator.cognitiveservices.azure.us/translator/text/v3.0/translate?api-version=3.0"
  }
}
.
.
Disconnected

In a "disconnected" environment, speech and translator services are typically run on one or more server-class computers that are not connected to the Internet. This is a great option for those environments where reachback to a core of some sort cannot be guaranteed or where the properties of the connection (such as bandwidth, latency, security, etc.) is not appropriate for your use.

Here things get a little bit more involved because Cognitive Services operating in disconnected mode are somewhat more granular in nature when it comes to the speech services. For example: if you consider a public cloud setup you need only specify region and key and you're good. In a sovereign environment, you may well still need those two items but will likely also need values for hosts or endpoints.

But there's nothing specific to the language being processed! In other words, in both public and sovereign setups, Azure Cognitive Services can figure out what language is being processed and hide complexities away. But in disconnected mode, you may be running speech-to-text (stt) for US English on one computer, while text-to-speech (tts) for US English may be running on another box. The same applies for any other language - such as STT for French, TTS for German, and so on.

In this situation, your configuration for azure.authentication.speech needs to be more specific - language-specific.

Alright, let's say that you anticipate that you'll be doing STT and TTS for English, French, and German (remember, we're just talking about the speech stuff here, not the translator). Your configuration is going to look something like this:

.
.
"authentication": {
  "speech": {
    "en-US": {
      "endpoints": {
        "stt": "ws://10.100.1.66:7250/speech/recognition/conversation/cognitiveservices/v1",
        "tts": "http://10.100.1.66:7251/cognitiveservices/v1"
      }
    },
    "fr-FR": {
      "endpoints": {
        "stt": "ws://10.27.4.38:6810/speech/recognition/conversation/cognitiveservices/v1",
        "tts": "http://192.44.12.13:6811/cognitiveservices/v1"
      },
      "de-DE": {
        "endpoints": {
          "stt": "ws://192.41.77.3:12000/speech/recognition/conversation/cognitiveservices/v1",
          "tts": "http://192.41.77.3:12001/cognitiveservices/v1"
        }
      }
    }
  },
  "translator": {
    "baseUrl": "http://172.14.88.11:3500/translate?api-version=3.0"
  }
}
.
.

Note how the speech section now has endpoints specific to each language (it could be hosts too - it just depends on your setup) - such as en-US (US English), fr-FR (French as spoken in France), and de-DE (German as spoken in Germany). Basically we follow the same model as before but settings are made on a language-by-language basis. Also note how the values for stt and tts can vary language by language. In the example we have en-US hosted on 10.100.1.66 but on different ports. For fr-FR, speech-to-text (stt) is hosted at 10.27.4.38, while tts is hosted at 192.44.12.13. Finally, de-DE is hosted on its own machine at 192.41.77.3.

Also note that in the example above we're using non-secure WebSockets for stt (denoted by ws://), whereas tts is using http as the protocol scheme. Your setup may vary - i.e. you may have wss:// or https:// depending on your X.509 certificate setup and security configuration.

Frankly, something this hairy is probably not likely - you'd generally have all this stuff running on a single host - but we wanted to convey that each and every service for each and every language can be different if needed.

By the way, if you have the option of using a public or sovereign cloud for some of your languages in addition to your disconnected languages - say Japanese, Italian, or Spanish alongside the English/French/German above - ELS can support that too. Simply provide region/key/hosts/endpoints outside a language-specific enclosure, as you'd do for public or sovereign clouds. ELS will use those global settings as the default in the absence of a language-specific section. [Hopefully that makes sense.]
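To make that fallback concrete, here's a conceptual sketch - emphatically not the actual proxy code, just a way to picture the behavior described above. The mixed configuration reuses values from the earlier examples: public-cloud defaults (region/key) plus one disconnected, language-specific section; the little resolver prefers the language-specific settings and falls back to the global ones.

#
# Conceptual sketch of the fallback described above - NOT the proxy's actual
# implementation. Language-specific sections win; the top-level region/key
# (or hosts/endpoints) act as the default for everything else.
#
speechAuth = {
    "region": "westus66",                       # global default (public cloud)
    "key": "4570857cae764cbfba4b9cc5fe9ea412",  # global default (public cloud)
    "de-DE": {                                  # disconnected, language-specific
        "endpoints": {
            "stt": "ws://192.41.77.3:12000/speech/recognition/conversation/cognitiveservices/v1",
            "tts": "http://192.41.77.3:12001/cognitiveservices/v1"
        }
    }
}

def resolveSpeechAuth(languageCode):
    # Use the language-specific section if one exists ...
    if languageCode in speechAuth:
        return speechAuth[languageCode]
    # ... otherwise fall back to the global settings. (Treating any key
    # containing a '-' as a language code is good enough for this sketch.)
    return {k: v for k, v in speechAuth.items() if '-' not in k}

print(resolveSpeechAuth('de-DE'))   # -> the disconnected endpoints
print(resolveSpeechAuth('ja-JP'))   # -> the global region/key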

That's pretty much it as far as ELS and the Azure Cognitive Services proxy go. We'll update this document with details on our upcoming proxies for other backends as they become available.

Metrics CSV files

Metrics are written as Comma-Separated Values to text files that can be imported into a variety of processing applications such as spreadsheets and databases. Each CSV record is constructed as follows:

  • date : The UTC datestamp when the record was written.
  • time : The UTC timestamp when the record was written.
  • intervalSecs : The number of seconds preceding the time field for which the record applies.
  • type : The type of operation - transcription, translation, or synthesis.
  • info : Information about the ELS pipeline source of the record.
  • audioSecondsRead : The number of seconds of audio received from Azure Cognitive Services.
  • audioSecondsWritten : The number of seconds of audio sent to Azure Cognitive Services.
  • charsRead : Number of UTF-8 characters received from Azure Cognitive Services.
  • charsWritten : Number of UTF-8 characters sent to Azure Cognitive Services.
  • audioBytesRead : Number of audio bytes received from Azure Cognitive Services.
  • audioBytesWritten : Number of audio bytes written to Azure Cognitive Services.
  • textBytesRead : Number of text bytes received from Azure Cognitive Services.
  • textBytesWritten : Number of text bytes sent to Azure Cognitive Services.

Generally, the most interesting values are audio values in seconds and character counts. The more detailed versions - the values in bytes - are simply the low-level traffic being passed between ELS and Azure.

NOTE: There's a big difference between text bytes and characters. That's because we're dealing with a character encoding scheme called UTF-8 wherein, for straightforward character sets, 1 byte is used to represent individual characters like A or B or C. For more sophisticated scripts - say Arabic (عربي) or Chinese (汉语) - more than a single byte is used per character. Hence, if you want to know how many bytes of text were exchanged, look at the textBytes... values. But if you want to know the number of valid characters that were exchanged, look at the chars... values.
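If you want to see that distinction for yourself, a couple of lines of Python make the point using the same sample text:

# -*- coding: utf-8 -*-
# Character counts vs UTF-8 byte counts - the same distinction as the
# chars... and textBytes... fields in the metrics records.
for s in ['ABC', 'عربي', '汉语']:
    # e.g. "ABC : 3 characters, 3 UTF-8 bytes"
    #      "عربي : 4 characters, 8 UTF-8 bytes"
    #      "汉语 : 2 characters, 6 UTF-8 bytes"
    print('%s : %d characters, %d UTF-8 bytes' % (s, len(s), len(s.encode('utf-8'))))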

Here's an example showing a conversation between someone on the p-english ELS pipeline, communicating with someone on the p-french pipeline.

NOTE: In this example we've inserted blank lines and explanations inside square brackets ([]). That's simply for purposes of this document.

date,time,intervalSecs,type,info,audioSecondsRead,audioSecondsWritten,charsRead,charsWritten,audioBytesRead,audioBytesWritten,textBytesRead,textBytesWritten
2023-12-19,17:13:26,30,transcription,"p-english",0.000,0.660,0,0,0,10560,0,0

2023-12-19,17:13:56,30,transcription,"p-english",0.000,3.500,42,0,0,56000,42,0
  [
    In the 30-second window ending at 17:13:56, transcription operations on the p-english pipeline ... :
        ... sent 3.5 seconds of audio to Azure (56,000 bytes)
        ... received 42 characters from Azure (42 bytes)
  ]

2023-12-19,17:13:57,30,transcription,"p-french",0.000,0.630,0,0,0,10080,0,0
2023-12-19,17:14:26,30,translation,"p-english",0.000,0.000,47,42,0,0,51,42
2023-12-19,17:14:27,30,transcription,"p-french",0.000,16.510,104,0,0,264160,109,0
2023-12-19,17:14:56,30,transcription,"p-english",0.000,13.890,92,0,0,222240,92,0

2023-12-19,17:14:57,30,translation,"p-french",0.000,0.000,79,104,0,0,79,109
    [
        In the 30-second window ending at 17:14:57, translation operations on the p-french pipeline ... :
        - sent 79 characters to Azure (79 bytes)
        - received 104 characters from Azure (109 bytes)
    ]

2023-12-19,17:15:26,30,transcription,"p-english",0.000,13.560,201,0,0,216960,201,0

2023-12-19,17:15:27,30,synthesis,"p-french",8.337,0.000,0,79,133400,0,0,79
    [
        In the 30-second window ending at 17:15:27, synthesis operations on the p-french pipeline ... :
        - received 8.337 seconds of audio from Azure (133,400 bytes)
        - sent 79 characters to Azure (79 bytes)
    ]

2023-12-19,17:15:56,30,translation,"p-english",0.000,0.000,325,293,0,0,355,293
2023-12-19,17:15:57,30,transcription,"p-french",0.000,12.020,13,0,0,192320,14,0
2023-12-19,17:16:26,30,transcription,"p-english",0.000,2.640,7,0,0,42240,7,0
2023-12-19,17:16:27,30,transcription,"p-french",0.000,11.740,55,0,0,187840,56,0
2023-12-19,17:16:56,30,transcription,"p-english",0.000,5.820,14,0,0,93120,14,0
2023-12-19,17:16:57,30,translation,"p-french",0.000,0.000,71,68,0,0,71,70
2023-12-19,17:17:26,30,transcription,"p-english",0.000,2.670,7,0,0,42720,7,0
2023-12-19,17:17:27,30,synthesis,"p-french",8.113,0.000,0,71,129800,0,0,71
2023-12-19,17:17:57,30,translation,"p-english",0.000,0.000,32,28,0,0,32,28
2023-12-19,17:17:58,30,transcription,"p-french",0.000,11.250,28,0,0,180000,28,0
2023-12-19,17:18:27,30,transcription,"p-english",0.000,4.960,64,0,0,79360,64,0
2023-12-19,17:18:28,30,translation,"p-french",0.000,0.000,36,28,0,0,36,28
2023-12-19,17:18:57,30,translation,"p-english",0.000,0.000,81,64,0,0,92,64
2023-12-19,17:18:58,30,synthesis,"p-french",5.638,0.000,0,36,90200,0,0,36
2023-12-19,17:19:27,30,transcription,"p-english",0.000,8.660,78,0,0,138560,78,0
2023-12-19,17:19:28,30,transcription,"p-french",0.000,3.040,15,0,0,48640,15,0
2023-12-19,17:19:57,30,translation,"p-english",0.000,0.000,93,78,0,0,97,78
2023-12-19,17:19:58,30,transcription,"p-french",0.000,11.910,46,0,0,190560,47,0
2023-12-19,17:20:27,30,transcription,"p-english",0.000,22.120,56,0,0,353920,56,0
2023-12-19,17:20:28,30,translation,"p-french",0.000,0.000,52,61,0,0,52,62
2023-12-19,17:20:57,30,transcription,"p-english",0.000,0.000,214,0,0,0,214,0
2023-12-19,17:20:58,30,synthesis,"p-french",6.900,0.000,0,52,110400,0,0,52
2023-12-19,17:21:27,30,translation,"p-english",0.000,0.000,350,270,0,0,356,270
2023-12-19,17:21:57,30,synthesis,"p-english",66.950,0.000,0,928,1071200,0,0,983
.
.

Using The Data

The CSV files created by ELS can be imported into your favorite spreadsheets or databases for processing. But if you're looking to get a quick idea of totals in your CSV files, here's a simple little Python script (elsmp.py) you can run to total up the key items. Use it in the format:

$ python elsmp.py <metrics_file> [<metrics_file>] [...]
#
# Engage Linguistics Metrics Processor
# Copyright (c) 2023 Rally Tactical Systems, Inc.
#

from __future__ import print_function
import sys
import csv
from datetime import datetime

appVersion = '0.1'
latestTimestamp = datetime.strptime("1970-01-01 01:01:01", "%Y-%m-%d %H:%M:%S")
earliestTimestamp = datetime.strptime("2100-01-01 01:01:01", "%Y-%m-%d %H:%M:%S")
totalFilesProcessed = 0
totalCsvRecordsProcessed = 0
totalAudioSecondsRead = 0
totalAudioSecondsWritten = 0
totalCharsRead = 0
totalCharsWritten = 0


# --------------------------------------------------------------------------
def processFile(fn):
    global latestTimestamp
    global earliestTimestamp
    global totalFilesProcessed
    global totalCsvRecordsProcessed
    global totalAudioSecondsRead
    global totalAudioSecondsWritten
    global totalCharsRead
    global totalCharsWritten

    totalFilesProcessed = (totalFilesProcessed + 1)

    recordCount = 0

    with open(fn, newline='') as csvfile:
        csvData = csv.reader(csvfile, delimiter=',', quotechar='"')
        for row in csvData:        
            recordCount = (recordCount + 1)
            if recordCount > 1:
                totalCsvRecordsProcessed = (totalCsvRecordsProcessed + 1)            

                csvDate = row[0]
                csvTime = row[1]                
                dateStamp = datetime.strptime(csvDate + ' ' + csvTime, "%Y-%m-%d %H:%M:%S")
                if dateStamp > latestTimestamp:
                    latestTimestamp = dateStamp
                if dateStamp < earliestTimestamp:
                    earliestTimestamp = dateStamp

                csvAudioSecondsRead = float(row[5])
                csvAudioSecondsWritten = float(row[6])
                csvCharsRead = int(row[7])
                csvCharsWritten = int(row[8])

                totalAudioSecondsRead = (totalAudioSecondsRead + csvAudioSecondsRead)
                totalAudioSecondsWritten = (totalAudioSecondsWritten + csvAudioSecondsWritten)

                totalCharsRead = (totalCharsRead + csvCharsRead)
                totalCharsWritten = (totalCharsWritten + csvCharsWritten)


# --------------------------------------------------------------------------
def showSyntax():
    print('usage: python elsmp.py metrics_file [...]')


# --------------------------------------------------------------------------
if __name__ == '__main__':
    print('----------------------------------------------------')
    print('ELS Metrics Processor v%s' % (appVersion))
    print('Copyright (c) 2023 Rally Tactical Systems, Inc.')
    print('----------------------------------------------------')

    if len(sys.argv) < 2:
        showSyntax()
        sys.exit()
    else:
        for x in range(1, len(sys.argv)):
            processFile(sys.argv[x])
    
    print('across %s hours from %s to %s' % (str(round((latestTimestamp - earliestTimestamp).total_seconds() / 3600, 2)), str(earliestTimestamp), str(latestTimestamp)))
    print('')
    print('   audio seconds received .. : ' + str(round(totalAudioSecondsRead, 2)))
    print('   audio seconds sent ...... : ' + str(round(totalAudioSecondsWritten, 2)))
    print('   characters received ..... : ' + str(totalCharsRead))
    print('   characters sent ......... : ' + str(totalCharsWritten))
    print('')

Here's an example run in the directory above the metrics directory:

% python elsmp.py metrics/*
----------------------------------------------------
ELS Metrics Processor v0.1
Copyright (c) 2023 Rally Tactical Systems, Inc.
----------------------------------------------------
across 34.4 hours from 2023-12-19 17:13:26 to 2023-12-21 03:37:15

   audio seconds received .. : 125.45
   audio seconds sent ...... : 188.74
   characters received ..... : 2860
   characters sent ......... : 2860
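And if totals aren't enough - say you want the usage-over-time view mentioned back in the metrics settings - here's a variation on the same theme. It's just a sketch (it doesn't ship with ELS, and the script name is ours) that uses the column layout shown earlier to bucket the audio seconds sent to Azure by hour of day:

#
# Sketch: bucket audio seconds sent to Azure by hour of day, using the same
# CSV layout shown above (column 1 = time, column 6 = audioSecondsWritten).
# Illustrative only - not shipped with ELS.
#
import csv
import sys
from collections import defaultdict

perHour = defaultdict(float)

for fn in sys.argv[1:]:
    with open(fn, newline='') as csvfile:
        reader = csv.reader(csvfile, delimiter=',', quotechar='"')
        next(reader, None)                  # skip the header row
        for row in reader:
            hour = row[1].split(':')[0]     # "17:13:56" -> "17"
            perHour[hour] += float(row[6])  # audioSecondsWritten

for hour in sorted(perHour):
    print('%s:00 - %8.2f audio seconds sent' % (hour, perHour[hour]))

Run it the same way as elsmp.py above - for example, python hourly.py metrics/*.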

Supported Languages

Different backends (connected via the proxy) have varying degrees of support for languages. Here's what we have so far.

Microsoft Azure Cognitive Services:

https://learn.microsoft.com/en-us/azure/cognitive-services/speech-service/language-support?tabs=stt#supported-languages

But Wait, There's More

We're betting that you're looking at the ELS lingo.json file above and thinking to yourself that most of that stuff is really a template with just the definition of sessions, their member groups and, for each group, what its ID and language code is. You're also maybe wondering how that can be made less of a hassle to configure.

Well we have good news... Go check out our lmc bash script. It's a cool little meta-compiler that takes a template "meta" file and generates the JSON you need. For example, the JSON from above for our restaurant and bar setups would be written in a YAML-like syntax and given to the lmc script - resulting in JSON that looks pretty much like the above.

Here's an example of that:

+bistro-le-souffle
    -waiters-french:fr-FR
    -diners-english:en-US
    -diners-german:de-DE

+las-taberna
    -bartender-spanish:es-ES
    -patrons-italian:it-IT
    -patrons-estonian:et-EE
    -patrons-german:de-DE
    -patrons-zulu:zu-ZA

Much cleaner and easier don't you think!? Check it out, you'll like it - https://github.com/rallytac/pub/tree/main/misc/lmc.