Engage Latency
The Realities Of Latency
People often ask us what kind of latencies they can expect when using an Engage-powered application. They want to know what the user experience is when someone talks and someone else hears. In particular, they want to know just how long it takes from the moment someone says something to the moment the other person hears them.
The answer: “It depends…”
It depends on a great many things, some of which are straightforward to understand, and others that require explanation. The most typical consideration is the network that sits between the users, and it's always an area of great debate, design, and architecting. But there's a lot more to it than the network.
So, in this article we'll explore each and every step along the way that voice takes from the moment it leaves someone's mouth to the moment it reaches the other person's ears. We're really going to get into the weeds on this one, so you'd best find a nice quiet spot to read through it. Maybe get a drink or snack while you're about it.
Latency Measurement
The first thing we need to talk about is how latency is measured and what people mean when they measure and quote latency figures. Very often we see folks quote ridiculously low, almost impossible numbers for latency, only to find out that they're talking about how long it takes for network traffic to get around. This is by no means a true reflection of latency - at least not from the perspective of people talking to each other. What needs to be measured - and quoted - is the actual time from the exact moment someone says something to the actual moment the other person hears them.
We’ll get into hardcore measurements and such later on in this article but we’ll first cover all the steps involved to make voice communication over a network possible. For this, we’re going to reference the following diagram that shows two entities (devices) labeled “TX” and “RX” respectively. TX has a microphone connected to it and the transmitting user is speaking into this microphone. RX is the receiving end and has a speaker connected to it. These entities could be desktop computers, mobile phones, tablets, and so on - basically any computing gear that Engage will run on (which is pretty much anything these days).
You’ll note that each entity has three large boxes - two dark gray and one light gray. The dark gray boxes represent the combination of device hardware (such as a laptop or phone) and the operating system running on it (such as Windows or Android). The light gray box is the Engage-powered application running on the device.
Inside each of these “category boxes” there are smaller boxes, each representing a functional component employed along the way. Each of these is numbered in red for the transmitting side and green for the receiving side.
The network in between the devices is represented as a (now familiar) cloud and is labeled with a black-backed number (to represent the black hole the IP network can sometimes be).
TX - Sending Audio
1 - Microphone Hardware
The microphone hardware comprises the physical equipment used to convert analog audio (i.e. a person’s voice) into an electrical signal fed into the transmitting device’s audio input connector. Latency involved in this step is present but is so small that it has little bearing on this discussion.
2 - Microphone Driver
Once the electrical signal is received via the connector, a software driver, typically provided by the manufacturer of the microphone or the device itself, converts the electrical signal into digital data in the form of "samples". While the number of samples varies based on the desired audio quality, Engage typically requests 16,000 samples per second from the driver. (This is known as "wideband" sampling, which results in significantly improved audio quality.) The amount of latency introduced at this point can vary quite wildly: you can expect anywhere from sub-millisecond to tens or even hundreds of milliseconds depending on the quality of the audio hardware itself and the quality of the driver software.
3 - OS Interface
The OS interface is a driver of sorts that lives between the hardware driver and the application, and is responsible for shifting the samples from the hardware driver to the application. Different operating systems have different approaches here so, once again, latencies can vary significantly. In the case of Apple devices such as iPhones, Macs, and iPads, Apple has full control of the hardware, the driver, and, of course, the operating system (OSX or iOS). Thanks to this level of control, you can expect excellent performance - generally sub-millisecond latency from the moment the electrical signal enters the device to when a sample can be presented to the application. Windows, Linux, and Android systems, on the other hand, are a combination of hardware and software from a variety of vendors, so the rule of "your mileage may vary" will certainly apply based on the combination of hardware and OS (and version) you're using.
4 - Capture Queue
Now, what we've spoken about thus far is what it takes to deliver 1 sample to the application. But we can't work with 1 sample at a time (remember that we're working with 16,000 samples every second). So, what happens is that we need to buffer up a number of samples before we can actually use them. This buffering is done in a "capture queue" for the microphone, which consists of buffers, each containing a number of samples. Engage tries to shorten this queue as much as possible, but the minimum that can realistically be achieved is a buffer size of 10 milliseconds - 160 samples - with at least 3 buffers to cater for data transfer delays in the operating system and the device's motherboard. But even 10 milliseconds per buffer doesn't always work well on devices that have other things to do - like browse the web, read email, watch videos, etc. So, to ensure audio quality, Engage generally uses 3 buffers, each of 20 milliseconds (320 samples). [Actually, 20 milliseconds is an excellent number because the encoder described in the next step generally can only operate in 20 millisecond chunks anyway.]
But … just because Engage wants 3 buffers of 20 milliseconds each doesn't mean the operating system will honor that. For example, on variants of Android hardware with different versions of the Android operating system, you can expect buffer sizes of sometimes 1, 2, 3, or 10 milliseconds. And at other times Android will give you buffers of 80, 250, and even 480 milliseconds at a time!!! [It's REALLY hard to work in such a non-deterministic environment.]
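To make the buffer arithmetic concrete, here's a tiny sketch (plain illustrative Python, not Engage's actual code) that converts buffer durations into sample counts at the 16 kHz wideband rate and shows how much audio a full capture queue can be holding at once:

```python
# Illustrative capture-queue arithmetic at 16 kHz "wideband" sampling.
# This is not Engage's implementation - just the numbers from above.

SAMPLE_RATE_HZ = 16_000

def samples_per_buffer(buffer_ms: int) -> int:
    """How many samples land in one capture buffer of the given duration."""
    return SAMPLE_RATE_HZ * buffer_ms // 1000

def worst_case_queue_ms(buffer_ms: int, buffer_count: int) -> int:
    """If every buffer in the queue is full, this much audio is waiting."""
    return buffer_ms * buffer_count

print(samples_per_buffer(10))          # 160 samples
print(samples_per_buffer(20))          # 320 samples
print(worst_case_queue_ms(20, 3))      # 60 ms - Engage's typical setup
print(worst_case_queue_ms(480, 3))     # 1440 ms - a pathological Android case
```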
5 - Encoder
Once buffers of samples come out of the capture queue, they make their way into the encoder which converts the samples into a “description” of the audio which will be transmitted to the other end. There’s a LOT of scary math happening at this stage of the game - math that we won’t go into here (thankfully!). Thanks to today’s high-performance hardware though, the math involved does not introduce a delay significant enough to be concerned with. However, because of how the math works, the encoder typically has buffering of its own. This buffering can be up to 20 milliseconds. And that, of course, introduces 20 milliseconds of latency.
6 - Packetization
Once the encoder has finished doing its work, the next stage is "packetization", where the encoder's output is placed into a network packet. Here we have some control over how much audio (in milliseconds) is placed into the packet. If we choose to place 20 milliseconds, we have to wait until there's 20 milliseconds available from the encoder before we can move forward. If we choose to place 40 milliseconds, then we'll have to wait for 40 milliseconds, and so on. This, of course, has a significant impact on latency because we're effectively forcing a delay (but never lower than 20 milliseconds). So we might immediately decide to always do 20 millisecond "framing" as we call it. But … that means we're going to transmit more packets per second and that's going to consume network bandwidth. So we have to choose carefully. See Engage Bandwidth for a detailed discussion about bandwidth utilization.
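To see that tradeoff in numbers, here's a quick sketch (illustrative Python, not Engage code) of how the framing size drives both the packet rate and the forced packetization delay:

```python
# Illustrative framing tradeoff: larger framing means fewer packets per
# second (less bandwidth overhead) but a longer forced wait before each
# packet can be sent.

def packets_per_second(framing_ms: int) -> float:
    return 1000.0 / framing_ms

for framing_ms in (20, 40, 60):
    print(f"{framing_ms} ms framing -> {packets_per_second(framing_ms):.1f} "
          f"packets/sec, +{framing_ms} ms packetization latency")
# 20 ms -> 50.0 packets/sec; 60 ms -> ~16.7 packets/sec
```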
7 - Encryptor
We're almost ready to send the packet on its way but there's one more step - encryption. It's an optional step, based on configuration for the particular Engage group, and generally does not introduce latency significant enough to take into consideration. But it's mentioned here for completeness.
8 - Output Network Queue
Whether the packet goes through the encryptor or bypasses it for unencrypted streams, the next step is to submit the packet for transmission to the network. Engage does so through a network “writer” thread (don’t worry about what a thread is) that is optimized for the highest level of performance. While this step almost never incurs latency (it operates at nanosecond speed), holdups in the network stack further down could cause packets to queue a little here - maybe a millisecond or two once in a blue moon.
9 - OS Network Interface
Our packet formally "leaves" Engage onward to the network by being passed into the operating system's network interface layer and then on to the driver and hardware. In much the same way that audio came in from the microphone, through a vendor-provided driver, and through the OS interface, the packet follows a similar path to the network.
Just as with the variations described above for the microphone, different operating systems behave very differently when it comes to networking. Again, on Apple devices, you can expect excellent performance with little latency incurred at the OS network interface layer. On Windows systems, you’ll likely encounter pretty good performance on desktop-class systems and excellent performance on server-class systems. On Linux, networking really rocks and you should get terrific performance across the board. On Android, though, your mileage will (again) vary based on the hardware manufacturer and Android version.
10 - Network Driver
The network driver is responsible for passing our packet into the hardware. Just as with the microphone described above, the quality of the driver and the quality of the hardware go hand-in-hand in determining how quickly (and reliably) the data is transferred into the network hardware. You may have some latency introduced at this point, or almost none at all. It depends ...
11 - Network Hardware
This is where the rubber finally meets the road. (Or, more accurately, "where the bits meet the wire".) The network hardware may be an Ethernet network interface card, a WiFi or LTE radio, or some other magic hardware that converts bits into electrical signals or RF transmissions to send the data over the network. Clearly the type of transport (wired, wireless), signal complexity, waveform busyness state, and a bunch of other factors come into play at this point. All are guaranteed to introduce some degree of latency.
OK, let's take a breather …. And, while breathing, consider that all we've done so far is capture audio from the microphone and hand it to the network - and we've probably already racked up around 80 milliseconds of latency thanks to microphone buffer sizes, packetization considerations, and delays caused by hardware and drivers. Fun stuff!!
12 - IP Network
This one is fraught with complexity, frankly. The type of network you have, how far your packets are going, what they're going through, the mediums and subnets they're traversing, and all the other traffic to contend with - that's a field of study of its own and we won't get into it here. Suffice it to say that your network could be really amazing or really terrible. And it can swing between those two extremes at the drop of a hat.
The kinds of latencies you can expect on your network will definitely affect overall latency (and therefore user experience) so it’s highly recommended that you pay very careful attention to this part of the process.
Also, while the raw speed of your network is vital, the quality of the network is equally important. What we mean by this is that the network must not only transport the packets in a timely fashion - it must also do so as error-free as possible. For example, if the network drops packets due to traffic load, duplicates packets, or corrupts them on their way, the receiving end is going to have to deal with that. [We'll talk about that later on when we discuss jitter buffers.]
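To give a feel for what "deal with that" means, here's a toy sketch (illustrative Python; real receivers, Engage's included, are far more involved and must also handle things like 16-bit sequence number wraparound and corruption checks) of classifying arriving packets by their sequence numbers:

```python
# Toy receive-side classifier for packet loss, duplication, and
# reordering based on RTP-style sequence numbers. Purely illustrative.

def classify(sequence_numbers):
    """Yield (seq, verdict) for each arriving sequence number."""
    expected = None     # the next sequence number we hope to see
    seen = set()
    for seq in sequence_numbers:
        if seq in seen:
            yield seq, "duplicate"
            continue
        seen.add(seq)
        if expected is None or seq == expected:
            yield seq, "ok"
            expected = seq + 1
        elif seq > expected:
            yield seq, f"gap: {seq - expected} packet(s) missing"
            expected = seq + 1
        else:
            yield seq, "reordered (arrived late)"

for seq, verdict in classify([1, 2, 4, 3, 5, 5, 8]):
    print(seq, verdict)
```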
RX - Receiving Audio
13 - Network Hardware
The network hardware on the receiving end operates in reverse of what we saw above in that it pulls in bits from the wire (or the air in the case of a wireless interface) and passes them up to the network driver. The same issues and concerns apply here as they do for network hardware on the transmit side, so, see above.
14 - Network Driver
Just as above, but in reverse, the receiving network driver receives data from the network hardware, packages it up, and sends it upward to the operating system’s network interface. As before, you need to cater for latency introduced at this layer due to hardware manufacturer, driver quality and so on.
15 - OS Network Interface
Guess what!! Pretty much the same applies here as it does for the transmitting side's OS network interface. We won't bore you with a regurgitation of the discussion except to say that it's at this point that the packet has finally made its way into Engage for processing.
16 - Input Network Queue
Engage handles inbound packets in the same way it handles outbound stuff - using dedicated, high-performance threads. Again, don’t worry about what a thread is, except that this thing runs super-fast and is unlikely to introduce latency.
17 - Decryptor
The decryptor is the inverse of the encryptor described above and is brought into play only if the data is encrypted. Either way, almost no latency is introduced at this point, but we're mentioning it in the interests of completeness of description.
18 - Decoder
The decoder is, guess what, the inverse of the encoder on the transmit side. But, unlike the encoder, it generally does not impose a delay. Rather, it decodes the “description” of the audio that was created by the encoder and produces audio samples to match that description. This is called audio synthesis and is essentially a reconstruction of the audio that was input to the encoder on the transmitting side.
Time for a digression here … If it's not apparent: when you talk to someone over a digital system like Engage (or pretty much any other voice system on Earth today), the voice you hear is not them at all. Rather, the "voice" you hear is a mathematical reconstruction of their voice as heard by an algorithm on the other side. Kinda cool (and creepy) huh!?
19 - Jitter Buffer
Alright, here’s where things can get really really interesting. And the explanation can get long and laborious. So we won’t spend too much time on what a jitter buffer (technically a “de-jitter buffer”) is or the frightening things that it does.
If you want to know more about jitter buffers, a good place to start (as always) is Wikipedia.
Real quick, imagine that you need to water the plants in your garden and they require a smooth, steady flow of water. Any interruptions in the flow and some plants will not get watered, and any overly powerful bursts of water will damage them.
Also, imagine that your good-for-nothing friend, partner, or child is purposely messing with you by jumping on the hosepipe - causing the flow to stop at times, followed by a huge blast of water right after. Clearly you need to be inventive to smooth this out (short of yelling at the perpetrator of this evil deed).
So, what you do is get yourself a nice bucket and make a hole in the bottom. A hole that will allow a smooth stream of water out, regardless of how much water is in the bucket (unless it’s empty of course). Then, you block the hole with your finger for a short time and allow water to flow (bursty and all) into the top of the bucket. When you determine that you have enough water in the bucket for a smooth, sustained flow of water, you remove your finger and the water flows out smoothly. Cool trick, huh!?
Ideally the burstiness on the hose stays consistent, so the calculation you made at the outset to determine the height of the water in the bucket works great. But if the burstiness is not consistent, your bucket could run dry and you'll have to block it again for a while; or you'll waste water if your bucket overflows. Figuring out what the height of the water in the bucket should be is a black art, as you can imagine.
Let’s add one more item of complexity … this bucket is moving on a rail over your plants and driven by a motor you cannot stop. So, if you have no water in the bucket, or if your finger is plugging the hole, those plants the bucket is passing over will not get watered.
This is essentially what a jitter buffer is. It’s a queue (the bucket) in Engage that needs to supply a steady stream of audio samples (water) to the speaker system (the plants) as consistently as possible with the least amount of interruptions. The audio samples are coming into the system enclosed in packets delivered over a network and hardware infrastructure that we have little control over (the hosepipe). The good-for-nothing perpetrator messing with us is the myriad hardware and software elements in the pathway from the transmitter to us; over which we have little to no control. The best we can do is to deal with what’s been given to us, in the fashion given to us, and water our plants (provide our speakers with audio samples) as smoothly as possible.
There's a variety of jitter buffer implementations in industry today and their performance varies significantly based on the experience of the folks developing them and the intended use-case. With Engage, we've engineered our jitter buffer to adapt as quickly as possible to changing network conditions - especially on so-called disadvantaged networks where packet loss, corruption, delays, and other nasty things are commonplace. To that end, we have numerous settings for the jitter buffer that can be tuned with "hints" from developers and network administrators.
With this in mind, the jitter buffer does, purposefully, introduce latency to provide the best possible audio experience for users. Our default is a 100 millisecond minimum plus a value calculated on every packet to cater for burstiness on the network (technically known as "interarrival packet delay compensation") along with other things such as packet reordering, loss, duplication, and corruption.
Now, this minimum of 100 milliseconds is already substantial and, when we add the compensation value, it can really push the latency up. So … we have some tricks up our sleeve to reduce the perceived latency for the user by "shaving" our jitter buffer periodically. [Sadly, we can't go into detail on that concept because it's a kinda secret thing. But suffice it to say that jitter shaving is pretty cool and results in excellent audio quality with the minimum possible latency.]
Shaving aside, let’s assume we get into a worst case scenario and the jitter buffer is holding on to 150 milliseconds of audio at any one time. [This is just a made-up number because it changes all the time, but let’s go with 150 milliseconds.] We can say that the jitter buffer is going to introduce 150 milliseconds of latency - ouch!!
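For a rough intuition of how the buffer depth tracks burstiness, here's a toy model (illustrative Python; the smoothing factor, multiplier, and everything else here are invented for the example and are emphatically not Engage's algorithm):

```python
# Toy de-jitter buffer depth model: a fixed minimum plus a compensation
# term driven by how far packet interarrival times stray from the
# expected spacing. All constants here are made up for illustration.

MIN_DEPTH_MS = 100   # the default minimum mentioned above

def target_depth_ms(interarrival_ms, expected_ms=20, smoothing=0.9):
    """Estimate how much audio to hold, given observed packet gaps."""
    jitter = 0.0
    for gap in interarrival_ms:
        # Smoothed estimate of deviation from the expected 20 ms spacing.
        jitter = smoothing * jitter + (1.0 - smoothing) * abs(gap - expected_ms)
    return MIN_DEPTH_MS + 4.0 * jitter

steady = [20] * 50                   # a well-behaved network
bursty = [5, 5, 90, 5, 5, 80] * 10   # packets arriving in clumps
print(f"steady: {target_depth_ms(steady):.0f} ms held")   # 100 ms
print(f"bursty: {target_depth_ms(bursty):.0f} ms held")   # considerably more
```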
20 - Rendering Queue
The rendering queue is a set of buffers between the jitter buffer and the operating system interface to the audio hardware - the speaker in this case. Just as with the capture queue on the transmit side, we need to deliver blocks of audio samples to this interface in a fashion that the operating system likes. And just as with the capture side, mileage varies based on hardware manufacturer, operating system, and so on. Generally, though, you're probably going to encounter anywhere from 20 - 100 milliseconds of latency at this stage depending on the gear you're running on.
21 - OS Interface
We’re clearly keeping with the pattern of working in reverse. Here, samples come out of the rendering queue and make their way to the speaker driver through the OS interface. Just like with the capture side, it’s hard to give a number here other than to say that whatever the number is can be anywhere from sub-millisecond and up depending on hardware and drivers.
22 - Speaker Driver
Almost there … Samples coming from the OS interface are passed through the driver to the speaker hardware. Again, mileage varies depending on manufacturer.
23 - Speaker Hardware
Finally!!! The speaker driver pushes the samples into the hardware which converts them to electrical signals that travel to the speaker.
And … voila! - your voice comes out the other side!
By The Way ... At 20 milliseconds framing, all of this is happening 50 times a second, for every second that you talk!!!
Let’s Take A Swag
Alright, let’s see if we can use our new-found knowledge to theorize what latency could look like. We'll assign an estimated latency value (in milliseconds of course) for each step along the way. Sometimes that value will be so small that we’ll assign zero. Other times we’ll have a reasonably good idea. But we’ll also have times where we have no real idea of what it’ll look like.
Just so you know: most of the numbers you see in the table below are based on experience from thousands of deployments we've been involved in over multiple decades. So, it's not really a swag for everything. Most of these numbers are pretty representative of what things look like in the real world. But there are some places where we're really just making this up as we go because it's honestly impossible to provide real numbers.
Item | Description | Estimated Latency (ms) | Comments |
---|---|---|---|
1 | Microphone Hardware | 0 | We'll assume the hardware has no holdups |
2 | Microphone Driver | 50 | Very much platform dependent, 50 ms is not too bad for modern gear |
3 | OS Interface | 5 | Just a little bit here generally |
4 | Capture Queue | 45 | Very much platform dependent and varies greatly - we'll just use a thumb-suck number of 45 ms |
5 | Encoder | 20 | Pretty much guaranteed to have some coding delay |
6 | Packetization | 60 | 60 ms is the default for Engage - we'll go with that |
7 | Encryptor | 0 | Let's assume no holdup here |
8 | Output Network Queue | 0 | Almost never any holdups here |
9 | OS Network Interface | 5 | We'll just take a guess here - it depends on hardware and OS platform |
10 | Network Driver | 0 | Some drivers are pretty good, some not - we'll take a stab at it though |
11 | Network Hardware | 0 | We'll be forgiving and assume our network hardware is awesome - it usually isn't though |
12 | IP Network | 10 | *A black hole of guesswork - see below |
13 | Network Hardware | 0 | We'll be forgiving and assume our network hardware is awesome - it usually isn't though |
14 | Network Driver | 0 | Some drivers are pretty good, some not - we'll take a stab at it though |
15 | OS Network Interface | 5 | We'll just take a guess here - it depends on hardware and OS platform |
16 | Input Network Queue | 0 | Almost never any holdups here |
17 | Decryptor | 0 | Let's assume no holdup here |
18 | Decoder | 0 | Nothing here |
19 | Jitter Buffer | 160 | A big but necessary latency - remember how you watered your plants! |
20 | Rendering Queue | 45 | Very much platform dependent and varies greatly - we'll just use a thumb-suck number of 45 ms |
21 | OS Interface | 5 | Just a little bit here generally |
22 | Speaker Driver | 50 | Very much platform dependent, 50 ms is not too bad for modern gear |
23 | Speaker Hardware | 0 | We'll assume the hardware has no holdups |
Add all this up and we get 460 milliseconds!
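If you'd like to play with the model yourself, the table boils down to a simple sum (the estimates below are copied straight from the table; adjust them to match your own gear):

```python
# The swag table as a tweakable model. Change any estimate to see how
# your own hardware, OS, and network assumptions move the total.

estimated_latency_ms = {
    "microphone hardware": 0,        "microphone driver": 50,
    "os interface (capture)": 5,     "capture queue": 45,
    "encoder": 20,                   "packetization": 60,
    "encryptor": 0,                  "output network queue": 0,
    "os network interface (tx)": 5,  "network driver (tx)": 0,
    "network hardware (tx)": 0,      "ip network": 10,
    "network hardware (rx)": 0,      "network driver (rx)": 0,
    "os network interface (rx)": 5,  "input network queue": 0,
    "decryptor": 0,                  "decoder": 0,
    "jitter buffer": 160,            "rendering queue": 45,
    "os interface (render)": 5,      "speaker driver": 50,
    "speaker hardware": 0,
}

print(sum(estimated_latency_ms.values()), "ms")   # 460 ms
```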
It's a scary number when you're expecting latencies of 100 - 150 milliseconds, as many in industry will have you expect. But it's just the truth - ain't no two ways about it. And, actually, this is rather optimistic because we've assigned some pretty low numbers for some items. For example, we mentioned that some Android devices will deliver microphone audio in frightening chunks of 480 milliseconds at times. That alone is already higher than the entire end-to-end scenario we've estimated and blows our estimation model out of the water.
A word about the IP network black hole
As we'd mentioned, IP network design and deployment is a complex field of study which we won't delve into here. Suffice it to say that it's complex and fraught with inaccuracies and non-deterministic behavior unless you've got each and every one of your ducks in a row, wrangled by some expert duck cowboys and cowgirls.
By way of illustrating this: in our testing described below, we used a consumer-grade Netgear AC1000 WiFi access point as our network and only had our test gear connected to it. There was no other traffic present except our test traffic.
We conducted our tests and got our numbers. Then, just for giggles, we did all the tests again but, this time, on a Linksys Velop WiFi network that had dozens of devices connected and huge amounts of traffic - file downloads and uploads, video streaming, music, and whatnot. Of course, we fully expected the latencies on the shared Linksys network to be significantly higher than those on the dedicated Netgear network.
We were wrong!
The Linksys network - even with all its traffic - consistently performed a whole lot better than the Netgear network. In fact, the Linksys network consistently delivered packets 120 milliseconds FASTER than the Netgear network. You read that right - 120 milliseconds!!! (And we've assigned 10 milliseconds above for our network transport as a guesstimate.) This tells us at least three things: first, different manufacturers make different quality gear (duh!). Second, a network devoid of other traffic does not guarantee performance in and of itself. Third, and most importantly:
YOUR MILEAGE IS GUARANTEED TO VARY!
Now For The Numbers
Let’s get into how we measure latency. And, just to be clear, we’re interested in the actual user experience of people talking to each other, not just a subset of fancy numbers quoted for things like network transit times that only have a partial bearing on user experience.
The way we measure this stuff is to look not just at how long it takes for packets to traverse networks, or even at what happens inside our software. Rather, we're interested in how long it takes for the electrical signal from the microphone on one device to come out as an electrical signal on the speaker of the other device.
To do this we have a pretty complex setup in our lab but are going to depict it here with something that’s a little easier to understand.
First, we have three Engage-powered devices in our test setup: a Dell XPS laptop running Windows 10 (a pretty standard Windows box), a Samsung Galaxy S9+ running Android 9 (kinda close to top of the line for Android devices these days), and a Samsung Galaxy J3 running Android 6 (a super-cheap Android phone that you can pick up for just a few bucks).
These devices are connected to an inexpensive, consumer-grade Netgear AC1000 WiFi access point because we can't expect all our customers to have the latest and greatest network hardware. To minimize interference on the network as much as possible from other devices, we've isolated the network to only these three devices. (Though ... see above for our experience with this WiFi access point.)
At a very high level, this is what this setup looks like. (Note the little representations of three gray boxes from the previous diagram inside each device representing the stuff we described above.)
To perform the hardware measurements, we connected a logic analyzer along with an oscilloscope to our devices at various times of our testing. On the transmitting side we attached a probe from the logic analyzer to the microphone line to pick up electrical signals sent from the microphone. On the receiving side we attached a second probe to the speaker output line.
The microphone probe was hooked to channel 0 on the logic analyzer while the speaker probe was hooked to channel 1 on the analyzer.
To perform each measurement, we generated multiple sharp, square-wave signals on the microphone and heard them played out on the speaker. The logic analyzer captured and timestamped each of the inputs - the microphone on channel 0, the speaker on channel 1 - and represented those signals in the oscilloscope output. We then measured the timings between the start of the signal on TX side on channel 0 and the start of the signal on the RX side on channel 1. The difference is the actual latency from speaking to hearing.
Here's a screenshot of what such a measurement looks like. In this example the delay from the start of the TX signal (channel 0) to the start of the RX signal (channel 1) is 0.4552096 seconds - around 455 milliseconds.
We repeated the same test multiple times, each time measuring the timings. Finally, for each combination of CODEC type, framing size and receive-side jitter buffer tuning, we averaged the individual values to arrive at the numbers quoted below.
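In code form, turning those captures into a latency figure is simple arithmetic (a hypothetical sketch - the timestamps below are made up, and a real logic analyzer's export format will differ):

```python
# Hypothetical post-processing of logic-analyzer captures: channel 0
# timestamps mark signal onsets on the TX microphone line, channel 1
# the matching onsets on the RX speaker line.

def mouth_to_ear_ms(tx_onsets_s, rx_onsets_s):
    """Average mouth-to-ear latency from paired onset times (seconds)."""
    deltas = [rx - tx for tx, rx in zip(tx_onsets_s, rx_onsets_s)]
    return 1000.0 * sum(deltas) / len(deltas)

# Made-up onset times in the ballpark of the screenshot above.
tx = [1.000000, 3.000000, 5.000000]
rx = [1.455210, 3.461003, 5.449874]
print(f"{mouth_to_ear_ms(tx, rx):.1f} ms")   # ~455 ms for these values
```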
The Real-World Numbers
Here's what things looked like for this test setup using:
- Opus CODEC @ 16kbps VBR on the transmitter
- Acoustic Echo Cancellation disabled on the receiver
All measurements are in milliseconds
Framing | RTP Factor | S9 -> J3 | J3 -> S9 | J3 -> J3 | Win -> S9 | S9 -> Win | Win -> J3 | J3 -> Win |
---|---|---|---|---|---|---|---|---|
20ms | 3 | 450 | 300 | 440 | 400 | 330 | 500 | 340 |
20ms | 8 | 480 | 300 | 430 | 600 | 570 | 550 | 460 |
60ms | 3 | 560 | 310 | 480 | 340 | 330 | 600 | 350 |
60ms | 8 | 600 | 400 | 550 | 340 | 580 | 520 | 590 |
You might think we'd be embarrassed by these numbers. We're not! These numbers are high for a very good reason.
In our test setup, we purposefully used a WiFi access point that was imposing around 120 milliseconds of latency due to RF interference and, basically, being an el-cheapo network device (it costs around $50). We also configured Engage to run in its default mode, where it assumes a disadvantaged (i.e. "bad") network. On top of that, we used some pretty grim equipment as end-user devices.
It certainly makes things look bad. But we're really quite proud of these numbers considering the nasty environments we're asked to operate in. In fact, we prefer to test and quote numbers for this type of setup because it's almost always the case that real-world deployments are not running on super-energized networks using devices that usually exceed the budgets of most customers - at least at scale. So, we'd rather lead with numbers that represent what you'll find in your environment than numbers cooked up under our laboratory conditions.
The Sexy Numbers
But ... let's quote some sexy numbers to really impress you (and make ourselves feel better).
For this (still somewhat realistic) test case we used a pricier WiFi access point (one that costs a great deal more than 50 bucks) with Quality of Service enabled for our Engage voice traffic. We still used the Samsung S9+ (it ain't too bad) on the one side but used an Apple MacBook Pro running OSX Catalina on the other end. Then, we told Engage to run in low-latency mode - which is really just telling it that the network and all/most of the hardware that other Engage endpoints are running on are awesome and there's not too much to worry about.
Framing | RTP Factor | S9 -> Mac OSX |
---|---|---|
20ms | 3 | 185 |
20ms | 8 | 190 |
60ms | 3 | 215 |
60ms | 8 | 220 |
Looks good huh!? Of course it does. But, really, don't expect this on every network all the time - counting on it is just not a good idea. In fact, while configuring Engage for low-latency mode makes us look awesome, it increases the risk that we may lose bits of audio if the network or end-users' devices don't behave themselves. While that kind of thing may be OK for consumer-style and even some enterprise-grade environments, we assume that our stuff is running in life-critical environments where even just a little bit of audio loss could be catastrophic. So, as a rule, we err on the side of low-loss, crystal-clear audio rather than low latency.
But, But, But …
What’s that you say ... ?
- "These numbers are WAY too high!!" No, they're not high, they're REAL! As we alluded to at the beginning of this article, be very wary of latencies that get quoted. Be sure to check precisely what is being quoted as latency. Ask if the latency is true "mouth-to-ear" and not just a piece of the latency pie.
- "Two-Way Radios have super low latencies!" We agree, they're terrific! But they have to run on dedicated networks that aren't available everywhere you go, are purpose-built hardware, cost a whole lot more than software running on consumer devices, and do a whole lot less.
- "I bet that apps like WhatsApp, Slack, Zoom, and Teams kick your butt!" We bet they don't on the kinds of networks we have to operate on! Measure their signal-to-signal latencies on the same networks and same devices and get back to us.
- "OK, these kinds of latencies may be OK for Push-To-Talk, but there's no way it'll work for full-duplex communications like on cell phones." Hmm, try this. Use your cell phone to call another person's cell phone who is in the same room as you. Now, say "one". When the other person hears your voice on their phone, have them say "two". When you hear them on your phone, say "three". Go from there - it's kind of a fun experience. (Hint: it's around 250 ms on the latest and greatest hardware.) Look at their lips while you're doing all this. (And remember that this is happening on carrier networks that cost billions of dollars and on cellphones that are at the bleeding edge of technology.)
PS
While we were about it, we thought we'd do some informal latency/quality testing with other stuff out there.
Here's how that looked on 2 brand new iPhone 11 Pros:

- Cellular call on the same AT&T 5G tower: 250 milliseconds
- WhatsApp WiFi voice call: 250 milliseconds
- Slack WiFi voice call: 300 milliseconds
The cellular call audio quality was excellent (as one might expect from a bazillion-dollar cellular network using highly-tuned cellular phone hardware). Both the WhatsApp and Slack calls sounded great when all was good with the network but, the moment there was a glitch on the network, audio became choppy with noticeable audio loss and a corresponding delay while the apps recovered.