Engage Bandwidth

Introduction

We're often asked what kind of network bandwidth Engage uses for communications. And the answer, as is usual with techie stuff, is "it depends".

There's a lot that goes into determining how much traffic is going to be produced on your network, when that traffic is going to be produced, and who's going to be producing it. And every use-case for every customer environment is going to be different.

Fundamentally, you have to have a pretty good handle on how many users you'll have, how many talk groups they'll be active on (listening as well as talking), where they're located, and what the network looks like between those folks.

A key criterion, though, is knowing what your network transport capabilities are - specifically, whether or not your network supports bi-directional IP multicast.

This is super-important to know because, in a multicast environment, the underlying network infrastructure takes care of efficient distribution of packet traffic. In such a multicast environment, there are no centralized servers that process and/or forward traffic between users. Rather, users' traffic propagates in what is effectively a one-to-many setup (actually, it's more like "many-to-many" but let's not overcomplicate things).

If your network does not support multicast, you'll need some way to create a multicast overlay - a simulated multicast environment. While there are a few ways to do this using 3rd-party tools, our recommended way is to use Rallypoints: high-performance packet forwarders optimized for Engage-specific operations.

We're not going to get into the details of IP multicast network design here. Nor are we going to delve into Rallypoint architecture options. Rather, we just want to ultimately present empirical data showing you how much traffic you can expect on your network when using Engage.

Streams

First off, let's understand that when someone talks, they produce a stream of packets. Those packets contain voice data encoded ("compressed" if you will) with a CODEC (COder/DECoder). The output from the coder is chopped up into blocks (we call them "frames") of audio, and sent out on the network. In any conversation, there will obviously always be at least one stream of packets.

Multicasting

Now, let's imagine that we have 4 other people listening on the group that our user is talking on. In a multicast environment, the traffic on your network is still a single stream - because the multicast network infrastructure makes sure not to needlessly duplicate traffic to the receiving endpoints. So, whether we have 4 people listening, or 400, or 4,000, we still only have one stream. And that's terrific because it means we can scale our user base without worrying too much about how our network is going to be affected.

Don't forget, though, that this is a stream for a single talk group. Each additional talk group where someone is transmitting will create another stream. So, for example, if we have someone talking on the Alpha group and another person talking on the Bravo group, we'll have 2 streams on the multicast network. If someone talks on Charlie at the same time, we'll have 3 streams. And so on - you get the idea.

We can summarize this with a rule of thumb:

In a multicast environment, the number of streams (and therefore bandwidth utilization) is directly proportional to the number of people speaking.

Unicasting

But what if our network doesn't support multicast? Maybe the enterprise network isn't engineered for multicast or we have to traverse public networks (such as the Internet) where multicast isn't available? What happens then?

Here the distribution of traffic from the person speaking to the people listening has to go via some sort of central point that all the users are connected to. We call this "unicasting".

Generally folks refer to such central points as "servers" because that's a pretty well-understood term and, technically, the server does quite a bit of work to coordinate who gets what traffic, who gets let into the system, who can talk under what conditions, and so on.

In an Engage environment we do have "servers" of sorts that handle the task of forwarding traffic between users. But that's pretty much all they do - they simply forward traffic. Things like access control, encryption management, floor management, audio transcoding, and other functions usually conducted by servers are delegated to the Engage endpoints themselves, because Engage operates under the assumption that the network is multicast in nature and that there are no servers. But let's not get further down the path of why Rallypoints are better, why they're not servers, what their inner workings are, or what their capabilities are. Let's just see Rallypoints as a means to create a "multicast-like" environment on non-multicast networks - with the understanding that, when using Rallypoints, network traffic utilization is pretty similar to what you'd see in traditional, server-based unicast systems.

Back to the discussion...

In a unicasting environment, a stream coming from someone and being sent to others through a central point creates multiple streams on the network: one stream coming from the person transmitting, plus one stream for each person receiving it.

We can make another rule of thumb:

In a unicast environment, the number of streams (and therefore bandwidth utilization) is directly proportional to the number of people speaking AND listening.

Obviously unicasting is WAY less efficient than multicasting and will not scale cleanly in the same way that multicasting does. So now we really do have to concern ourselves with how many people may be sending traffic as well as how many people will be receiving it. And, of course, we have to be concerned (as with multicasting) with how many talkgroups we have in the system because that's a big multiplier in calculating bandwidth utilization.
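To make these two rules of thumb concrete, here's a minimal sketch (in Python, purely for illustration - the function names and inputs are our own, not Engage API code) that counts streams either way:

```python
# Stream counts per the two rules of thumb above. Illustrative only.

def multicast_streams(speakers: int) -> int:
    # One stream per active speaker; listener count doesn't matter
    # because the multicast network duplicates packets for us.
    return speakers

def unicast_streams(speakers: int, listeners: int) -> int:
    # Each speaker sends one stream to the central point (Rallypoint),
    # which then sends one copy to every listener.
    return speakers + (speakers * listeners)

# One person talking, 400 listening, on a single talk group:
print(multicast_streams(1))     # -> 1 stream
print(unicast_streams(1, 400))  # -> 401 streams
```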

How Do We Get The Numbers

Packet Structure

We're going to talk a bit below about payloads and headers and "taxes" and other stuff. Keep these drawings in mind as you read through.

Clear (Unencrypted) UDP Packet

[Image: clear (unencrypted) UDP packet structure]

Secure (Encrypted) UDP Packet

[Image: secure (encrypted) UDP packet structure]

A Variety Of CODECs

Remember how we spoke about CODECs and how the output from those CODECs is chopped up and sent over the network? Well, different CODECs produce different outputs, trading quality against bandwidth utilization. The choice of which CODEC to use is based on a number of things, including the computational complexity on the devices processing the audio, the quality of the audio, the network bandwidth required to transport the CODEC's output, and interoperability with systems that are limited to certain CODECs.

Historically, computational complexity was always a big factor, as CPUs were pretty slow by today's high-performance hardware standards, and choices were therefore often made to go with CODECs that would not kill hardware - particularly on end-user devices. So we have a historical proliferation of CODECs such as G.711, which uses 64kbps of bandwidth with very low computational complexity. But that stuff is mostly legacy. With newer hardware we're free to choose more computationally intensive CODECs that give us an even better audio experience and vastly reduced bandwidth requirements. For example: G.711 is fixed at 64kbps and produces high-quality audio at low complexity. Opus (the darling of the CODEC world for the past few years) produces audio quality that is even better than G.711 with a much-reduced bandwidth footprint. In fact, Opus operating at 16kbps produces higher audio quality than G.711 operating at 64kbps. And, with Engage, you can go all the way down to 6kbps on Opus! Pretty cool huh!?

OK, great, let's take a look at Opus at 16kbps (the configuration we most often recommend). You'd think that someone transmitting audio using that configuration is going to use 16kbps on the network - right? Wrong!

There's a whole lot of devil in the details. The first thing to consider (as is true for so much in life) is taxes. Yes, taxes!!

But not taxes as in what you pay to the government, these are "taxes" paid to the network in order to get your packets from point A to point B. (It's really more like the coin you'd need to pay the ferryman Charon to cross the River Styx.)

Packet Taxes

Every time a packet of audio (the payload) is sent from a device, it includes information that assists the network in moving the packet around, as well as information the far-end receiving device needs in order to know how to handle the payload. It's not unlike writing a letter to someone, where the letter is the payload and the envelope it's mailed in is the information the postal service needs to get your letter to the recipient.

This information is tacked on to the beginning of the transmitted packet in the form of headers. (See the drawings above.)

Packet Headers

There are quite a few headers that travel with the payload. First is a header that the receiving endpoint uses to know what to do with the payload. For most of Engage's traffic this header is the Real-time Transport Protocol, or RTP, header.

Then, when using IP multicast, that resulting RTP packet (consisting of the payload "wrapped" in RTP) is further wrapped in a User Datagram Protocol, or UDP, header.

Then ... that UDP packet is wrapped in an Internet Protocol, or IP, packet.

Finally, the IP packet is wrapped in a packet (sometimes called a frame just to confuse everyone) specific to the underlying transport. This transport could be wired or wireless Ethernet or some other construct. (We'll just talk about Ethernet here - which is pretty much what most networks use anyway.)

Each of these headers uses a certain number of bytes for each packet, obviously, and therefore bandwidth calculations need to take this into account.

You're not going to like hearing this, but packet overhead (taxes) generally takes up more bandwidth than the actual payload! For example: let's say your CODEC produces a payload of 10 bytes because it's super-optimized and operates at a very low output rate. Your resulting transmitted packet (what actually goes out over the network) consists of those 10 bytes + 12 bytes for RTP + 8 bytes for UDP + 20 bytes for IP + 14 bytes for Ethernet. So, sending a payload of 10 bytes results in 64 bytes (10+12+8+20+14) being sent!!! Just crazy huh!? It's not unlike paying a mortgage where the taxes (the headers) form the majority of your monthly payment and your paydown on principal (the payload) forms a very small part.
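Here's that arithmetic as a little Python sketch. The header sizes are the standard fixed minimums for RTP, UDP, IPv4, and Ethernet II used above; real networks may add extras such as IP options or VLAN tags:

```python
# Wire size of a clear (unencrypted) Engage audio packet.

RTP_HDR = 12    # Real-time Transport Protocol header
UDP_HDR = 8     # User Datagram Protocol header
IPV4_HDR = 20   # IPv4 header, no options
ETH_HDR = 14    # Ethernet II frame header

def wire_size(payload_bytes: int) -> int:
    return payload_bytes + RTP_HDR + UDP_HDR + IPV4_HDR + ETH_HDR

print(wire_size(10))  # -> 64 bytes on the wire for a 10-byte payload
```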

But wait, there's more ...

That's the fixed, guaranteed tax we have to pay the network to get our packet moved around. No matter what our payload is, and (for the most part) regardless of its size, we're going to have these headers on our packets chewing up our bandwidth.

The "more" here is encryption.

Encryption

So far we've talked about traffic that is not encrypted. But what if we want to secure our data and, ideally, hide as much of our data as possible from the bad guys?

The answer here is to encrypt our payload and, in Engage's case, the RTP header as well. (At the Engage level we can't encrypt the UDP, IP, or Ethernet headers as the network won't know what to do with that.)

For performance and security reasons Engage uses the Advanced Encryption Standard, or "AES", encryption algorithm, which produces encrypted data in 16-byte blocks. So, if we encrypt 1 byte of data, our encrypted output is 16 bytes. If we encrypt 15 bytes, it is still 16 bytes. If we encrypt 16 bytes, however, the output is 32 bytes. If we encrypt between 17 and 31 bytes, the output is still 32 bytes. Encrypting 32 bytes will produce 48 bytes of output. And so on. Basically, AES encrypts on 16-byte "boundaries", and that extra padding (which varies in size) is going to affect the size of the data we're transmitting.

Also, Engage operates AES in what's known as Cipher Block Chaining, or CBC, mode. And CBC requires that an extra blob of random data is added to the encrypted output - 16 bytes of random data in fact. This is known as the Initialization Vector, or IV.

Together, the 16-byte block operation and the IV added to each packet substantially increase the size of the packet - and therefore our bandwidth utilization. (It's all pretty scary when you come to think of it!)
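As a sketch, assuming the usual PKCS#7-style padding behavior described above (output always rounds up to the next 16-byte boundary, which is why 16 bytes in gives 32 bytes out), the per-packet encryption overhead looks something like this:

```python
# Encrypted size under AES-CBC with PKCS#7-style padding (our
# assumption, matching the block behavior described in the text),
# plus the 16-byte IV carried in every packet.

AES_BLOCK = 16  # AES block size in bytes
IV_SIZE = 16    # CBC initialization vector, sent with each packet

def encrypted_size(plaintext_bytes: int) -> int:
    padded = (plaintext_bytes // AES_BLOCK + 1) * AES_BLOCK
    return IV_SIZE + padded

for n in (1, 15, 16, 31, 32):
    print(n, "->", encrypted_size(n))
# 1 -> 32, 15 -> 32, 16 -> 48, 31 -> 48, 32 -> 64 (including the IV)
```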

Packet Framing

OK, time to take a breather - there's a lot of stuff going on.

.

.

.

.

.

.

Breath taken. Let's continue.

We now know that the overhead placed on our poor little payload is quite substantial just to get it plopped into a packet and sent over the network. But we have some wiggle room. Instead of little payloads, we can send bigger payloads! Think about it: if our overhead for every unencrypted packet is 54 bytes (see above, where the payload size is 10 bytes and the overhead is 54, resulting in a packet size of 64 bytes), the best way to reduce bandwidth utilization is to reduce the number of packets being sent. Straightforward, no?

Let's say that the 10 bytes of payload constituted 20 milliseconds of audio. That means that for every 1000 milliseconds (1 second) of audio, we'd be sending 50 packets (1000/20). Now, the payload in each of those packets is 10 bytes and the overhead is 54 bytes. Therefore, every second, we'd be sending 500 bytes of payload (10 bytes of payload * 50 packets) and 2,700 bytes of overhead (54 * 50). The total number of bytes we'd send every second is therefore 3,200 bytes (500 + 2,700).

[That's a horrifying set of numbers that just makes programmers' skin crawl.]

But ... let's say we change the payload size and, instead of sending audio every 20 milliseconds, we send at 40 millisecond intervals. Now the number of packets we send in a second halves to 25. The total amount of payload we send is still 500 bytes (now 20 bytes per payload, but only 25 packets per second - i.e. 20 * 25 = 500) but the total overhead is now only 1,350 bytes (54 * 25). That's a full 50% reduction in taxes! (Wouldn't it be nice if we could get those kinds of tax breaks from the government 😀!!) This is much better.

The tradeoff, though, is that we now have to wait 40 milliseconds before sending each chunk of audio; and it means receivers have to wait 40 milliseconds to receive it. That increases audio latency, which is not desirable. But in a Push-To-Talk environment that latency is not perceived by humans, so we're good.

There's a bigger risk here, though: if a 20 millisecond packet gets lost on the network, the receiver only loses 20 milliseconds of audio; if a 40 millisecond packet is lost, the audio dropout is larger. But, again, we're talking about 40 thousandths of a second - which most humans won't perceive.

We can go even bigger: we can send 60 millisecond packets, or 80, or even 100 millisecond packets. We'll see our taxes drop dramatically and life will be better than we could ever have imagined. Consider if we sent 100 millisecond packets: the payload size for 100 milliseconds would now be 50 bytes (10 bytes per 20 milliseconds * 5 ... because 100 / 20 = 5), and we'd only be sending 10 packets per second. Therefore, in a second, our bandwidth consumption would be 500 payload bytes (as always) plus 540 bytes of taxes. A total of 1,040 bytes per second vs 3,200! That ain't bad.
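Here's the framing arithmetic from the last few paragraphs in one small sketch, using the same 10-bytes-per-20-milliseconds CODEC output and 54-byte clear-UDP tax:

```python
# Bytes per second of a clear (unencrypted) UDP stream at various
# framing intervals. The payload rate is fixed; only the tax changes.

OVERHEAD = 54          # RTP (12) + UDP (8) + IPv4 (20) + Ethernet II (14)
PAYLOAD_PER_SEC = 500  # 10 bytes per 20 ms of audio = 500 bytes/second

for framing_ms in (20, 40, 60, 80, 100):
    packets_per_sec = 1000 / framing_ms
    total = PAYLOAD_PER_SEC + (OVERHEAD * packets_per_sec)
    print(f"{framing_ms:3d} ms framing: {total:,.0f} bytes/second")
# 20 ms -> 3,200 | 40 ms -> 1,850 | 100 ms -> 1,040
```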

All this is equally relevant if our streams are encrypted - but with extras added on for the encryption IV and padding. Basically, we can treat encryption as simply another tax - an optional one.

IMPORTANT

As these numbers get bigger, and networks drop or delay packets, audio quality does start suffering. So we need to be very careful about which numbers we choose.

The considerations above ultimately drive us to packet framing, which is simply the sizing of the audio payload that we transmit. You'll see in the tables below how framing changes bandwidth consumption. These numbers are as accurate as possible but, as with anything technical, there's always some wiggle room that needs to be considered.

For instance, CODECs such as G.711 and GSM (which Engage supports as of this writing) are fixed in size and always produce a reliably-sized output. Other CODECs such as AMR and Opus (both of which Engage also supports) are variable in size - meaning that they actively try to reduce the amount of traffic based on how much the sender is saying, their volume level, the complexity of their voice, and so on. With these so-called VBR (Variable Bit Rate) CODECs you may sometimes see lower output sizes than listed in the tables; you should not often see higher ones. Best, though, to add a "fudge factor" of 1kbps or so to the numbers you'd like to use.

UDP Or TCP

There's one more thing to talk about - and that's whether we're using UDP or TCP (Transmission Control Protocol).

When multicasting, Engage uses UDP as its transport, and the numbers and discussion above apply. However, when unicasting via Rallypoint connections, Engage uses TCP as its transport. Like UDP, TCP has overhead associated with the protocol - link establishment, packet acknowledgements, heartbeats, and whatnot. (TCP is pretty complex and we won't get into it here. It's best to check out resources such as Wikipedia for more info.)

Also, when we use Rallypoints, TCP is used to convey packets over Transport Layer Security, or TLS. TLS imposes further overhead which needs to be taken into account.

The niggly thing with TCP and TLS, though, is that calculating bandwidth utilization is a bit of a dark art, as bandwidth usage in that scenario depends on current network conditions and can vary somewhat. So, what we've done in the table showing TCP over TLS links is to calculate utilization as best we can and then compare it to multiple measurements of actual traffic under Internet conditions. We've then combined those calculations with the observed values to arrive at what we believe best represents unicast traffic utilization - with the caveat that your mileage may vary based on prevailing network conditions.
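For a rough feel for the unicast tax, here's a back-of-the-envelope sketch using protocol-minimum header sizes for TCP and TLS 1.3. It deliberately ignores retransmissions, ACKs, record coalescing, and TCP options - all of which move the real numbers around (which is exactly why we combined calculations with observed measurements) - and it assumes one TLS record per audio packet, which is our assumption rather than a statement of how Engage actually frames its records:

```python
# Very rough per-packet wire size on a TCP + TLS 1.3 unicast link.
# Assumes one TLS record per audio packet and no TCP options; real
# traffic will differ with prevailing network conditions.

TLS13_RECORD = 5 + 1 + 16  # record header + inner content type + AEAD tag
TCP_HDR = 20               # TCP header, no options
IPV4_HDR = 20
ETH_HDR = 14

def unicast_wire_size(payload_bytes: int) -> int:
    return payload_bytes + TLS13_RECORD + TCP_HDR + IPV4_HDR + ETH_HDR

print(unicast_wire_size(10))  # -> 86 bytes vs. 64 for the clear UDP example
```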

The Numbers

Finally! Here's the meat of this document - the tables showing bandwidth utilization for the CODECs supported by Engage. We've split these into 2 tables - one for UDP multicasting and one for TCP/TLS unicasting. In each table we show the breakdown per CODEC with different packet framings and, for each of those, what things look like if your traffic is clear or encrypted.

It's a bit of an eye-chart so you may want to zoom in on this stuff if your screen is a little small.

Multicast UDP - IPv4 over Ethernet II

[Image: bandwidth table - multicast UDP, IPv4 over Ethernet II]

Unicast TCP - IPv4 over Ethernet II & TLS 1.3

[Image: bandwidth table - unicast TCP, IPv4 over Ethernet II and TLS 1.3]

Another View

UPDATE: Recently (yesterday in fact) we added support for Speex to Engage. Speex is considered a "legacy" CODEC these days as it's been superseded by Opus (which just happens to be Engage's preferred CODEC). Speex, though, is still used in many systems deployed by enormous organizations such as governments, so we figured it'd be a good idea for our stuff to work with that stuff. So we added Speex.

Now, there are quite a few reasons that Speex was (and still is) so popular - including patent-free licensing, open source code, portability, and excellent quality. One of the real biggies, though, is that Speex can encode audio waaaay down for very low bandwidth environments. For example, Speex audio encoded at 2.15kbps is not too bad and really useful in bandwidth-constrained environments. (Opus, on the other hand, will only go down to 6kbps, and AMR to 4.75kbps.)

"That's wonderful" you say - let's use Speex all the time and save ourselves lots of bandwidth. Well, yeah, sorta. As you saw above, we reach a point of diminishing returns in terms of quality over bandwidth savings. Also, just because the CODEC is compressing the hell out of the audio; the tax described above still puts a damper on things.

So we decided to muck around with Wireshark a little and do a comparison between Opus packets and Speex packets.

In this setup, we had an Engage client configured with two groups: the first used Speex, the other Opus. The Speex group was configured to operate at Speex's lowest rate of 2.15kbps, while the Opus group was configured for the Engage default of 16kbps. (Yeah, that's like 7 times the bandwidth usage of Speex at 2.15kbps, but just bear with me on this.)

Both were configured for 60ms framing so that at least the IP tax would be the same for both groups.

Then we ran two tests. The first involved the groups transmitting with encryption turned OFF. Then we did it again, this time with encryption turned ON.

Here's the I/O graph that Wireshark gave us.

[Image: Wireshark I/O graph comparing Opus and Speex traffic, clear and encrypted]

On The Left

Here we see two lines: red being the unencrypted Opus traffic, blue being the unencrypted Speex traffic. Speex clearly uses a whole lot less bandwidth than Opus. That's expected of course. However, even though we're encoding Speex at 2.15kbps, the IP tax results in bandwidth utilization of a little over 10kbps!

Opus encoding at 16kbps results in bandwidth utilization of around 25kbps.
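As a sanity check, we can roughly reproduce those ballpark figures from the earlier tax arithmetic, assuming 60ms framing and the 54-byte clear-UDP overhead (the measured values run slightly higher, as you'd expect on a real network):

```python
# Predicted on-the-wire rate for a clear UDP stream at 60 ms framing.

OVERHEAD = 54                # RTP + UDP + IPv4 + Ethernet II, in bytes
PACKETS_PER_SEC = 1000 / 60  # ~16.7 packets per second

def wire_kbps(codec_kbps: float) -> float:
    payload_per_packet = (codec_kbps * 1000 / 8) * 0.060  # bytes per frame
    return (payload_per_packet + OVERHEAD) * PACKETS_PER_SEC * 8 / 1000

print(f"Speex @ 2.15 kbps -> ~{wire_kbps(2.15):.1f} kbps on the wire")
print(f"Opus  @ 16 kbps   -> ~{wire_kbps(16):.1f} kbps on the wire")
# Roughly 9.4 and 23.2 kbps - in the same ballpark as the graph.
```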

On The Right

The right side of the graph shows the same thing but, this time, we encrypted both groups. You'll see that crypto adds about 2.5kbps to the bandwidth usage.

When we look at the audio payload only, crypto roughly doubles the payload size for Speex at 2.15kbps whereas, with Opus producing a 16kbps stream, crypto adds around 15%.

In The Middle

The middle portion of the graph represents these findings as a bar. In the case of Speex, the audio payload is the blue block. For Opus, the audio payload is the red block. Those are understandably different, of course.

In both cases, though, the IP tax is pretty much the same - the pink block. As is the crypto portion.

Bottom Line

The graph above is a great way to see that even when you're massively sacrificing quality (Speex at 2.15kbps sounds terrible compared to Opus operating at 16kbps), the final bandwidth savings are not as dramatic as you might expect.

In the worst case (the right side), encrypted Speex at 2.15kbps comes in at around 14.5kbps, and Opus at 16kbps comes in at around 30kbps. So although Speex at 2.15kbps is 7 times less bandwidth hungry than Opus at 16kbps (close to 90% more bandwidth efficient), the reality is that taxes reduce our savings to around 50% versus the almost 90% we'd like to get.
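Running the numbers makes the point, using the codec bitrates versus the measured on-the-wire figures above:

```python
# CODEC-level saving vs. actual on-the-wire saving, per the graph.

codec_saving = 1 - (2.15 / 16)  # Speex vs Opus, payload only
wire_saving = 1 - (14.5 / 30)   # measured, encrypted, on the wire

print(f"CODEC-level saving: {codec_saving:.0%}")  # ~87%
print(f"On-the-wire saving: {wire_saving:.0%}")   # ~52%
```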

(By the way, this is, in fact, the bottom line. Literally and figuratively!)
