CS AWS Load Testing - rallytac/pub GitHub Wiki

Case Study: Load Testing in Amazon Web Services

Introduction

Customers and partners often ask us for guidance and assistance in setting up testing environments where they can exercise their Engage-based applications, determine what their infrastructure needs to accommodate, set expectations for performance, and estimate costs for supporting user bases.

Here's a case study for a configuration we put together for a partner where they're using Engage-based clients on a variety of platforms connected into a Rallypoint mesh running in Amazon Web Services.

Architecture

Below is a very simplistic representation of what we created for this test. We have a bunch of clients (c1 - cN), a load balancer, and a number of Rallypoints (rp1 - rpX) that make up a Rallypoint mesh.

Clients can be anywhere from 1 instance (c1) to whatever number of clients is needed for the test (cN). So, if we were to test with 350 clients, N would be 350 and we'd have c1-c350. All of those clients hit the load balancer which then divvies up the connections to the Rallypoint mesh.

Similarly, our Rallypoint mesh consists of one or more Rallypoints. If we only have one Rallypoint, only rp1 would be fired up and running. As the mesh grows, rp2 would come online, then rp3, rp4, and so on. So, if we have 8 Rallypoints in the mesh, we'd have rp1-rp8.


                                          +---+ 
                        +---------------> |rp1| <-------------+
                        |                 +---+               |
                        |                   ^                 |
                        |                   |                 |
                 +-------------+            |                 |
  c1-----------> |             |            +------->  +---+  |
  c ... -------> |Load Balancer|-------------------->  |rp2|  |
  cN-----------> |             |            +------->  +---+  |
                 +-------------+            |                 |
                        |                   |                 |
                        |                   v                 | 
                        |                 +---+               |
                        +---------------> |rpX| <------------+
                                          +---+         

The Mesh

For this project, we created a Rallypoint mesh behind an Amazon Elastic Load Balancer performing simple round-robin assignment of client (Engage endpoint) connections into the mesh.  We created a "gold" image of a Rallypoint machine running Amazon Linux and hosting the Rallypoint executable.  This image is configured to read a file shared across all instances that describes the mesh - i.e. which Rallypoints every other Rallypoint should connect to in order to establish the mesh.

This mesh file has new Rallypoint entries added to it as the mesh scales up - i.e. when new Rallypoints need to be added.  And it has entries removed when the mesh scales down - i.e. when Rallypoints are removed from the mesh because the load requirement has diminished.  Every time a new Rallypoint needs to be spun up, we instruct Amazon to spin up a new instance based on this gold image.

At the time of this writing, we are scaling the mesh up and down manually rather than having Amazon Web Services do so automatically.  However, the desire is to make that automatic by having each Rallypoint publish its individual load metrics.  This is easily done with a simple Linux shell script on each image that periodically reads the Rallypoint's status file to pick up key metrics such as number of clients, throughput, realtime media pathways, and so on. These metrics are pushed into Amazon CloudWatch which, in turn, feeds the scaling logic that determines when to spin up new Rallypoints or, as demand drops off, when to take Rallypoints offline.
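To give a flavor of that metrics push, here's a hedged sketch of the kind of shell script we mean. Note that the status file field names and the sample values here are purely illustrative - not the actual Rallypoint status file schema:

```shell
#!/bin/bash
# Hypothetical sketch of a per-Rallypoint metrics push.  The field names
# "clientCount" and "throughputBps" are illustrative only.

STATUS_FILE=/tmp/rp-status-sample.json

# Fabricate a sample status file so the sketch runs standalone
cat > "${STATUS_FILE}" <<'EOF'
{ "clientCount": 2250, "throughputBps": 1875000 }
EOF

# Pull the metrics out with sed so we don't depend on jq being installed
CLIENTS=`sed -n 's/.*"clientCount"[^0-9]*\([0-9][0-9]*\).*/\1/p' "${STATUS_FILE}"`
TPRATE=`sed -n 's/.*"throughputBps"[^0-9]*\([0-9][0-9]*\).*/\1/p' "${STATUS_FILE}"`

# In production this would actually invoke "aws cloudwatch put-metric-data";
# here we just echo the command so the sketch runs without AWS credentials
echo aws cloudwatch put-metric-data \
        --namespace "Rallypoint" \
        --metric-data "MetricName=Clients,Value=${CLIENTS}" \
                      "MetricName=ThroughputBps,Value=${TPRATE}"
```

A cron entry or a simple sleep loop would then run this every few seconds on each Rallypoint instance.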

Load Testing

To test the load and performance of the mesh, we created another gold image that is also running Amazon Linux but, this time, using our "engage-cmd" command-line tool.  Each of these images fires up a test configuration that launches 100 instances of engage-cmd - with each instance of engage-cmd representing a single user.  These client instances are pointed to the load balancer which, in turn, connects the client to an appropriate Rallypoint based on the balancing strategy (currently simple round-robin as described above).

Now, to simulate a reasonably good representation of what one might expect in a real-world environment, we configured engage-cmd to auto-generate and use a mission that consists of a number of groups/channels.  Our thinking here was to have 60% of users on a single active channel, 20% on missions that have 4 active channels, 10% of users on missions with 8 active channels, 5% on 16 active channels, and 5% on missions with 32 channels.

This means that if I were on a single channel mission, that channel would have up to 59 other members. If I were on a mission that has 4 channels, there would be up to 19 others on each of my channels, and so on.
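A quick sketch of that math, per 100-user client machine (remember that every user in a mission joins all of that mission's channels, so members-per-channel is simply the user count for that mission size):

```shell
#!/bin/bash
# Back-of-envelope check of per-channel membership on a 100-user client
# machine, using the 60/20/10/5/5 mission-size spread described above

USERS=100

for SPLIT in "60 1" "20 4" "10 8" "5 16" "5 32"; do
        set -- ${SPLIT}
        PCT=$1
        CHANS=$2

        # Everyone on this mission size shares one mission per machine
        MEMBERS=$(( USERS * PCT / 100 ))
        echo "${CHANS}-channel missions: ${MEMBERS} members per channel"
done
```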

Once the instance of engage-cmd starts up - with whatever mission configuration it has - it proceeds to randomly transmit on any of the audio channels it has assigned to it.  Transmission time is anywhere between 2 and 7 seconds, followed by 10-15 seconds of inactivity before it does so again.

Now, in the real world, people are not transmitting all the time in a push-to-talk environment.  In fact, they spend the vast majority of their time NOT talking.  This varies by sector. For example: in Public Safety, user talk time is typically about 1.5%, while commercial environments are around double that at 3%. Our general assumption is a little larger than that, with a typical user talk time of 5%.  For this load test, however, we assumed that our users are extremely chatty and multiplied our baseline by 5 - resulting in each user transmitting 25% of the time that they use the application.  In other words, in a typical day, our user would spend 75% of their time idle and be talking flat-out for the other 25% - clearly not exactly real-world, but a pretty good number to aim for when doing performance testing.
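As a cross-check, the transmit/idle timings described earlier (2-7 seconds of TX followed by 10-15 seconds of inactivity) do indeed land at roughly that 25% figure:

```shell
#!/bin/bash
# Do the script's TX/idle timings actually give ~25% talk time?
# TX lasts 2-7 seconds (average 4.5s); idle lasts 10-15 seconds (average 12.5s)

TX_AVG_MS=4500
IDLE_AVG_MS=12500

# Duty cycle = average TX time over the average full TX+idle cycle
DUTY=$(( TX_AVG_MS * 100 / (TX_AVG_MS + IDLE_AVG_MS) ))
echo "Average talk time: ${DUTY}%"
```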

Also, frankly, if we were catering to teenagers with this test, we'd probably have to up the talk time to 65%!

The Results

There's no excitement in simply describing what we did. We've got to tell you how this all worked out...

Well, the results were terrific! We decided to see what things would look like if we went with a really low-cost option for the Amazon Linux instance running the Rallypoints. Specifically, we ran our Rallypoints on pretty small x86 64-bit machines with just 2 virtual CPUs and 2GB of RAM - so generally about as powerful as a modern-day cell phone ... for a server!!!

While we ran a whole bunch of Rallypoints for giggles, we really only needed very few - 2 in fact.

We then fired up 45 instances of the client (engage-cmd) images. Each of those ran 100 instances of engage-cmd - resulting in a final user count of 4,500.

Those 4,500 users connected into our little Rallypoint machines at around 20 clients per second, performed all the heavy-duty mathematics necessary for Rallypoint and Engage client authentication, and started yapping away following their automated scripts.
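Incidentally, at that connection rate the ramp-up only takes a few minutes - a quick back-of-envelope:

```shell
#!/bin/bash
# How long does the connection ramp take at ~20 clients per second?

USERS=4500
RATE=20

SECS=$(( USERS / RATE ))
echo "Ramp time: ${SECS} seconds (~$(( SECS / 60 )) minutes)"
```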

Once everything was cooking with all the clients connected and everyone talking like crazy (remember we ran this test at 25% TX rate vs a more reasonable 3% or even 1.5%), CPU utilization on those little Rallypoint machines levelled off to average at around 30%.

On both machines, memory consumed (including the Linux operating system) came in under 150MB. So that 2GB of RAM is really not necessary!

Can you see this ... ?

Here's a screenshot of two SSH terminal windows for a test we ran with 2,000 users - one for each Rallypoint. Both are running the htop utility showing CPU and memory stats in realtime. The terminal on the left is split into 3 panes - the top being the htop command, and the two other panes showing the output of a bash script named rpstatus.sh. This script reads the Rallypoint status file and displays key information. Columns in the rpstatus display are as follows:

| Column | Description |
|------------|-------------|
| Timestamp  | The date and time (UTC) when the stats were recorded. |
| TP Rate    | Throughput rate of the Rallypoint - in bits per second - when the stats were recorded. |
| TP EMA     | The smoothed exponential moving average of the throughput rate across the process lifetime. |
| Actv Conns | Number of active network connections - including clients, peers, and multicast reflectors. |
| Tot Conns  | Total number of process-lifetime network connections - including clients, peers, and multicast reflectors. |
| HRC Rate   | Number of health checks per second received from the load balancer. |
| Clients    | Number of active client connections. |
| Peers      | Number of active peer connections (connections to other Rallypoints in the mesh). |

NOTE: You may need to open the image in a new tab or window in your browser to actually see the detail.

[Screenshot: two SSH terminals running htop and rpstatus.sh - one per Rallypoint]
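As an aside, the TP EMA column is a standard exponential moving average. Here's a minimal sketch of that kind of smoothing - the 0.1 weight is our illustrative assumption, not necessarily what the Rallypoint actually uses:

```shell
#!/bin/bash
# Exponential moving average over a series of throughput samples (bits/sec).
# The 0.1 smoothing factor is illustrative only.

SAMPLES="1000000 1200000 900000 1100000"

EMA=$(echo ${SAMPLES} | awk '{
        ema = $1
        for (i = 2; i <= NF; i++)
                ema = (0.1 * $i) + (0.9 * ema)   # new sample nudges the average
        printf "%.0f", ema
}')

echo "Smoothed throughput: ${EMA} bps"
```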

The Very Geeky Stuff

OK, now if you're REALLY into the geeky stuff and want to do something similar, here are the scripts we used to simulate the client-side stuff.

We have two scripts here - one written for the bash shell, run either from a terminal or as a daemon configured in systemd on the Amazon Linux host; the other being the script that engage-cmd runs to execute our test cases.

The bash script (load.sh)

First, the bash shell script that fires up the instances of engage-cmd. The meaty part of the script starts out by checking to see whether we're running inside our Amazon instance, on a developer's Mac, or on a developer's Linux box. Based on that determination, the path to engage-cmd changes. But that's all that this piece is doing.

Now, as described above, engage-cmd is told to auto-generate a mission configuration. For this, we need a unique passphrase which Engage will use to auto-create the mission - making the mission the same for all users on a particular machine who have the same number of channels in their mission. We construct a passphrase portion consisting of a random number, assign it to the INSTANCE_PASSPHRASE_PORTION variable, and hold onto it.

Once this is done, the script goes through a loop - up to the number of instances we want to start - firing up engage-cmd with parameters specific to that loop iteration.

The first part of the loop determines, for that loop iteration, which mission configuration is going to be used: 1, 4, 8, 16, or 32 channels based on the 60% / 20% / 10% / 5% / 5% spread described above.

Then, we finally construct the mission-generation passphrase (called PP in the code) by combining the computer's host name, INSTANCE_PASSPHRASE_PORTION, and the channel count we want. This information is coded into a string called MISSION_PARAMS along with some other goodies which we'll pass to engage-cmd. These other goodies are flags indicating we don't want crypto for the channels and that we don't want timelines - there's no need to tax the client machines' CPU with stuff we don't care about for this test. We also indicate that we don't want Engage to automatically switch to multicast failover mode if it can't get connected to a Rallypoint.

Next up, we create some identity information for our "user". This is simply a few strings that share a random number and, while not really necessary for our test, they're useful when we're debugging and testing the test (yeah, that sounds strange, but one does need to debug the stuff that's doing the debugging).

Almost there ...

Just before we actually launch the engage-cmd instance, we pause (sleep) for a second so that we don't try to fire up 100 processes in quick succession and overload the CPU of the client machine. This is not strictly necessary but we wanted to give the local machine a little breathing room in between starting up instances of engage-cmd.

Finally we actually invoke engage-cmd. We tell it that the mission we want is to be auto-generated based on MISSION_PARAMS by placing an "@" sign as the first character of the mission name. We also pass a few other parameters such as the certificate store to be used, the Rallypoint configuration file to use, and the "user's" identity. We also tell it we want it to be as quiet as possible with minimal logging so as to reduce the amount of I/O on the client machine (which we won't even see anyway). Very importantly, though, we pass in the name of the script that we want engage-cmd to run once it's up and chugging away. Here, the name of the script is lt.engagescript which we'll discuss in detail next.

Notice how we append the "&" instruction at the end of the command-line. That causes the shell to start engage-cmd in the background and return to the script right away.

After the loop exits, we keep the bash script running in the little while loop at the end. This is important because, if you run this whole thing as a daemon on Linux (we used systemd), when the bash script exits, systemd kills all its children - including all the copies of engage-cmd. Obviously we don't want that, so we keep the bash script running forever and kill it either with Ctrl-C if we're running in a terminal or by stopping the daemon.

#!/bin/bash


# We will default to this many users
RUNCOUNT=100

# Our Engage logging level
LOGLEVEL=2

function showUsage()
{
    echo "usage: ${0} run <count>|stop"
}

if [ "${1}" == "run" ]; then
        if [ "${2}" != "" ]; then
                RUNCOUNT=${2}
        fi

        # See if we're running on Amazon Web Services
        AMAZON_CHECK=`grep Amazon /etc/os-release 2> /dev/null`
        if [ "${AMAZON_CHECK}" == "" ]; then
                if [[ "$OSTYPE" == "linux"* ]]; then
                        ENGAGE_CMD=~/fastdev/engage/src/.build/linux_x64/artifacts/engage-cmd
                else
                        ENGAGE_CMD=/Global/github/engage/src/.build/darwin_x64/artifacts/engage-cmd
                fi
        else
                ENGAGE_CMD=./engage-cmd

                export ENGAGE_LOG_LEVEL=1
                export LD_LIBRARY_PATH=./
        fi

        INSTANCE_PASSPHRASE_PORTION=${RANDOM}

        for ((x = 1; x <= ${RUNCOUNT}; x++)); do
                # Decide on mission parameters for this client invocation
                RND=$((${RANDOM} % 100))

                # We'll assume that most people would use smaller missions while just a
                # few would use large missions
                if [ ${RND} -lt 61 ]; then              # 60% of people in single channel
                        CHANS=1
                elif [ ${RND} -lt 81 ]; then            # 20% on 4-channel missions
                        CHANS=4
                elif [ ${RND} -lt 91 ]; then            # 10% on 8-channel missions
                        CHANS=8
                elif [ ${RND} -lt 96 ]; then            # 5% on 16-channel missions
                        CHANS=16
                else                                    # 5% on 32-channel missions
                        CHANS=32
                fi

                # Now we need a passphrase with which we'll generate the mission
                PP=`hostname`.${INSTANCE_PASSPHRASE_PORTION}.${CHANS}

                # Setup the mission parameters for engage-cmd.  We won't be using crypto, multicast failover, or timelines
                MISSION_PARAMS=[${PP}][${CHANS}][nocrypto,nomulticastfailover,notimeline]

                # Throw together an identity for our "user"
                IDX=`printf "%04d" "${x}"`
                UA=LT${IDX}
                UI=LT${IDX}
                UD=LoadTestUser${IDX}

                # Give the machine some breathing room
                sleep 1

                # Run it!
                ${ENGAGE_CMD} -mission:@${MISSION_PARAMS} \
                                -ep:policy.json \
                                -rp:rp.json \
                                -cs:engage-default.certstore \
                                -useadad \
                                -adadmic:sweet-dreams-16khz.raw \
                                -ui:${UI} \
                                -ud:${UD} \
                                -ua:${UA} \
                                -ll:${LOGLEVEL} \
                                -quiet \
                                -script:lt.engagescript &
        done

        # Keeping the script running so that when we run under systemd our children (engage-cmd) don't get killed
        while [ true ]; do
                sleep 10
        done;
elif [ "${1}" == "stop" ]; then
        killall engage-cmd
else
        showUsage
fi
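Incidentally, here's the sort of systemd unit we'd use to run load.sh as a daemon. The unit name and paths shown are assumptions for illustration, not the actual files from our test:

```ini
# /etc/systemd/system/engage-load.service (hypothetical name and paths)
[Unit]
Description=Engage load test clients
After=network.target

[Service]
Type=simple
WorkingDirectory=/home/ec2-user/loadtest
ExecStart=/bin/bash /home/ec2-user/loadtest/load.sh run 100
ExecStop=/bin/bash /home/ec2-user/loadtest/load.sh stop
Restart=no

[Install]
WantedBy=multi-user.target
```

Stopping the service kills the bash script and, because the engage-cmd processes are its children, they go away too - which is exactly the behavior described above.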

The Engage-Cmd Script (lt.engagescript)

As described in other wiki articles, engage-cmd has a (rather primitive) scripting language that can be used to automate operations with the Engage Engine. The language is by no means sophisticated but suits our needs just fine and, because it's interpreted directly by engage-cmd, there's no need for external dependencies on things like Python, Ruby, JavaScript, etc. There's some basic support for variables, conditionals, code flow, and such. So, be warned that it may look a little odd.

So, let's dive into the script we used.

First off, engage-cmd declares some global variables that the script can reference for its purposes. In the lt script, we care about how many channels we have so that we know how many to create, how many to join, what to transmit on, and so on. These globals have names that start with "@MISSION_".

Starting out, we note that we want our user to be idle 75% of the time - the other 25% they'll be talking. Then we setup some variables we'll be using later on.

Once this is done, we create all the groups in our mission and proceed to join all of them. After this, we pause for between 1500 and 3500 milliseconds (note the "r2000" which indicates a random number between 0 and 2000) so that we stagger operations on the same machine across all the client instances.

Next, we get to our PTT loop. In each iteration of the loop we first determine which of the available audio channels we're going to transmit on. We do this by setting our channelIndex variable to a random number between 0 and the number of audio channels. Then we offset that index by the number of mission control (i.e. non-audio) groups we have to get the actual position in our mission configuration of the audio group we want.

It's very important here that the mission group list is structured such that all control groups come BEFORE the audio groups. Fortunately, the mission generator in Engage guarantees this, so we're safe.

Now that we know which channel we might transmit on, we want to determine if we actually should. We do this by setting a variable named txChance to a random number between 0 and 100. We use the result of this to determine if we're going to transmit on this round by seeing if txChance is less than the IDLE_PERCENTAGE we set at the top of the script. If it is indeed less, we skip this round for transmission.

Assuming txChance is greater than or equal to IDLE_PERCENTAGE, we're actually going to transmit. So we simply increment a counter which we'll display later and instruct Engage to begin transmitting on the channel at channelIndex.

Now, whether we transmitted in this round or skipped our turn, we'll stay in whatever state we're in for between 2 and 7 seconds (note the sleep 2000 followed by sleep r5000). Then we'll stop transmitting. (If we were transmitting, it will end. If we weren't transmitting, Engage will just ignore the request to stop transmitting.)

Finally, we display a message (which we might look at in the logs later), pause for between 10 and 15 seconds, and then go again.

# Engage Test Script

# NOTE: Global variables
#   @MISSION_OVERALL_GROUP_COUNT ........... count of all groups in the mission
#   @MISSION_AUDIO_GROUP_COUNT ............. count of audio groups
#   @MISSION_CONTROL_GROUP_COUNT ........... count of mission control groups
#   @MISSION_RAW_GROUP_COUNT ............... count of raw groups

set IDLE_PERCENTAGE 75

set countOfLoops 0
set countOfTx 0
set countOfSkip 0

# Get going
createupto ${@MISSION_OVERALL_GROUP_COUNT}
joinupto ${@MISSION_OVERALL_GROUP_COUNT}

# Wait for a little while for things to settle
sleep 1500

# Then, a little more time to give our CPU time to breathe
sleep r2000

:topOfLoop
    add countOfLoops 1

    # Determine a random channel index to transmit on.  What's very important
    # here is that the mission structure is such that all control groups come first
    # followed by audio groups, and then followed by raw groups and such.  So, our
    # random number is between 0 and @MISSION_AUDIO_GROUP_COUNT and then we offset
    # that by @MISSION_CONTROL_GROUP_COUNT to get the actual index position of the
    # audio group we'll be transmitting on.

    set channelIndex r${@MISSION_AUDIO_GROUP_COUNT}
    add channelIndex ${@MISSION_CONTROL_GROUP_COUNT}

    # Determine whether we will actually transmit at this point.  We do so
    # by getting a random number between 0 and 100 (txChance) ...
    set txChance r100

    # ... then, if txChance is less than our idle percentage we set
    # above, we skip this round
    onless txChance ${IDLE_PERCENTAGE} goto skipPtt

    # ... otherwise we'll transmit
    :doPtt
        add countOfTx 1
        begintx ${channelIndex}
        goto endOfLoop

    # Skip our turn
    :skipPtt
        add countOfSkip 1
        goto endOfLoop

    # Whether we transmitted or not, we'll end all TX, display some info and then wait
    # for up to 15 seconds before we go again
    :endOfLoop
        # Pause while transmitting (or not)
        sleep 2000
        sleep r5000
        endtx (allgroups)

        message.info Loops=${countOfLoops}, TX=${countOfTx}, Skip=${countOfSkip}

        # Take a breather
        sleep 10000
        sleep r5000

        # Go again
        goto topOfLoop


# We will never get here!