Load tests - alphagov/notifications-manuals GitHub Wiki

Normally, we run load tests on staging, from an EC2 instance (so that we know rates aren't limited by local laptop CPU/Memory/Network constraints) using live API keys. We sometimes send real messages, and sometimes we use provider stubs so that we don't have to pay or worry about deliverability stats.

We primarily use Gatling to run our performance tests. It is installed on the staging performance test EC2 instance. We have a repo here: https://github.com/alphagov/notifications-performance-tests. The readme has instructions on how to set up and configure the load testing tool.

Making Staging ready for load tests

Normally for load tests, we make Staging look as close to live as possible. If you're only doing a low volume test - say 100 requests per second - you don't need to enable everything; switching to stubs for email and SMS is enough.

For high volume load tests - 800+ requests per second - you will need to enable all options.

Set up your testing org

The provider stubs mimic the API endpoints used by the Notify SMS and email providers. Once they are enabled, the Notify instance you are testing can no longer send emails or SMS to external recipients, so you will be unable to set up new users or sign in to the admin interface using an SMS or email-based 2FA code.

This means it is important to fully configure the test environment before enabling the stub providers. Log in to the platform and set up the load test organisation, user, template, and API key required for the test. We also recommend having platform admin rights and configuring your YubiKey as a second factor, so you can still log in to the admin app during a test if needed. The joiners process has more information on how to do this.

Pause active staging pipelines

From notifications-aws run:

Manually pause the staging deployment in Concourse.

Convert the environment for load testing

From notifications-aws:

Ensure your GitHub repo is up to date and on the main branch.
Check there are no pending changes to staging lingering in your workspace.

cd terraform/notify-apps
gds aws notify-staging-admin -- make staging plan

The plan should be empty.
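
Rather than reading the plan by eye, the "empty plan" check can be scripted by looking for Terraform's "No changes." summary in the output. This is a minimal sketch, assuming the make target passes the raw Terraform plan output through to stdout; plan_is_empty is a hypothetical helper, not something that exists in notifications-aws.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds only if the text on stdin contains
# Terraform's "No changes." summary.
plan_is_empty() {
  grep -q 'No changes\.'
}

# Example use (assumes the make target prints the raw Terraform plan):
#   gds aws notify-staging-admin -- make staging plan | plan_is_empty \
#     && echo "staging is clean" || echo "pending changes, stop here"
```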

Apply load testing profiles to tfvars for staging

cd terraform/notify-apps
gds aws notify-staging-admin -- make staging profile-load apply

profile-load includes all the following options:

[profile-prod-like](https://github.com/alphagov/notifications-aws/blob/main/terraform/tfvars/profiles/prod-scaling.tfvars): Changes scaling options to be more production-like.
[profile-sentry-disable](https://github.com/alphagov/notifications-aws/blob/main/terraform/tfvars/profiles/disable-sentry.tfvars): Disables Sentry. Ensures we don't use all our staging Sentry budget while load testing.
[profile-enable-stubs](https://github.com/alphagov/notifications-aws/blob/main/terraform/tfvars/profiles/enable-stubs.tfvars): Enables stubs for SES (email) and SMS messages.

These can be applied individually if you don't want to do everything (e.g. for a low volume load test). You should always enable stubs, whatever the volume. To enable just the provider stubs on their own, see the [Enabling Stub Providers](https://github.com/alphagov/notifications-manuals/wiki/Stub-providers) page in the team manual.

Verify the stubs are working

After the stubs are enabled, consider sending some notifications manually and checking that the stubs are receiving calls from Notify (i.e. that a misconfiguration isn't causing notifications to go to the real providers). You can do this by sending through the Notify web interface and tailing the provider logs in the AWS console. You could also run a short test of 1 rps for 60 seconds.

Running the tests without verifying that you are using the stubs can cost a lot of money, depending on the number of SMS/email messages sent. BE CAREFUL.

Disable AWS WAF/Shield DDoS mitigation

We have protection configured on CloudFront, which sits in front of our API, to automatically detect and mitigate DoS/DDoS attacks. This can result in CloudFront returning 403s if there are sudden spikes in request volume. If you are running high-volume load tests (800+ requests/second) then you may want to disable this mitigation temporarily.

In the notifications-aws repo, run gds aws notify-staging-admin -- ./scripts/ddos-protection/ddos-protection.sh, select the API CloudFront instance and choose 'Disable'.

Actually running a load test

Get access to the load test box

Log in to AWS staging: gds aws notify-staging -l
Find the "Load test" instance in the EC2 dashboard
Copy the public DNS name: export LOAD_TEST_BOX="<DNS>"

Alternatively run:

LOAD_TEST_BOX=$(gds aws notify-staging -- aws ec2 describe-instances --filters Name=tag:Name,Values="Load Test" | jq '.Reservations[].Instances[].PublicDnsName' -r)

Run the following in a local shell

notify-pass show credentials/staging/performance_tests_ssh_key > ~/.ssh/id_perf_tests
chmod 600 ~/.ssh/id_perf_tests
ssh -i ~/.ssh/id_perf_tests -l ubuntu $LOAD_TEST_BOX

Configure and run a simulation

The simulation needs a simulation.conf file in order to run. You can copy and edit the default config or copy an existing profile on the machine (if there is one) e.g. cp simulation_uk_gov.conf simulation.conf. If you're making a new profile you'll need to populate all the missing values manually.

If you want to run the test against ECS only, add use-ecs-apps = true under notify -> basic in the simulation config file.

If you're running a long load test then you may lose connection to the box. To ensure the load tests keep running, use screen e.g.

screen
... # normal load test stuff
Ctrl+A Ctrl+D # background process

screen -x # come back later on
exit # quit the screen

You can also schedule load tests using cron: run crontab -e from the load test box to see what's going on. Note: the %4 at the end of each line feeds 4 to the command's standard input, which selects the 4th option in the interactive prompt.
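
To see the %4 behaviour outside cron, the sketch below pipes the same input into a stand-in prompt. In a crontab entry, an unescaped % ends the command and the text after it is sent to the command on standard input. pick_option is a hypothetical stand-in for the interactive menu, not the real load test script.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for an interactive menu: reads one line from
# stdin and reports which option was chosen.
pick_option() {
  local choice
  read -r choice
  echo "selected option ${choice}"
}

# A crontab entry ending in "%4" behaves like piping "4" into the command.
printf '4\n' | pick_option   # prints "selected option 4"
```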

Viewing results about requests

After a performance test has finished, a timestamped folder is created containing an HTML-based report. Download the folder by running the following command from a local bash shell (not inside the SSH session).

scp -i ~/.ssh/id_perf_tests -r ubuntu@$LOAD_TEST_BOX:~/notifications-performance-tests/results/$RESULTS_FOLDER ~/Downloads/perf-test-results
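
The scp command expects RESULTS_FOLDER to be set already. A minimal sketch for picking the most recently modified report folder, assuming the newest entry under results/ is the run you want; newest_entry is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the name of the most recently modified
# entry in the directory given as $1.
newest_entry() {
  ls -t "$1" | head -n 1
}

# Example use against the load test box. The quoted remote command is
# expanded on the box, not locally:
#   RESULTS_FOLDER=$(ssh -i ~/.ssh/id_perf_tests -l ubuntu "$LOAD_TEST_BOX" \
#     'ls -t ~/notifications-performance-tests/results | head -n 1')
```

If several tests ran back to back, check the folder name before downloading, because the newest folder may not be the run you care about.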

Measuring Celery performance

Gatling only measures the duration of the requests it makes to the API. You can use the DB to understand the impact of a change on other things, e.g. the time it takes to create, send and mark a notification as delivered.

-- Percentage of delivered notifications that went from created to
-- delivered within N seconds. Replace the service ID and the time window
-- with the values for your load test.
select
  count(*),
  round(sum(case when time <= 1 then 1 else 0 end)::decimal / count(*), 2) * 100 as under_1,
  round(sum(case when time <= 3 then 1 else 0 end)::decimal / count(*), 2) * 100 as under_3,
  round(sum(case when time <= 5 then 1 else 0 end)::decimal / count(*), 2) * 100 as under_5,
  round(sum(case when time <= 10 then 1 else 0 end)::decimal / count(*), 2) * 100 as under_10,
  round(sum(case when time <= 30 then 1 else 0 end)::decimal / count(*), 2) * 100 as under_30,
  round(sum(case when time <= 60 then 1 else 0 end)::decimal / count(*), 2) * 100 as under_60
from (
  -- seconds between creation and the last status update
  select extract (epoch from (updated_at - created_at)) as time
  from notifications
  where service_id='1a9d1d3d-f13f-434c-a135-cc30daebf5dd'
  and created_at >= '2021-10-27 14:00' and created_at <= '2021-10-27 15:00'
  and notification_status='delivered'
) as subquery;
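
To make the output columns concrete: each under_N value is the percentage of delivered notifications whose gap between created_at and updated_at was at most N seconds. The sketch below applies the same arithmetic to a list of durations with awk. The durations are invented for illustration and are not real results, and percent_within is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Print the percentage (rounded to a whole number) of durations, given
# one per line on stdin in seconds, that are at most $1 seconds.
# Approximates one under_N column of the query.
percent_within() {
  awk -v limit="$1" '
    { total++; if ($1 <= limit) within++ }
    END { if (total > 0) printf "%.0f\n", (within / total) * 100 }
  '
}

# Invented durations for four notifications: 0.4s, 2s, 2.5s and 12s.
# Three of the four took 3 seconds or less.
printf '0.4\n2\n2.5\n12\n' | percent_within 3   # prints 75
```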

Rolling Back

🚨 Ensure all traffic has finished processing (including making sure any sender queues are empty) before rolling back! If you skip this, you can end up costing us lots of money and hitting our SMS limits. 🚨

🚨 Check SQS depths in the AWS console to ensure they are all 0. 🚨
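
The queue depth check can also be scripted. This is a sketch, assuming the AWS CLI is reachable through the same gds aws notify-staging wrapper used to find the load test box, and that every queue returned by list-queues matters for the rollback; all_zero and print_queue_depths are hypothetical helpers. It only counts visible messages (ApproximateNumberOfMessages), so in-flight and delayed messages still need checking in the console.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds only if stdin has at least one line and
# every line is 0.
all_zero() {
  awk '{ seen = 1; if ($1 != 0) bad = 1 } END { exit (seen && !bad) ? 0 : 1 }'
}

# Hypothetical helper: print the approximate number of visible messages
# for each SQS queue in the account, one number per line.
print_queue_depths() {
  local url
  for url in $(gds aws notify-staging -- aws sqs list-queues \
      --query 'QueueUrls[]' --output text); do
    gds aws notify-staging -- aws sqs get-queue-attributes \
      --queue-url "$url" \
      --attribute-names ApproximateNumberOfMessages \
      --query 'Attributes.ApproximateNumberOfMessages' --output text
  done
}

# Example use:
#   print_queue_depths | all_zero && echo "queues empty" || echo "still draining"
```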

Manually unpause the notify infra pipeline for staging. Start a new deployment by choosing "re-run with same inputs" on the last build.

Unpause the rest of the pipelines.