Load tests

We primarily use Gatling to run our performance tests. The tests live in this repo: https://github.com/alphagov/notifications-performance-tests.

Normally, we run load tests on staging, from an EC2 instance (so that we know rates aren't limited by local laptop CPU/Memory/Network constraints) using live API keys. We sometimes send real messages, and sometimes we use provider stubs so that we don't have to pay or worry about deliverability stats.

Making Staging ready for load tests

Normally for load tests, we make Staging look as close to live as possible. If you're only doing a low-volume test - say 100 requests per second - the only change you need is switching to stubs for email and SMS; the other options can be left alone.

For high volume load tests - 800+ requests per second - you will need to enable all options.

Pause active staging pipelines

From notifications-aws run:

make load-pause-pipelines

Wait for any running pipelines to finish before continuing. To check on them, run:

make load-check-pipelines

Convert the environment for load testing

From notifications-aws:

  • Ensure your local checkout of the repo is up to date and on the main branch.
  • Check there are no pending changes to staging lingering in your workspace.
cd terraform/notify-apps
gds aws notify-staging-admin -- make staging plan

The plan should be empty.

  • Apply load testing profiles to tfvars for staging
cd terraform/notify-apps
gds aws notify-staging-admin -- make staging profile-load apply

profile-load includes all the following options:

  • [profile-prod-like](https://github.com/alphagov/notifications-aws/blob/main/terraform/tfvars/profiles/prod-scaling.tfvars): Changes scaling options to be more production-like.
  • [profile-sentry-disable](https://github.com/alphagov/notifications-aws/blob/main/terraform/tfvars/profiles/disable-sentry.tfvars): Disables Sentry, so that we don't use up our staging Sentry budget while load testing.
  • [profile-enable-stubs](https://github.com/alphagov/notifications-aws/blob/main/terraform/tfvars/profiles/enable-stubs.tfvars): Enables stubs for email (SES) and SMS messages.

These can be applied individually if you don't want to do everything (e.g. for a low-volume test). You should certainly enable the stubs for a test of any volume.

Verify the stubs are working

After the stubs are enabled, consider sending a few notifications manually and checking that the stubs receive the calls from Notify (i.e. that a misconfiguration isn't causing notifications to be sent to the real providers). You can do this through the Notify web interface while tailing the provider logs in the AWS console. You could also run a short test of 1 rps for 60 seconds.

Disable AWS WAF/Shield DDoS mitigation

We have protection configured on CloudFront, which sits in front of our API, to automatically detect and mitigate DoS/DDoS attacks. This can result in CloudFront returning 403s if there is a sudden spike in request volume. If you are running high-volume load tests (800+ requests/second) then you may want to disable this mitigation temporarily.

In the notifications-aws repo, run gds aws notify-staging-admin -- ./scripts/ddos-protection/ddos-protection.sh, select the API CloudFront instance and choose 'Disable'.

Actually running a load test

Get access to the load test box

  • Log in to AWS staging: gds aws notify-staging -l
  • Find the "Load test" instance in the EC2 dashboard
  • Copy the public DNS name: export LOAD_TEST_BOX="<DNS>"
  • Alternatively run:
    LOAD_TEST_BOX=$(gds aws notify-staging -- aws ec2 describe-instances --filters Name=tag:Name,Values="Load Test" | jq '.Reservations[].Instances[].PublicDnsName' -r)
  • Run the following in a local shell
    notify-pass show credentials/staging/performance_tests_ssh_key > ~/.ssh/id_perf_tests
    chmod 600 ~/.ssh/id_perf_tests
    ssh -i ~/.ssh/id_perf_tests -l ubuntu $LOAD_TEST_BOX
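
The jq expression above prints one line per matching instance, so LOAD_TEST_BOX can come back empty (for example if the instance is stopped and has no public DNS name) or hold several hosts. A minimal guard, as a sketch; require_single_host is a hypothetical helper name, not something that exists in notifications-aws:

```shell
# Hypothetical helper: succeeds only when its argument holds exactly one
# non-empty line, i.e. exactly one hostname.
require_single_host() {
  # grep -c exits non-zero when nothing matches, hence the "|| true".
  count=$(printf '%s\n' "$1" | grep -c . || true)
  if [ "$count" -ne 1 ]; then
    echo "expected exactly one load test host, found $count" >&2
    return 1
  fi
}

# Usage (not run here):
#   require_single_host "$LOAD_TEST_BOX" && \
#     ssh -i ~/.ssh/id_perf_tests -l ubuntu "$LOAD_TEST_BOX"
```

The check only counts lines; it does not verify that the host is reachable.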

Configure and run a simulation

The simulation needs a simulation.conf file in order to run. You can copy and edit the default config or copy an existing profile on the machine (if there is one) e.g. cp simulation_uk_gov.conf simulation.conf. If you're making a new profile you'll need to populate all the missing values manually.

If you want to run the test against ECS only, add use-ecs-apps = true under notify -> basic in the simulation config file.

If you're running a long load test then you may lose connection to the box. To ensure the load tests keep running, use screen e.g.

screen
... # normal load test stuff
Ctrl+A Ctrl+D # background process

screen -x # come back later on
exit # quit the screen
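
A named session can be easier to find again when more than one screen is running on the box. A sketch using screen's -dmS option (start detached, with a session name) and -r (reattach); the function name and the session name loadtest are placeholders:

```shell
# Hypothetical wrapper: start a command in a detached, named screen session.
#   $1: session name
#   $2: command line to run inside the session
start_in_screen() {
  screen -dmS "$1" sh -c "$2"
}

# Usage (not run here):
#   start_in_screen loadtest '<your usual load test command>'
#   screen -ls            # list sessions
#   screen -r loadtest    # reattach; detach again with Ctrl+A D
```

A session started this way closes when the command exits, so redirect its output to a file if you need to read it afterwards.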

You can also schedule load tests using cron: run crontab -e from the load test box to see what is currently scheduled. Note: cron passes everything after a % to the command as standard input, so the %4 at the end of each line answers the interactive prompt by selecting the 4th option.

Viewing results about requests

After a performance test has finished, a timestamped folder is created containing an HTML report. Download the folder by running the following command from a local bash shell (not inside the SSH session).

scp -i ~/.ssh/id_perf_tests -r ubuntu@$LOAD_TEST_BOX:~/notifications-performance-tests/results/$RESULTS_FOLDER ~/Downloads/perf-test-results
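
RESULTS_FOLDER in the command above has to be set to the name of the timestamped folder. One way to pick the most recent one, as a sketch: it assumes the most recently modified folder is the run you want, and latest_results_folder is a hypothetical helper name.

```shell
# Hypothetical helper: print the most recently modified entry in a directory.
#   $1: directory holding the timestamped results folders
latest_results_folder() {
  ls -1t "$1" | head -n 1
}

# Usage from a local shell (not run here); the same pipeline runs on the
# load test box over ssh:
#   RESULTS_FOLDER=$(ssh -i ~/.ssh/id_perf_tests ubuntu@$LOAD_TEST_BOX \
#     'ls -1t ~/notifications-performance-tests/results | head -n 1')
```

If someone else has run a test since yours, check the folder name by hand instead.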

Measuring Celery performance

Gatling only measures the duration of the requests it makes to the API. You can use the DB to understand the impact of a change on, for example, the time it takes to create, send and mark a notification as delivered.

select
  count(*),
  round(sum(case when time <= 1 then 1 else 0 end)::decimal / count(*), 2) * 100 as under_1,
  round(sum(case when time <= 3 then 1 else 0 end)::decimal / count(*), 2) * 100 as under_3,
  round(sum(case when time <= 5 then 1 else 0 end)::decimal / count(*), 2) * 100 as under_5,
  round(sum(case when time <= 10 then 1 else 0 end)::decimal / count(*), 2) * 100 as under_10,
  round(sum(case when time <= 30 then 1 else 0 end)::decimal / count(*), 2) * 100 as under_30,
  round(sum(case when time <= 60 then 1 else 0 end)::decimal / count(*), 2) * 100 as under_60
from (
  select extract (epoch from (updated_at - created_at)) as time
  from notifications
  where service_id='1a9d1d3d-f13f-434c-a135-cc30daebf5dd'
  and created_at >= '2021-10-27 14:00' and created_at <= '2021-10-27 15:00'
  and notification_status='delivered'
) as subquery;
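
The service ID and time window in the query above belong to one past test and need replacing for each run. A sketch of a shell helper that prints the same query with your own values, ready to paste or pipe into whichever database client you use; latency_sql is a hypothetical name, and the arguments are interpolated as-is, so only pass values you trust:

```shell
# Hypothetical helper: print the latency query for one service and time window.
#   $1: service ID
#   $2: window start, $3: window end (same timestamp format as the query above)
latency_sql() {
  echo "select"
  echo "  count(*)"
  # One percentage column per threshold, in seconds.
  for s in 1 3 5 10 30 60; do
    echo "  , round(sum(case when time <= $s then 1 else 0 end)::decimal / count(*), 2) * 100 as under_$s"
  done
  cat <<EOF
from (
  select extract (epoch from (updated_at - created_at)) as time
  from notifications
  where service_id='$1'
  and created_at >= '$2' and created_at <= '$3'
  and notification_status='delivered'
) as subquery;
EOF
}

# Usage (not run here):
#   latency_sql "$SERVICE_ID" '2021-10-27 14:00' '2021-10-27 15:00'
```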

Rolling Back

🚨 Ensure all traffic has finished processing (including making sure any sender queues are empty) before rolling back! If you skip this you can end up costing us a lot of money and hitting our SMS limits. 🚨

Check SQS depths in the AWS console to ensure they are all 0.
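
The same check can be scripted with the AWS CLI. A sketch, assuming the aws command is on PATH with staging credentials in the environment (for example by running the script under gds aws notify-staging --); nonempty_queues is a hypothetical helper name, and delayed messages are not counted:

```shell
# Sketch: print "<queue url> <message count>" for every SQS queue that still
# holds messages, and return non-zero if any queue is not empty.
nonempty_queues() {
  found=0
  for url in $(aws sqs list-queues --query 'QueueUrls[]' --output text); do
    # Visible plus in-flight messages, printed as two tab-separated numbers.
    counts=$(aws sqs get-queue-attributes --queue-url "$url" \
      --attribute-names ApproximateNumberOfMessages ApproximateNumberOfMessagesNotVisible \
      --query 'Attributes.[ApproximateNumberOfMessages, ApproximateNumberOfMessagesNotVisible]' \
      --output text)
    total=$(printf '%s\n' "$counts" | awk '{ print $1 + $2 }')
    if [ "$total" -ne 0 ]; then
      echo "$url $total"
      found=1
    fi
  done
  return "$found"
}

# Usage (not run here):
#   nonempty_queues && echo "all queues empty, safe to roll back"
```

The SQS counts are approximate, so it is worth running the check more than once before rolling back.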

Manually unpause the notify infra pipeline for staging. Then start a new deployment by choosing "re-run with same inputs" on the last build.

Unpause the rest of the pipelines

From notifications-aws run:

make load-unpause-pipelines

Re-enable AWS WAF/Shield DDoS Mitigation

In the notifications-aws repo, run gds aws notify-staging-admin -- ./scripts/ddos-protection/ddos-protection.sh, select the API CloudFront instance and choose 'Enable' (Block).
