Load tests - alphagov/notifications-manuals GitHub Wiki
We primarily use Gatling to run our performance tests. The tests live in this repo: https://github.com/alphagov/notifications-performance-tests.
Normally, we run load tests on staging, from an EC2 instance (so that we know rates aren't limited by local laptop CPU/Memory/Network constraints) using live API keys. We sometimes send real messages, and sometimes we use provider stubs so that we don't have to pay or worry about deliverability stats.
Normally for load tests, we make staging look as close to live as possible. If you're only doing a low-volume test (say, 100 requests per second), the only option you need to enable is the switch to stubs for email and SMS.
For high-volume load tests (800+ requests per second), you will need to enable all options.
From notifications-aws run:
make load-pause-pipelines
Ensure you wait for any running pipelines to finish. You can run:
make load-check-pipelines
From notifications-aws:
- Ensure your GitHub repo is up-to-date and on the main branch.
- Check there are no pending changes to staging lingering in your workspace.
cd terraform/notify-apps
gds aws notify-staging-admin -- make staging plan
The plan should be empty.
- Apply load testing profiles to tfvars for staging
cd terraform/notify-apps
gds aws notify-staging-admin -- make staging profile-load apply
profile-load includes all the following options:
- [profile-prod-like](https://github.com/alphagov/notifications-aws/blob/main/terraform/tfvars/profiles/prod-scaling.tfvars): Changes scaling options to be more production-like.
- [profile-sentry-disable](https://github.com/alphagov/notifications-aws/blob/main/terraform/tfvars/profiles/disable-sentry.tfvars): Disables Sentry. Ensures we don't use all our staging Sentry budget while load testing.
- [profile-enable-stubs](https://github.com/alphagov/notifications-aws/blob/main/terraform/tfvars/profiles/enable-stubs.tfvars): Enables stubs for SES and SMS messages.
These can be applied individually if you don't want to do everything (e.g. for a low-volume load test). You should always enable stubs, whatever the volume.
Consider sending some manual notifications after the stubs are enabled, to make sure the stubs are receiving calls from Notify (i.e. there isn't a misconfiguration that is causing notifications to be sent to the real providers). You can do this through the Notify web interface while tailing the provider logs in the AWS console. You could also run a short test of 1 rps for 60 seconds.
We have protection configured on CloudFront, which sits in front of our API, to automatically detect and mitigate DoS/DDoS attacks. This can result in CloudFront returning 403s if there are sudden spikes in request volumes. If you are running high-volume load tests (800+ requests per second) then you may want to disable this mitigation temporarily.
In the notifications-aws repo, run gds aws notify-staging-admin -- ./scripts/ddos-protection/ddos-protection.sh, select the API CloudFront instance and choose 'Disable'.
- Log in to AWS staging
gds aws notify-staging -l
- Find the "Load test" instance in the EC2 dashboard
- Copy the public DNS name:
export LOAD_TEST_BOX="<DNS>"
- Alternatively run:
LOAD_TEST_BOX=$(gds aws notify-staging -- aws ec2 describe-instances --filters Name=tag:Name,Values="Load Test" | jq '.Reservations[].Instances[].PublicDnsName' -r)
- Run the following in a local shell
notify-pass show credentials/staging/performance_tests_ssh_key > ~/.ssh/id_perf_tests
chmod 600 ~/.ssh/id_perf_tests
ssh -i ~/.ssh/id_perf_tests -l ubuntu "$LOAD_TEST_BOX"
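Before connecting, it can be worth checking that the lookup actually produced a hostname: jq -r prints null for a missing field, and a stopped instance has an empty public DNS name. A minimal sketch of such a guard follows; the function name require_load_test_box is made up for illustration and is not part of any Notify repo.

```shell
# Hypothetical helper: fail early if LOAD_TEST_BOX is unset, empty, or "null".
require_load_test_box() {
  case "${LOAD_TEST_BOX:-}" in
    ""|null)
      echo "LOAD_TEST_BOX is not set; is the Load Test instance running?" >&2
      return 1
      ;;
    *)
      echo "Load test box: $LOAD_TEST_BOX"
      ;;
  esac
}

# Usage, in the same shell where you exported LOAD_TEST_BOX:
#   require_load_test_box && ssh -i ~/.ssh/id_perf_tests -l ubuntu "$LOAD_TEST_BOX"
```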
The simulation needs a simulation.conf file in order to run. You can copy and edit the default config, or copy an existing profile on the machine (if there is one), e.g. cp simulation_uk_gov.conf simulation.conf. If you're making a new profile you'll need to populate all the missing values manually.
If you want to run the test against ECS only, add use-ecs-apps = true under notify->basic in the simulation config file.
If you're running a long load test then you may lose connection to the box. To ensure the load tests keep running, use screen, e.g.
screen
... # normal load test stuff
Ctrl+A Ctrl+D # background process
screen -x # come back later on
exit # quit the screen
You can also schedule load tests using cron: run crontab -e from the load test box to see what's scheduled. Note: the %4 at the end of each line makes cron pass 4 to the command on standard input, which selects the 4th option in the interactive prompt.
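The %4 trick relies on standard cron behaviour: an unescaped % in a crontab command becomes a newline, and everything after the first % is sent to the command as standard input. The sketch below reproduces that outside cron, using a made-up stand-in function (pick_simulation) in place of the real interactive script.

```shell
# Hypothetical stand-in for an interactive script that asks which option to run.
pick_simulation() {
  read -r choice
  echo "running option $choice"
}

# A crontab command ending in %4 is run with "4" on stdin,
# which is equivalent to piping the answer in by hand:
printf '4\n' | pick_simulation
```

The same pipe is handy for running a test by hand without answering the prompt.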
After a performance test is finished, a timestamped folder is created containing an HTML-based report. Download the folder by running the following command from a local bash shell (not inside the SSH session).
scp -i ~/.ssh/id_perf_tests -r ubuntu@$LOAD_TEST_BOX:~/notifications-performance-tests/results/$RESULTS_FOLDER ~/Downloads/perf-test-results
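The scp command above expects RESULTS_FOLDER to be set to the name of the folder you want. One way to find the most recent one is to sort the results directory by modification time. This is a sketch only: the helper name newest_entry is made up, and it assumes the folder you want is the one most recently written.

```shell
# Hypothetical helper: print the most recently modified entry in a directory.
newest_entry() {
  ls -t "$1" | head -n 1
}

# On the load test box:
#   newest_entry ~/notifications-performance-tests/results
#
# Or from your laptop in one step, reusing the ssh key and user from earlier:
#   RESULTS_FOLDER=$(ssh -i ~/.ssh/id_perf_tests -l ubuntu "$LOAD_TEST_BOX" \
#     'ls -t ~/notifications-performance-tests/results | head -n 1')
```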
Gatling only measures the duration of the requests it makes to the API. You can use the DB to understand the impact of a change on, for example, the time it takes to create, send and mark a notification as delivered.
select
count(*),
round(sum(case when time <= 1 then 1 else 0 end)::decimal / count(*), 2) * 100 as under_1,
round(sum(case when time <= 3 then 1 else 0 end)::decimal / count(*), 2) * 100 as under_3,
round(sum(case when time <= 5 then 1 else 0 end)::decimal / count(*), 2) * 100 as under_5,
round(sum(case when time <= 10 then 1 else 0 end)::decimal / count(*), 2) * 100 as under_10,
round(sum(case when time <= 30 then 1 else 0 end)::decimal / count(*), 2) * 100 as under_30,
round(sum(case when time <= 60 then 1 else 0 end)::decimal / count(*), 2) * 100 as under_60
from (
select extract (epoch from (updated_at - created_at)) as time
from notifications
where service_id='1a9d1d3d-f13f-434c-a135-cc30daebf5dd'
and created_at >= '2021-10-27 14:00' and created_at <= '2021-10-27 15:00'
and notification_status='delivered'
) as subquery;
🚨 Ensure all traffic has finished processing (including making sure any sender queues are empty) before rolling back! If you skip this you can end up costing us lots of money and hitting our SMS limits. 🚨
Check SQS depths in the AWS console to ensure they are all 0.
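If you would rather check from a terminal than click through every queue, the AWS CLI can report the same numbers. The following is a rough sketch, not an existing Notify script: the function name nonempty_queues is made up, and it assumes the aws CLI is installed and the shell already has staging credentials. It counts visible, in-flight and delayed messages for each queue and prints only the queues that are not empty.

```shell
# Hypothetical helper: print "<depth> <queue-url>" for every non-empty queue.
# Depth = visible + in-flight (not visible) + delayed messages.
nonempty_queues() {
  for url in $(aws sqs list-queues --query 'QueueUrls[]' --output text); do
    depth=$(aws sqs get-queue-attributes --queue-url "$url" \
      --attribute-names ApproximateNumberOfMessages \
        ApproximateNumberOfMessagesNotVisible \
        ApproximateNumberOfMessagesDelayed \
      --query 'Attributes.[ApproximateNumberOfMessages, ApproximateNumberOfMessagesNotVisible, ApproximateNumberOfMessagesDelayed]' \
      --output text | awk '{ print $1 + $2 + $3 }')
    if [ "$depth" -gt 0 ]; then
      echo "$depth $url"
    fi
  done
}

# Usage: no output means every queue is at 0.
#   nonempty_queues
```

SQS counts are approximate, so it is safer to see an empty result a few times in a row before rolling back.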
Manually unpause the notify infra pipeline for staging. Start a new deployment with the "re-run with same inputs" of the last build.
From notifications-aws run:
make load-unpause-pipelines
In the notifications-aws repo, run gds aws notify-staging-admin -- ./scripts/ddos-protection/ddos-protection.sh, select the API CloudFront instance and choose 'Enable' (Block).