Load tests - alphagov/notifications-manuals GitHub Wiki
Normally, we run load tests on staging, from an EC2 instance (so that we know rates aren't limited by local laptop CPU/Memory/Network constraints) using live API keys. We sometimes send real messages, and sometimes we use provider stubs so that we don't have to pay or worry about deliverability stats.
We primarily use Gatling to run our performance tests; it is installed on the staging performance test EC2 instance. We have a repo here: https://github.com/alphagov/notifications-performance-tests. The readme has instructions on how to set up and configure the load testing tool.
Normally for load tests, we make staging look as close to live as possible. If you're only doing a low-volume test (say, 100 requests per second), you don't need to enable everything: switching to the stubs for email and SMS is enough.
For high volume load tests - 800+ requests per second - you will need to enable all options.
The provider stubs mimic the API endpoints used by the Notify SMS and email providers. Once they are enabled, the Notify instance you are testing can no longer send emails or SMS to external recipients, so you will be unable to set up new users or sign in to the admin interface using an SMS- or email-based 2FA code.
This means it is important to fully configure the test environment before enabling the stub providers. Log in to the platform and set up the load test organisation, user, template, and API key required for the test. It is also recommended to have platform admin rights and to configure your YubiKey as a second factor (so you can use it to log in to the admin app during a test if needed). More information on how to do this can be found in the joiners process.
From notifications-aws run:
Manually pause the staging deployment from Concourse
From notifications-aws:
- Ensure your GitHub repo is up to date and on the main branch.
- Check there are no pending changes to staging lingering in your workspace.
cd terraform/notify-apps
gds aws notify-staging-admin -- make staging plan
The plan should be empty.
- Apply load testing profiles to tfvars for staging
cd terraform/notify-apps
gds aws notify-staging-admin -- make staging profile-load apply
profile-load includes all the following options:
- [profile-prod-like](https://github.com/alphagov/notifications-aws/blob/main/terraform/tfvars/profiles/prod-scaling.tfvars): Changes scaling options to be more production-like.
- [profile-sentry-disable](https://github.com/alphagov/notifications-aws/blob/main/terraform/tfvars/profiles/disable-sentry.tfvars): Disables Sentry. Ensures we don't use all our staging Sentry budget while load testing.
- [profile-enable-stubs](https://github.com/alphagov/notifications-aws/blob/main/terraform/tfvars/profiles/enable-stubs.tfvars): Enables stubs for SES and SMS messages.
These can be applied individually if you don't want to do everything (e.g. for a low-volume load test). You should certainly enable stubs for any kind of volume. To enable just the provider stubs, see the [Enabling Stub Providers](https://github.com/alphagov/notifications-manuals/wiki/Stub-providers) page in the team manual.
Consider sending some manual notifications after the stubs are enabled to make sure the stubs are receiving calls from Notify (i.e. there isn't a misconfiguration that is causing notifications to be sent to the real providers). You can do this by sending through the Notify web interface and tailing the provider logs in the AWS console. You could also run a short test of 1 rps for 60 seconds.
Running the tests without verifying that you are using the stubs can cost a lot of money, depending on the number of SMS/email messages sent. BE CAREFUL.
We have protection configured on CloudFront, which sits in front of our API, to automatically detect and mitigate DoS/DDoS attacks. This can result in CloudFront returning 403s if there are sudden spikes in request volumes. If you are running high-volume load tests (800+ requests/second) then you may want to disable this mitigation temporarily.
In the notifications-aws repo, run gds aws notify-staging-admin -- ./scripts/ddos-protection/ddos-protection.sh, select the API CloudFront instance and choose 'Disable'.
- Log in to AWS staging
gds aws notify-staging -l
- Find the "Load test" instance in the EC2 dashboard
- Copy the public DNS name:
export LOAD_TEST_BOX="<DNS>"
- Alternatively run:
LOAD_TEST_BOX=$(gds aws notify-staging -- aws ec2 describe-instances --filters Name=tag:Name,Values="Load Test" | jq '.Reservations[].Instances[].PublicDnsName' -r)
- Run the following in a local shell
notify-pass show credentials/staging/performance_tests_ssh_key > ~/.ssh/id_perf_tests
chmod 600 ~/.ssh/id_perf_tests
ssh -i ~/.ssh/id_perf_tests -l ubuntu $LOAD_TEST_BOX
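The jq lookup above can return an empty string, the literal text null, or several hostnames if more than one instance matches the filter. A minimal sketch of a guard to run before connecting; the function name valid_load_test_box is hypothetical and not part of any repo:

```shell
# Hypothetical helper (not part of any repo): succeed only if the value
# looks like exactly one hostname.
valid_load_test_box() {
  local value="$1"
  # Reject an empty value and jq's literal "null".
  if [ -z "$value" ] || [ "$value" = "null" ]; then
    return 1
  fi
  # Reject anything containing whitespace, e.g. two hostnames on two lines.
  case "$value" in
    *[[:space:]]*) return 1 ;;
  esac
  return 0
}

# Usage, after setting LOAD_TEST_BOX as described above:
#   valid_load_test_box "$LOAD_TEST_BOX" || echo "LOAD_TEST_BOX is not a single hostname" >&2
```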
The simulation needs a simulation.conf file in order to run. You can copy and edit the default config, or copy an existing profile on the machine (if there is one), e.g. cp simulation_uk_gov.conf simulation.conf. If you're making a new profile you'll need to populate all the missing values manually.
If you want to run the test against ECS only, add use-ecs-apps = true under notify->basic in the simulation config file.
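Before starting a run, it can be worth confirming that the setting is actually present in the file. A minimal sketch, assuming the setting appears on a line of its own; the function name ecs_only_enabled is hypothetical, and the check does not verify that the line sits under notify->basic:

```shell
# Hypothetical helper: succeed if the given config file contains a line
# setting use-ecs-apps = true. It matches the line only and does not
# check which section of the file the line belongs to.
ecs_only_enabled() {
  grep -Eq '^[[:space:]]*use-ecs-apps[[:space:]]*=[[:space:]]*true([[:space:]]|$)' "$1"
}

# Usage on the load test box:
#   ecs_only_enabled simulation.conf && echo "ECS-only mode is set"
```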
If you're running a long load test then you may lose connection to the box. To ensure the load tests keep running, use screen, e.g.
screen
... # normal load test stuff
Ctrl+A Ctrl+D # background process
screen -x # come back later on
exit # quit the screen
You can also schedule load tests using cron: run crontab -e on the load test box to see what is currently scheduled. Note: the %4 at the end of each line makes cron pass 4 to the command on standard input, which selects the 4th option in the interactive prompt.
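Cron treats an unescaped % in the command field as a newline and sends everything after the first % to the command on standard input. Outside cron, the same effect comes from piping the choice into the command. A minimal sketch; choose_option is a hypothetical stand-in for the interactive script, not the real load test launcher:

```shell
# Hypothetical stand-in for an interactive script that asks which option to run.
choose_option() {
  local choice
  read -r choice
  echo "selected option ${choice}"
}

# Equivalent of a crontab entry ending in %4: the text after the % arrives
# on the command's standard input.
printf '4\n' | choose_option
```

This is a handy way to try out a scheduled command by hand before adding it to the crontab.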
After a performance test finishes, a timestamped folder is created containing an HTML report. Download the folder by running the following command from a local bash shell (not inside the SSH session), with RESULTS_FOLDER set to the name of that folder.
scp -i ~/.ssh/id_perf_tests -r ubuntu@$LOAD_TEST_BOX:~/notifications-performance-tests/results/$RESULTS_FOLDER ~/Downloads/perf-test-results
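If you don't know the folder name, one option is to pick the most recently modified entry in the results directory. A minimal sketch, assuming the newest folder is the one you want; the function name latest_results_folder is hypothetical:

```shell
# Hypothetical helper: print the most recently modified entry in a directory.
latest_results_folder() {
  ls -1t "$1" | head -n 1
}

# Usage from a local shell, assuming the results path used in the scp command above:
#   RESULTS_FOLDER=$(ssh -i ~/.ssh/id_perf_tests -l ubuntu "$LOAD_TEST_BOX" \
#     'ls -1t ~/notifications-performance-tests/results | head -n 1')
```

If several tests ran close together, check the folder name by eye rather than relying on modification time.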
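Gatling only measures the duration of the requests it makes to the API. You can query the DB to understand the impact of a change on, for example, the time it takes to create, send and mark a notification as delivered. The following query reports the percentage of delivered notifications whose updated_at is within 1, 3, 5, 10, 30 and 60 seconds of created_at; adjust the service ID and time window to match your test.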
select
count(*),
round(sum(case when time <= 1 then 1 else 0 end)::decimal / count(*), 2) * 100 as under_1,
round(sum(case when time <= 3 then 1 else 0 end)::decimal / count(*), 2) * 100 as under_3,
round(sum(case when time <= 5 then 1 else 0 end)::decimal / count(*), 2) * 100 as under_5,
round(sum(case when time <= 10 then 1 else 0 end)::decimal / count(*), 2) * 100 as under_10,
round(sum(case when time <= 30 then 1 else 0 end)::decimal / count(*), 2) * 100 as under_30,
round(sum(case when time <= 60 then 1 else 0 end)::decimal / count(*), 2) * 100 as under_60
from (
select extract (epoch from (updated_at - created_at)) as time
from notifications
where service_id='1a9d1d3d-f13f-434c-a135-cc30daebf5dd'
and created_at >= '2021-10-27 14:00' and created_at <= '2021-10-27 15:00'
and notification_status='delivered'
) as subquery;
🚨 Ensure all traffic has finished processing (including making sure any sender queues are empty) before rolling back! If you skip this you can end up costing us lots of money and hitting our SMS limits. 🚨
🚨 Check SQS depths in the AWS console to ensure they are all 0. 🚨
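The queue depth check can also be scripted from a terminal. A minimal sketch, assuming the aws CLI and jq are installed and that the gds aws notify-staging wrapper used earlier applies here too; the function names and the QUEUE_URL variable are hypothetical:

```shell
# Succeed only if every line on standard input is exactly 0.
# Empty input counts as a failure, so a failed lookup cannot look like "all clear".
all_queues_empty() {
  local depth seen=0
  while read -r depth; do
    seen=1
    [ "$depth" = "0" ] || return 1
  done
  [ "$seen" -eq 1 ]
}

# Print the visible and in-flight message counts for one queue URL, one per line.
# Not called here: it needs AWS credentials, the aws CLI and jq.
queue_depths() {
  gds aws notify-staging -- aws sqs get-queue-attributes \
    --queue-url "$1" \
    --attribute-names ApproximateNumberOfMessages ApproximateNumberOfMessagesNotVisible \
    | jq -r '.Attributes[]'
}

# Usage, with QUEUE_URL set to a queue URL (for example from aws sqs list-queues):
#   queue_depths "$QUEUE_URL" | all_queues_empty && echo "queue is empty"
```

The SQS counts are approximate, so treat a passing check as a prompt to confirm in the console rather than a replacement for it.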
Manually unpause the notify infra pipeline for staging, then start a new deployment by using "re-run with same inputs" on the last build.
From notifications-aws run:
make load-unpause-pipelines
In the notifications-aws repo, run gds aws notify-staging-admin -- ./scripts/ddos-protection/ddos-protection.sh, select the API CloudFront instance and choose 'Enable' (Block).