Disaster recovery v2 - alphagov/notifications-manuals GitHub Wiki

Introduction

This document provides a summary of the Business Continuity Plan (BCP) and Disaster Recovery (DR) plans for the GOV.UK Notify service. It relates only to the Notify team, critical suppliers and service, and does not cover offices, corporate IT or staffing issues, which should be covered by the GDS and/or Cabinet Office BCP.

This document should be read in conjunction with the BCP/DR plans for GDS managed by the workplace management team and the GDS IT continuity plan.

This document should act as a summary of resilience (so the risk level can be assessed), and also provide an overview of residual risks. Links to specific processes and procedures to use during a BCP/DR situation or test exercise are also included.

All processes and procedures linked from this document should be used alongside the Notify incident process, which should be followed at all times.

This document is for internal use only, and should not be shared outside of GDS.

Notify SaaS

Notify is hosted on Amazon AWS, predominantly using the eu-west-1 (Ireland) region, using availability zones a, b and c for resilience. The CI/CD deployment system is hosted in eu-west-2 (London).

Each Notify environment (multiple dev, staging, production, deploy, monitoring and tools) has its own segregated AWS account, provisioned by GDS Engineering Enablement.

The gds cli tool is used by the Notify team members to authenticate to the AWS console and assume roles set up by GDS Engineering Enablement. Console access is not required for normal operation of the Notify platform, but is needed during BCP/DR scenarios.

Notify's infrastructure architecture extensively uses AWS high availability capabilities using availability zones within a region (as described in the sections for the critical components below). However, Notify is not designed to withstand an AWS region failure and this is recorded as a risk.

Software components

Notify software components (notifications-admin, notifications-api, document-download etc.) are deployed as Docker containers to Amazon Elastic Container Service (ECS). Containers run on AWS Fargate, a serverless compute resource which runs each container in an isolated boundary (so that it does not share kernel, CPU, memory or network resources with other containers). ECS is configured to balance load across all three availability zones in eu-west-1.

Multiple instances of the same Notify component (or task) are run by ECS, with an AWS application load balancer routing traffic to healthy instances to balance load.

Each component is configured in ECS with a minimum and maximum number of instances running concurrently, and these running instances are evenly distributed across Fargate resources in the three availability zones in eu-west-1. Should an instance fail, ECS will automatically destroy the failed instance and create a new one before registering it with the load balancer. Autoscaling (based on CPU usage) is also configured to dynamically increase and decrease the number of instances of a given task needed to cope with load on the system.

This design provides resilience in the event of:

A single instance of a component failing because of a software error
A Fargate instance being removed from service by AWS
An AWS availability zone having availability or capacity issues
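The effect of spreading instances across availability zones can be illustrated with a small model. This is a minimal sketch, not the ECS scheduler: the round-robin placement, the function names and the example numbers are all assumptions made for illustration, and only the three eu-west-1 zones come from the text above.

```python
from collections import Counter
from itertools import cycle

# Zone names follow the "a, b and c" zones in eu-west-1 described above.
AVAILABILITY_ZONES = ["eu-west-1a", "eu-west-1b", "eu-west-1c"]


def spread_tasks(task_count, zones=AVAILABILITY_ZONES):
    """Place tasks across zones round-robin, approximating an even spread."""
    placement = Counter({zone: 0 for zone in zones})
    for _, zone in zip(range(task_count), cycle(zones)):
        placement[zone] += 1
    return dict(placement)


def surviving_tasks(placement, failed_zone):
    """Count the tasks still running if one zone becomes unavailable."""
    return sum(count for zone, count in placement.items() if zone != failed_zone)


def clamp_desired_count(desired, minimum, maximum):
    """Keep an autoscaling decision within the configured bounds."""
    return max(minimum, min(desired, maximum))


# Hypothetical example: six instances of one component.
placement = spread_tasks(6)
print(placement)
print(surviving_tasks(placement, "eu-west-1a"))
print(clamp_desired_count(desired=12, minimum=3, maximum=9))
```

With an even spread over three zones, losing one zone removes roughly a third of the running instances, which is why the minimum instance count matters when sizing for an availability zone failure.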

This design also allows the deployment system to create new images in the Elastic Container Registry (ECR), and to deploy them using a rolling update to ensure the Notify service remains operational during a deployment. In the event of a failed deployment, the CI/CD system can be configured to roll back the changes.

Code

Notify code is stored at github.com in the alphagov organisation. Permission models, security controls and the code review process relating to code updates are applied as per the GDS Way.

Repositories on GitHub are not backed up (other than the code also being on Notify team members' systems), although github.com has a function that allows deleted repositories to be restored within 90 days of deletion. In the event of github.com availability issues, the Concourse CI/CD system is unable to build and deploy Notify software and infrastructure updates. This is recorded as a risk.

Database

Notify uses a PostgreSQL database to store application data including service details, user accounts, templates, API keys, notification status and billing. PostgreSQL is delivered using the Amazon Relational Database Service (RDS) PaaS service, using a Multi-AZ with one standby architecture. The RDS PaaS solution provides automated backup and failover capabilities. RDS also offers automated patching capabilities, however this is currently disabled because it has caused availability issues in the past. Instead, all RDS patches are scheduled by the infrastructure team during a maintenance window.

The Multi-AZ with one standby RDS architecture replicates traffic from a live database instance to a standby instance in a separate availability zone. In the event of failure of the primary database (catastrophic PostgreSQL software failure, server failure, or AZ availability issue), RDS will automatically promote the standby PostgreSQL instance to primary and adjust the DNS records so that applications will seamlessly use the newly promoted database. This failover process is also used during patching windows to ensure maximum availability of the database, and thus testing of the RDS failover process is completed regularly.

The Notify Amazon RDS instance is configured with automatic backups enabled. The backup system is configured to take daily incremental snapshots of the RDS instance (backing up the standby instance EBS disk to prevent table locking), which are stored in a separate AZ to the live database instance. Snapshots are configured to be retained for 31 days and are stored by RDS in S3.

Automatic backup also makes backups of transaction logs in five minute buckets to allow point in time recovery. This supports a Recovery Point Objective (RPO) of approximately 5 minutes. This provides resilience in the event of data corruption caused by the application or an administrative action (such as dropping a table).
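The relationship between the five minute transaction log buckets, the RPO and the retention window can be checked with simple date arithmetic. The sketch below is illustrative only: the function names and the example timestamps are hypothetical, and it assumes the 31 day snapshot retention also bounds how far back a point in time restore can reach.

```python
from datetime import datetime, timedelta

RPO = timedelta(minutes=5)       # transaction log bucket size from the text
RETENTION = timedelta(days=31)   # snapshot retention from the text


def data_loss_window(incident_time, last_log_backup):
    """Time between the last backed-up transaction log and the incident."""
    return incident_time - last_log_backup


def is_restorable(target_time, now, latest_restorable_time):
    """A restore target must sit inside the retention window and be no later
    than the most recent backed-up transaction log."""
    return now - RETENTION <= target_time <= latest_restorable_time


# Hypothetical timeline for illustration.
now = datetime(2024, 1, 31, 12, 0)
last_log_backup = datetime(2024, 1, 31, 11, 57)
bad_change = datetime(2024, 1, 31, 11, 30)  # e.g. a table dropped by mistake

# Is the potential data loss within the RPO?
print(data_loss_window(now, last_log_backup) <= RPO)

# Can the database be restored to just before the bad change?
print(is_restorable(bad_change - timedelta(seconds=1), now, last_log_backup))
```

For an administrative error such as a dropped table, the useful restore target is the moment just before the change, not the latest restorable time.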

Notify RDS has deletion protection enabled (to prevent accidental deletion), and backups are not automatically deleted when an instance is deleted (although this can be requested by the user).

The Amazon RDS Multi-AZ architecture offers high availability and resilience using multiple AZs. However, it does not provide protection for the following scenarios:

AWS issues affecting a region (eu-west-1)
Malicious insider deleting the database and backups
Ransomware / extortion attacks should an attacker gain access to the infrastructure

This is recorded as a risk.

Caching servers

Notify uses Redis for caching to increase performance, and for supporting rate limiting functions. Redis is delivered using the Amazon ElastiCache PaaS service, using a Multi-AZ failover architecture configured with a primary and two replica nodes hosted in different AZs. The ElastiCache PaaS solution provides automated backup, minor patching and failover capabilities. In the event of failure of the primary cache server (catastrophic Redis software failure, server failure, or AZ availability issue), ElastiCache will automatically promote one of the standby instances to primary and adjust the DNS records so that applications will seamlessly use the newly promoted cache. This failover process is also used during automatic patching windows to ensure maximum availability of the system, and thus testing of the cache server failover process is completed regularly.

The Notify ElastiCache system is configured with automatic backups enabled. The backup system is configured to take daily snapshots of Redis. Snapshots are configured to be retained for 30 days. In practice backups are unlikely to be used in the event of a DR situation. In the event of data corruption in Redis or loss of the whole ElastiCache service, the application will continue to operate in a degraded manner:

Queries that are usually cached will be queried directly from the database
Rate limiting will be inoperable

In the event of the cache becoming stale or corrupted, the Notify team manual contains instructions on how to clear the cache and allow it to rebuild.
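The degraded behaviour described in this section (cached queries falling back to the database, and rate limiting becoming inoperable) follows the general shape sketched below. This is not the Notify implementation: `FakeCache`, `CacheUnavailable`, `get_template` and `within_rate_limit` are hypothetical names, and an in-memory stand-in is used in place of Redis so the example is self-contained.

```python
class CacheUnavailable(Exception):
    """Raised by the hypothetical cache client when Redis cannot be reached."""


class FakeCache:
    """In-memory stand-in for a Redis client, used only for illustration."""

    def __init__(self, available=True):
        self.available = available
        self.store = {}

    def get(self, key):
        if not self.available:
            raise CacheUnavailable()
        return self.store.get(key)

    def set(self, key, value):
        if not self.available:
            raise CacheUnavailable()
        self.store[key] = value


def get_template(template_id, cache, load_from_database):
    """Read through the cache, falling back to the database if it is down."""
    key = f"template-{template_id}"
    try:
        cached = cache.get(key)
        if cached is not None:
            return cached
    except CacheUnavailable:
        # Degraded mode: skip the cache and query the database directly.
        return load_from_database(template_id)
    value = load_from_database(template_id)
    try:
        cache.set(key, value)
    except CacheUnavailable:
        pass  # The value is still returned; it just is not cached.
    return value


def within_rate_limit(key, cache, limit):
    """Count requests in the cache; allow every request if the cache is down."""
    try:
        count = (cache.get(key) or 0) + 1
        cache.set(key, count)
    except CacheUnavailable:
        return True  # Rate limiting is inoperable without the cache.
    return count <= limit
```

The trade-off in this sketch is that a cache outage increases database load and removes rate limiting, but requests continue to be served.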

Document store

Notify uses the Amazon Simple Storage Service (S3) as an object store to store:

System data
- CloudFront, S3 and load balancer logs
- Static website content
- Lambda code (run by an AWS Lambda task)
- Template preview cache
- Terraform state
Customer data
- CSV uploads including contact and emergency lists
- Precompiled letters
- Letter attachments
- Document download files
- Logos

Notify uses the S3 Standard storage class, which means objects are automatically stored across multiple devices spanning a minimum of three Availability Zones (AZs). S3 Standard is designed for 99.99% availability. In addition, object versioning is used to protect against accidental overwrite and accidental deletion, and is enabled for the following buckets:

Terraform state
Document download files

Note that object versioning does not prevent an admin user from permanently deleting S3 objects. S3 offers an MFA-protected delete setting (MFA Delete), but this is not enabled as Terraform would be unable to manage the buckets.
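The difference between a recoverable delete and a permanent delete on a versioned bucket can be shown with a small in-memory model. This is a simplified sketch of the versioning behaviour, not a client for the S3 API: the `VersionedBucket` class, its integer version IDs and the example key and contents are assumptions made for illustration.

```python
import itertools

DELETE_MARKER = object()  # placeholder body representing an S3 delete marker


class VersionedBucket:
    """Toy model of a bucket with object versioning enabled."""

    def __init__(self):
        self._versions = {}  # key -> list of (version_id, body), oldest first
        self._ids = itertools.count(1)

    def put(self, key, body):
        """Every write adds a new version rather than overwriting in place."""
        version_id = next(self._ids)
        self._versions.setdefault(key, []).append((version_id, body))
        return version_id

    def delete(self, key, version_id=None):
        if version_id is None:
            # A plain delete only adds a delete marker; older versions remain.
            return self.put(key, DELETE_MARKER)
        # Deleting a specific version removes it permanently.
        self._versions[key] = [
            v for v in self._versions.get(key, []) if v[0] != version_id
        ]
        return version_id

    def get(self, key):
        """Return the current version, or raise if the object appears deleted."""
        versions = self._versions.get(key, [])
        if not versions or versions[-1][1] is DELETE_MARKER:
            raise KeyError(key)
        return versions[-1][1]


# Hypothetical example: a state file is overwritten with bad content, then deleted.
bucket = VersionedBucket()
v1 = bucket.put("terraform.tfstate", "good state")
v2 = bucket.put("terraform.tfstate", "bad state")
marker = bucket.delete("terraform.tfstate")

# Recover by removing the delete marker and the bad version.
bucket.delete("terraform.tfstate", version_id=marker)
bucket.delete("terraform.tfstate", version_id=v2)
print(bucket.get("terraform.tfstate"))
```

The same model shows the limitation noted above: anyone able to delete by version ID can remove every version, so versioning protects against accidents rather than against a malicious administrator.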

The Amazon S3 architecture offers high availability and resilience using multiple AZs. However, it does not provide protection for the following scenarios:

AWS issues affecting a region (eu-west-1)
System or application misconfiguration that causes a bucket or object to be deleted (with the exception of buckets with versioning enabled)
Malicious insider deleting the documents
Ransomware / extortion attacks should an attacker gain access to the infrastructure

This is recorded as a risk.

Queuing

Notify has a microservices architecture which uses Celery task queues to provide flexible task management (retries etc.) and scaling of the platform. Celery is configured to use Amazon Simple Queue Service (SQS) as a managed message queue. Amazon SQS stores all message queues and messages across multiple redundant Availability Zones (AZs), so that no single computer, network, or AZ failure can make messages inaccessible. Queues are dynamically created by the Notify application (in the event they don't already exist). SQS should be considered resilient. In the event of a region outage, the general SaaS hosting risk applies (data is lost).
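For readers unfamiliar with running Celery on SQS, the configuration has roughly the following shape. This is an illustrative sketch and not the Notify configuration: the setting names are standard Celery options, but the queue prefix, queue name and timeout values are hypothetical. It is written as a plain dictionary so that it can be read without Celery installed.

```python
# Hypothetical Celery settings for an SQS broker (values are examples only).
CELERY_SETTINGS = {
    # "sqs://" with no credentials lets the AWS SDK pick them up from the
    # environment or an attached IAM role.
    "broker_url": "sqs://",
    "broker_transport_options": {
        "region": "eu-west-1",
        # Prefix added to every queue name, e.g. to separate environments.
        "queue_name_prefix": "example-",
        # Seconds a received message stays hidden from other workers before
        # SQS makes it visible again for redelivery.
        "visibility_timeout": 310,
        # Long polling interval in seconds when waiting for messages.
        "wait_time_seconds": 20,
    },
    # Queue used when a task does not name one explicitly.
    "task_default_queue": "example-tasks",
}

print(CELERY_SETTINGS["broker_transport_options"]["region"])
```

Settings like these would typically be applied with `app.conf.update(CELERY_SETTINGS)` on a Celery application object.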

DNS

Notify DNS records are provisioned by Terraform (including SPF and DMARC) or via other AWS services (such as RDS) and are managed by Amazon Route53. As such, any loss of DNS records can be resolved with a Terraform deploy.

Route53 is a globally distributed DNS infrastructure which provides resilience to internet or network related issues. Route53 should be considered resilient.

Domains

Production domain names are procured and managed by the GOV.UK team, and delegated to the Notify production account via govuk-dns-tf to be managed in Route53 (see DNS section).

In the event of a domain delegation or renewal issue, a high severity issue should be raised with GOV.UK support. This is recorded as a risk.

Bastion

Notify uses AWS security groups to prevent direct access to the Notify RDS database, so that only Notify applications can connect to the database. For administration of the database, Notify deploys a hardened EC2 host as a Bastion. The gds cli tool is used by the Notify team to authenticate and assume a suitable role (with database read-only or read/write permissions). The db-connect script then allows Notify admins to tunnel through to the database using a short-lived SSH tunnel to complete administration tasks.

Access to the Bastion is not required for normal operation of the Notify system, however it is used frequently by team members to execute ad hoc queries and investigate system issues. It is likely the Bastion would be used during a DR / BCP scenario.

A single EC2 host does not offer built-in resilience, so the Bastion is a single point of failure. This is recorded as a risk.

Emails

Notify allows services to send outbound emails to citizens via the admin interface and also via the API. Notify sends emails from the notifications.service.gov.uk domain using Amazon Simple Email Service. SES is configured with a dedicated pool of 25 IP addresses from which to send messages.

Notify does not offer dedicated IP addresses for each sending service (which is likely impractical) and emails are sent from different email addresses from the same domain. There is therefore a risk that should a service send an email which large numbers of users flag as spam (or is automatically marked as spam), the spam reputation of Notify sending IP addresses or domains may be impacted, resulting in higher than normal automatic spam filtering of all messages sent. This is recorded as a risk.

Amazon SES is a PaaS solution which offers resilience across availability zones within a region. SES also supports global endpoints that can create cross region resilience, however this is not enabled for the Notify service (and would only be applicable should Notify become Region fault tolerant in the future). In the event of region outage, the general SaaS hosting risk applies (data is lost and service is unavailable).

Notify third party integrations

Notify uses third party services for the delivery of notifications sent via the platform. This section describes the resilience and risks associated with services used to deliver SMS and letter notifications.

UK Outbound SMS

Notify has been integrated with FireText and MMG as outbound suppliers for UK destined mobile SMS messages. The Notify software architecture has a balancing algorithm which spreads the outbound SMS load equally across both suppliers during normal operation. In the event that a supplier has availability or performance issues, the Notify balancing algorithm will dynamically adjust the amount of traffic sent to each supplier to ensure reliable delivery.
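The idea of spreading load across two suppliers and shifting traffic away from one that is struggling can be sketched as a weighted random choice. This is not the Notify balancing algorithm: the `ProviderBalancer` class, the percentage weights and the size of each adjustment are assumptions made purely to illustrate the technique.

```python
import random


class ProviderBalancer:
    """Pick an SMS supplier at random, in proportion to its current weight."""

    def __init__(self, weights, rng=None):
        self.weights = dict(weights)
        self.rng = rng or random.Random()

    def choose(self):
        """Return a supplier name, ignoring suppliers with zero weight."""
        active = {name: w for name, w in self.weights.items() if w > 0}
        if not active:
            raise RuntimeError("no SMS supplier available")
        names = list(active)
        return self.rng.choices(names, weights=[active[n] for n in names], k=1)[0]

    def report_problem(self, name, step=10):
        """Move up to `step` weight from a struggling supplier to the others."""
        moved = min(step, self.weights[name])
        self.weights[name] -= moved
        others = [n for n in self.weights if n != name]
        for other in others:
            self.weights[other] += moved / len(others)


# Normal operation: an equal split between the two suppliers.
balancer = ProviderBalancer({"firetext": 50, "mmg": 50})
print(balancer.choose())

# Hypothetical incident: one supplier degrades, so its share is reduced.
balancer.report_problem("mmg", step=10)
print(balancer.weights)
```

A real implementation would also need a way to restore weight once the supplier recovers; that is left out of this sketch.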

Each SMS supplier has its own high availability features and BCP/DR plans, which are summarised in Appendix A.

Outbound SMS to UK numbers should be considered resilient.

International SMS

Notify supports sending of SMS notifications to international mobile numbers. For cost reasons, only MMG is used for sending international SMS messages. MMG has high availability features and a BCP/DR plan which are summarised in Appendix A.

In the event of an MMG outage Notify is unable to send International SMS messages. This is recorded as a risk.

Inbound SMS

Notify supports inbound SMS where citizens can send an SMS to a number or reply to an SMS sent from Notify, and the message is made available for the Notify service team via both the admin interface and the API using a callback. For cost reasons, standard inbound SMS services are only supplied by MMG. MMG has high availability features and a BCP/DR plan which are summarised in Appendix A.

For historical reasons there is an exception to using MMG for all inbound messages: an inbound SMS number used by GovWifi for wifi setup is supplied by FireText. FireText has high availability features and a BCP/DR plan which are summarised in Appendix A.

In the event of an MMG outage Notify is unable to receive inbound SMS messages. In the event of a FireText outage Notify is unable to receive inbound GovWifi messages. These are recorded as risks.

Letters

Notify supports sending of letters using templates or precompiled PDFs. DVLA is used for letter printing and dispatch via Royal Mail or UK Mail. DVLA has high availability features and a BCP/DR plan which are summarised in Appendix A.

In the event of a DVLA printing outage, Notify is unable to send letters. This is recorded as a risk.

Notify deployment (CI/CD)

The Notify Continuous Integration / Continuous Deployment (CI/CD) system uses Concourse automation to execute Notify deployment tasks using initialization scripts, a credentials repository, GitHub repositories containing Notify code, Terraform, and Docker. Concourse is used to deploy infrastructure and software changes to Notify, along with running automated tests. More information about the Concourse instance used to deploy Notify infrastructure and software can be found in the team manual.

Concourse is hosted in the AWS deploy account in eu-west-2 (London) using EC2 workers and an Amazon RDS database for storing state. Concourse is itself deployed to AWS with Terraform, run from a Concourse pipeline. RDS is in Multi-AZ mode, and has automated backups enabled which retain Concourse state for 5 days. The Concourse database therefore has similar resilience properties to the main Notify database (albeit with a shorter retention period).

In the event that Concourse is unavailable, the Notify team are unable to deploy infrastructure and software updates. This is recorded as a risk.

Logging and monitoring

Notify logging and monitoring capabilities use a mixture of in house managed tools and cloud based services described in this section.

Metrics

Notify software metrics are collected from the system using OpenTelemetry and StatsD agents and pushed to a Prometheus time series database. Prometheus Alertmanager is used to generate alerts, and Grafana is used to display stats. Each Notify environment has its own Prometheus database and Grafana instance (documented as it will be under the new pipeline), which are deployed as part of the standard Notify deployment process using Concourse.

Notify infrastructure metrics are sent to Amazon CloudWatch, and some threshold monitoring alerts have been set up to send warnings to the team in the event of significant load or application failure rates.

In the event that the metrics system is unavailable, the Notify team are unable to monitor the performance and availability of the system. This is recorded as a risk.

Logging

Notify application logs and some infrastructure logs (for example CloudFront) are sent to a Logit managed ELK log management platform, and traces are sent to Sentry. Logit resilience and high availability are discussed in their G-Cloud submission (see Appendix A).

In the event that the logging system is unavailable, the Notify team are unable to monitor the performance and availability of the system. This is recorded as a risk.

Availability monitoring

Incident escalation

Security

Security of the Notify system is tested on an annual basis as part of the IT Health Check (ITHC) process. Findings are recorded, and prioritised based on severity. In addition, Notify is subject to a regular risk review, and a Risk Treatment Plan (RTP) is maintained. Notify also attend quarterly security working group meetings with colleagues from Infosec to discuss Notify security risks and mitigations.

It is not the intention to record the issues and risks identified from the above work in this document. However, two generic risks are recorded in the risks section.

Risks

Based on the summary of resilience above, the following is a list of residual risks, actions needed to mitigate and links to processes.

Risk	Status	Outcome	Action	Process
AWS region is unavailable for a short period	Accepted risk	Notify will be unavailable for period of outage	Incident process to manage communications only	Notify incident plan
AWS region is unavailable for an extended period	Accepted risk	Notify will be unavailable until rebuilt in separate Region. Once rebuilt Notify would not be able to restore data, and therefore not have any historical records. Services would have to re-signup and reconfigure.	Reconfigure the CI/CD system to rebuild Notify in another region.	To be written
Github.com unavailable for extended period (system issue, business issue, accidental deletion or cyber security event)	Accepted risk	Notify CI/CD unable to push software	Create new code management account (local or cloud) and ask developers to push latest versions of code and branches. Reconfigure concourse to use new code management solution.	N/A
System or application misconfiguration that modifies or deletes data from the Notify database	Mitigated risk	System outage while database is restored	Restore database using point in time backup. Followed by manually fixing data in current database or reconfiguring application to use restored database	Database restore process last restore ticket
Malicious insider or ransomware group deletes the database and backups	Accepted risk	Notify will be unavailable until rebuilt in a separate AWS account. Once rebuilt Notify would not be able to restore data, and therefore not have any historical records. Services would have to re-signup and reconfigure.	Configure the CI/CD system to rebuild Notify in a separate account, or overwrite the existing database with an empty version once the system has been secured.	To be written
System or application misconfiguration that modifies or deletes data from the S3 object store with versioning enabled	Mitigated risk	Document download feature and/or Terraform deployment unavailable until objects are restored.	Restore objects using S3 versioning.	To be written
System or application misconfiguration that modifies or deletes data from the S3 object store where versioning is not enabled	Accepted risk	No letter, document-download, emergency contact, or contact list services available. Logos in templates do not work	Request services re-upload content to resume using their Notify service	N/A
Malicious insider or ransomware group permanently deletes the S3 object store	Accepted risk	Letters, logos, documents, contact lists and AWS logs will be lost. Likely need to redeploy Notify to rebuild the Terraform state file and switch over to new infrastructure.	Request services re-upload content to resume using their Notify service once the system has been secured. Rebuild infrastructure with Terraform and switch over	To be written
Notify production domain delegation settings are changed, or the domain is not renewed	Accepted risk	Notify services will be unavailable. As this is a .gov.uk domain it is unlikely there is a risk of domain squatting or misuse due to the rules associated with gov.uk domain names	Contact GOV.UK to raise a high severity support case	To be written
Database Bastion is unavailable due to EC2 availability, system issues or gds cli issues	Mitigated risk	Service will continue to operate. Database administration will be unavailable until the Bastion is restored or a temporary security group rule is put in place to allow emergency access	Redeploy bastion via Terraform or enable a temporary security group to allow access to the database from a known IP address	To be written
Notify email domain or sending IP addresses get marked as Spam	Accepted risk	Notify email delivery success rates drop due to spam filtering and black listing	Explore options for adding a second pool of IP addresses and using different domains to send emails (this will take significant time because of domain registration and IP warm up spam rules). Resolve initial spam issue and contact providers to remove domains and IP addresses from spam lists.	N/A
MMG is unable to process messages (system issues, business event, cyber security incident etc)	Accepted risk	Notify is unable to send SMS to international numbers. Notify unable to receive inbound SMS messages	Explore options for international and inbound SMS with other suppliers	N/A
FireText is unable to process SMS messages (system issues, business event, cyber security event etc)	Accepted risk	Notify unable to receive inbound SMS messages for the GovWifi service	Explore options for inbound SMS for GovWifi with other suppliers	N/A
DVLA is unable to process letters (system issues, business event, cyber security event etc)	Accepted risk	Notify unable to send letters	Explore options for second letter supplier	N/A
Concourse system error or misconfiguration makes Concourse unavailable	Mitigated risk	Notify unable to deploy Notify software and infrastructure changes	Redeploy Concourse using Terraform from a laptop and restore the database from backup	To be written
Malicious insider or ransomware group deletes the Concourse database and backups	Accepted risk	Notify unable to deploy Notify software and infrastructure changes	Rebuild the Concourse CI/CD system using Terraform from a laptop. CI/CD history will be lost, meaning rollback is unavailable	To be written
Prometheus or Grafana is unavailable (system issue, configuration issue etc)	Accepted risk	Notify team unable to monitor the performance and availability of the Notify system	System to be rebuilt using Concourse	N/A
Logit ELK stack is unavailable for extended period (system issues, business event, cyber security event etc)	Accepted risk	Notify team unable to monitor the logs from the Notify system	Team to find an alternative logging solution	N/A
Security incident detected relating to a single service	Mitigated risk	Impact on affected service only	Access to the service will be revoked and standard incident process followed to report cyber security incident.	To be written
Serious security incident detected	Accepted risk	Notify will be unavailable	Access to Notify should be suspended and the incident process should be followed to report cyber security incident.	To be written

Appendix A - Notify Suppliers

DVLA

The Mail and Print Notification service is in scope of the DVLA ISO 27001 information security management system, and has signed a Memorandum of Understanding (MOU) with Notify covering performance OLAs as well as cyber security, data privacy, physical and personnel security controls.

DVLA host their printhub software and APIs in AWS (eu-west-2). They also have two physical sites containing Ricoh printers used to print and dispatch letters, providing a level of resilience against equipment or site issues. No further BCP or business continuity items are recorded in the MOU.

FireText

FireText SMS services are in scope of an ISO 27001 information security management system. FireText have shared their Business Continuity Plan with the Notify team.

The FireText BCP document only contains high level information. However, it is possible to summarise that FireText have deployed their infrastructure with redundant application servers within their active data centre, and have a redundant data centre design for BCP/DR scenarios.

FireText have an automated incident notification system, which is integrated with the govuk-notify-supplier-status Slack channel.

MMG

The MMG service description provides an overview of MMG high availability design.

In summary, MMG have deployed pairs of F5 load balancers and MySQL databases for resilience, and traffic is balanced over multiple API servers. MMG primarily use Rackspace LON5, and use Rackspace LON3 as a backup. The Rackspace datacentres used by MMG also benefit from Uninterruptible Power Supply (UPS) and diesel generator protection.

Logit

Logit resilience and high availability are discussed in their G-Cloud submission. High availability Elasticsearch clusters are deployed for resilience and scalability.