Disaster recovery plan - codemagic-ci-cd/company-handbook GitHub Wiki
Disaster recovery objectives
RTO (Recovery Time Objective)
- 8h (based on the largest SLA)
RPO (Recovery Point Objective)
- 4 hours for users and teams data, application configurations.
- 7 days for build history.
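The RPO targets above can be checked mechanically against the timestamp of the most recent backup. The following is a minimal sketch, not the handbook's tooling: the dictionary keys and the `rpo_met` helper are hypothetical names introduced for illustration, and only the 4-hour and 7-day values come from the objectives listed above.

```python
from datetime import datetime, timedelta, timezone

# RPO targets from the objectives above. The keys are hypothetical labels,
# not identifiers used by any backup tool.
RPO_TARGETS = {
    "users_and_teams": timedelta(hours=4),
    "application_configurations": timedelta(hours=4),
    "build_history": timedelta(days=7),
}


def rpo_met(data_class, last_backup_at, now=None):
    """Return True if the newest backup is recent enough for the data class."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_backup_at <= RPO_TARGETS[data_class]


if __name__ == "__main__":
    # Example: a users/teams backup taken 5 hours ago misses the 4-hour RPO.
    five_hours_ago = datetime.now(timezone.utc) - timedelta(hours=5)
    print(rpo_met("users_and_teams", five_hours_ago))
```

A check like this could run after each backup job, so that a missed RPO is noticed before a disaster rather than during recovery.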
Major goals of a disaster recovery plan
Hardware and Software Inventory
Information services backup overview
Database: Backup and Restore policy
International escalations procedure
DRP steps
This plan assumes that all MongoDB nodes are unavailable and describes how to replace the cluster and restore its data from backup files.
- Checklist before start
  - GCP `us-east1-b` zone operational as expected
  - Granted access to download latest backup files
- Start new instances for the new cluster using the same instance type and MongoDB version as the existing cluster.
- Provision the new nodes to set up monitoring (see https://github.com/codemagic-dev/ansible/blob/main/setup_grafana_monitoring.yml).
- Ensure MongoDB is connected to the new cluster.
- Download the latest backup files to the master host.
- Restore the files in the following order:
  - backup file with `all` prefix
  - `applications` file
  - `teams` file
  - `users` file
  - `audit_log` file
- Run the cluster and ensure data is available in the `app` and `vmm` databases.
- Update DNS settings to point to the new IP addresses.
- Restart the backend and worker services and monitor the logs to confirm that the MongoDB connection is established successfully.
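The restore order in the steps above can be enforced in code rather than by hand. The following is a minimal sketch that sorts downloaded backup files into that order. It assumes, hypothetically, that each file name begins with the word listed in the plan (`all`, `applications`, `teams`, `users`, `audit_log`). The actual naming scheme of the backup files is not stated in this document, so adjust the matching accordingly.

```python
from pathlib import Path

# Restore order from the DRP steps. Matching on the start of the file name
# is an assumption made for illustration.
RESTORE_ORDER = ["all", "applications", "teams", "users", "audit_log"]


def restore_rank(path):
    """Return the position of a backup file in the restore order."""
    name = Path(path).name
    for rank, key in enumerate(RESTORE_ORDER):
        if name.startswith(key):
            return rank
    raise ValueError(f"unrecognised backup file: {name}")


def order_backups(paths):
    """Sort backup files so they can be restored in the documented order."""
    return sorted(paths, key=restore_rank)


if __name__ == "__main__":
    # Hypothetical file names, used only to demonstrate the ordering.
    files = ["users.gz", "all_2024.gz", "audit_log.gz", "teams.gz", "applications.gz"]
    for f in order_backups(files):
        print(f)
```

Raising an error on an unrecognised file is a deliberate choice here: during recovery it is safer to stop than to restore an unexpected file at an arbitrary position.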
DRP test
- The test should be conducted using a standalone MongoDB configuration.
- The test should be conducted in a temporary VPC whose default firewall settings prohibit any outside connection on MongoDB ports.
- The test doesn’t include the steps that relate to the production environment: configuring monitoring (step 3), updating DNS (step 8), and restarting production services (step 9).
- The hosts and VPC should be deleted after the test results are recorded in this document.
- Disk requirements: ~100GB available disk space
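Two of the test preconditions above, the disk requirement and the closed MongoDB ports, lend themselves to a quick pre-flight check. The following is a minimal sketch using only the Python standard library. The helper names are hypothetical, and port 27017 is MongoDB's default port, assumed here because this document does not state which ports the hosts use.

```python
import shutil
import socket

REQUIRED_FREE_BYTES = 100 * 10**9  # ~100GB, from the disk requirement above
MONGODB_DEFAULT_PORT = 27017       # MongoDB default; an assumption for these hosts


def has_free_space(path, required_bytes=REQUIRED_FREE_BYTES):
    """Return True if the filesystem containing `path` has enough free space."""
    return shutil.disk_usage(path).free >= required_bytes


def port_is_reachable(host, port=MONGODB_DEFAULT_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, filtered (timed out), or unresolvable: treat as unreachable.
        return False


if __name__ == "__main__":
    print("enough disk space:", has_free_space("."))
    # Run from outside the temporary VPC; the expected result is False.
    print("MongoDB port reachable:", port_is_reachable("127.0.0.1"))
```

Running `port_is_reachable` from a machine outside the temporary VPC against a test host's external address should return `False` if the firewall settings behave as the test requires.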