Disaster recovery plan

Disaster recovery objectives

This plan assumes that all MongoDB nodes are unavailable and describes how to replace the cluster and restore the data from backup files.

1. Checklist before start:
   - The GCP us-east1-b zone is operating as expected.
   - Access to download the latest backup files has been granted.
2. Start new instances for the new cluster, using the same instance type and MongoDB version as the existing cluster.
3. Provision the new nodes to set up monitoring (see https://github.com/codemagic-dev/ansible/blob/main/setup_grafana_monitoring.yml).
4. Ensure that MongoDB is connected to the new cluster.
5. Download the latest backup files to the master host.
6. Restore the files in the following order:
   1. the backup file with the "all" prefix
   2. the applications file
   3. the teams file
   4. the users file
   5. the audit_log file
7. Run the cluster and ensure that data is available in the app and vmm databases.
8. Update the DNS settings to point to the new IP addresses.
9. Restart the backend and worker services and monitor the logs to confirm that the MongoDB connection is established successfully.
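The restore order above is easy to get wrong under pressure, so it may help to let a small script sort the downloaded files. The following is a minimal sketch only, under assumptions this plan does not confirm: the file names are assumed to start with `all`, `applications`, `teams`, `users`, and `audit_log`, and the backups are assumed to be gzip-compressed `mongorestore` archives. The helper names `order_backups` and `restore_command` are hypothetical. The script only builds the commands and does not run them.

```python
from pathlib import Path

# Restore order taken from the plan. Matching on file-name prefixes is an
# assumption about how the backup files are named.
RESTORE_ORDER = ["all", "applications", "teams", "users", "audit_log"]


def restore_rank(path: Path) -> int:
    """Return the position of a backup file in the restore order."""
    for rank, prefix in enumerate(RESTORE_ORDER):
        if path.name.startswith(prefix):
            return rank
    raise ValueError(f"unrecognised backup file: {path.name}")


def order_backups(paths):
    """Sort backup files into the order in which they should be restored."""
    return sorted((Path(p) for p in paths), key=restore_rank)


def restore_command(uri: str, backup: Path) -> list:
    """Build (but do not run) a mongorestore command for one backup file.

    Assumes a gzip-compressed archive; adjust the flags to the real format.
    """
    return ["mongorestore", "--uri", uri, "--gzip", f"--archive={backup}"]
```

Printing `restore_command(uri, path)` for each path returned by `order_backups` gives a list of commands that can be reviewed before anything is executed on the master host.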

The test should be conducted using a standalone MongoDB configuration.
The test should be conducted in a temporary VPC whose default firewall settings prohibit any outside connection to the MongoDB ports.
The test doesn’t include the steps related to the production environment, such as configuring monitoring (step 3), updating DNS (step 8), and restarting production services (step 9).
The hosts and VPC should be deleted after the test results are recorded in this document.
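The firewall requirement for the test VPC can be spot-checked from a machine outside it. The sketch below attempts a plain TCP connection and reports whether it succeeded. It assumes MongoDB listens on its default port, 27017; the function name `is_port_open` is hypothetical, and the host to probe is whatever address the test instance has.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

When run from outside the temporary VPC against the test host on port 27017, the expected result is `False`. A `True` result would suggest that the firewall rules allow outside connections and should be reviewed before the test continues.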
- Disk requirements: ~100 GB of available disk space
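The disk requirement can be checked before any backup files are downloaded. This is a minimal sketch using only the Python standard library. It treats "~100 GB" as 100 × 10^9 bytes, which is an assumption, and the function name `has_enough_space` is hypothetical. The path should point at the volume that will hold the backup files and the restored data.

```python
import shutil

# ~100 GB from the disk requirement; the exact byte count is an assumption.
REQUIRED_BYTES = 100 * 10**9


def has_enough_space(path: str = ".", required: int = REQUIRED_BYTES) -> bool:
    """Return True if the filesystem containing path has enough free space."""
    return shutil.disk_usage(path).free >= required
```

Calling `has_enough_space("/var/lib/mongodb")` on the target host, with the path replaced by the actual data directory, gives a quick yes or no before the restore starts.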