Bazel remote caching disaster recovery plan - magma/magma GitHub Wiki

:warning: As of 4 January 2023, the Bazel remote cache for magma is deprecated; see #14796.

:warning: Instructions on how to re-deploy the remote cache can be found in the magma/ci-infra repository.

Disaster recovery in the case of poisoned caches

  • Option 1: Delete the entire cache (requires magma AWS access)
    • Step 1: Set the service's desired task count to zero and wait for the running tasks to stop.
    • Step 2: Manually empty the S3 bucket whose name is prefixed with "bazel-remote-cache-".
    • Step 3: Set the service's desired task count back to one.
  • Option 2: Delete and redeploy the entire remote caching setup (requires magma AWS and ci-infra repo access)
    • Step 1: Tear down the bazel-remote infrastructure by running terraform destroy -target=aws_s3_bucket.S3CacheBucket (the force_destroy = true option must already be enabled in the terraform code).
    • Step 2: Redeploy with terraform init and terraform apply.
  • Option 3: Invalidate the cache keys (requires CI codeowner approval)
    • With bazel-remote, this can be done by changing the --remote_cache URL, e.g. from https://user:pw@url:9090/current-cache to https://user:pw@url:9090/new-cache.
    • The URL needs to be changed in all affected workflows, and the change might require rebasing.
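
Options 1 and 3 above could be scripted roughly as follows. This is a minimal sketch and is not taken from the magma or ci-infra repositories. It assumes the cache runs as an AWS ECS service managed with the AWS CLI; the cluster, service, bucket and file names passed in are hypothetical placeholders. The helper defaults to a dry run that only prints the commands.

```shell
# Assumption: the remote cache runs as an AWS ECS service.
# DRY_RUN=1 (the default) only prints commands; DRY_RUN=0 executes them.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Option 1: purge_cache CLUSTER SERVICE BUCKET (all three names are hypothetical).
# Each step runs only if the previous one succeeded.
purge_cache() {
  # Step 1: scale to zero and wait until no tasks are running.
  run aws ecs update-service --cluster "$1" --service "$2" --desired-count 0 &&
  run aws ecs wait services-stable --cluster "$1" --services "$2" &&
  # Step 2: delete every object in the bucket (the bucket itself is kept).
  run aws s3 rm "s3://$3" --recursive &&
  # Step 3: scale back to one task.
  run aws ecs update-service --cluster "$1" --service "$2" --desired-count 1
}

# Option 3: rename_cache_path FILE OLD NEW
# Rewrites the path after port 9090 in FILE and keeps a FILE.bak backup.
# Assumes OLD and NEW are simple names without '#' or regex metacharacters.
rename_cache_path() {
  sed -i.bak "s#:9090/$2#:9090/$3#g" "$1"
}
```

For example, `purge_cache my-cluster my-service bazel-remote-cache-example` prints the four AWS CLI calls for review, and running it again with `DRY_RUN=0` executes them. `rename_cache_path` would be applied to each workflow file that sets `--remote_cache`.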

Disaster recovery in the case of a broken remote cache service

  • Option 1: Delete and redeploy the entire remote caching setup (requires magma AWS and ci-infra repo access)
    • Tear down the bazel-remote infrastructure by running terraform destroy -target=aws_s3_bucket.S3CacheBucket (the force_destroy = true option must already be enabled in the terraform code). Then redeploy with terraform init and terraform apply.
  • Option 2: Re-implement the GH caches (requires CI codeowner approval).
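
The Terraform teardown and redeploy described in both sections could be wrapped as shown here. This is a hedged sketch: the directory argument is a hypothetical path to the Terraform code in a ci-infra checkout, and the `-chdir` option assumes Terraform 0.14 or newer. The commands themselves are the ones named above, in the same order, and the helper defaults to a dry run that only prints them.

```shell
# DRY_RUN=1 (the default) only prints commands; DRY_RUN=0 executes them.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# redeploy_cache TERRAFORM_DIR (hypothetical path inside a ci-infra checkout).
# Requires force_destroy = true on the bucket resource, as noted above.
# On a fresh checkout, terraform init has to be run before destroy as well.
redeploy_cache() {
  run terraform "-chdir=$1" destroy -target=aws_s3_bucket.S3CacheBucket &&
  run terraform "-chdir=$1" init &&
  run terraform "-chdir=$1" apply
}
```

Calling `redeploy_cache ./terraform` prints the three commands for review. With `DRY_RUN=0`, Terraform still asks for interactive confirmation before destroying or applying.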