Release Process - cloudigrade/cloudigrade GitHub Wiki
This document explains the release process for cloudigrade. cloudigrade relies on app-interface to deploy our code to the ephemeral, stage, and prod environments.
Our release process largely implements Continuous Delivery with App-SRE. Please read that document and related docs in app-interface for a broader understanding of this process and the related technologies.
Clusters:
- ephemeral is used for all ephemeral deployments, such as our pr_checks and local dev work.
- cloudigrade-stage is our "stage" deployment in the OSDv4 crcs02ue1.urby.p1 cluster.
- cloudigrade-prod is our "prod" deployment in the OSDv4 crcp01ue1.o9m8.p1 cluster.
Automatic release process
To kick off a release, simply merge new code into master. At that point, a new image will be built and uploaded to Quay, and it'll be deployed to stage.
After the stage deployment completes, IQE smoke tests will automatically run against the stage deployment.
After the IQE smoke tests pass in stage, the stage version is automatically promoted and released to prod.
Detailed explanation of automated release process
stage deployment
- deploy-clowder.yml's stage namespace defines an upstream dependency on the cloudigrade-cloudigrade-gh-build-master job.
- cloudigrade-cloudigrade-gh-build-master job is triggered by a GitHub webhook (configured according to this AppSRE Onboarding doc) that is managed by app-interface.
- deploy-clowder.yml's stage namespace's ref is "master". This ref determines what version of clowdapp.yaml will be used to deploy to stage.
- openshift-saas-deploy pipeline runs with a name like cloudigrade-clowder-insights-stage-*.
- Success/failure message is posted to #cloudigrade-cicd-events in CoreOS Slack.
- Deployment rolls out to cloudigrade-stage.
IQE smoke test in stage
- stage-smoke.yaml defines a ClowdJobInvocation for running stage smoke tests.
  - Note dynaconfEnvName:clowder_stage_smoke and marker:'smoke'.
    - This requires iqe-cloudmeter-plugin's cloudmeter.default.yml to have a definition with the same name and for relevant tests to be marked with smoke.
- deploy-clowder.yml's stage namespace publishes to cloudigrade-stage-deploy-success-channel.
- test.yml
  - defines a resource that uses stage-smoke.yaml as its template.
  - subscribes to cloudigrade-stage-deploy-success-channel.
  - publishes to cloudigrade-stage-post-deploy-tests-success-channel.
- The test job triggers an openshift-saas-deploy pipeline run with a name like cloudigrade-test-insights-stage-*.
- While that pipeline is running, a pod with a name like cloudigrade-smoke-*-iqe-* runs in cloudigrade-stage. Its lifetime is short: it exists only for the duration of the test and is deleted upon completion.
- Success/failure message is posted to #cloudigrade-cicd-events in CoreOS Slack.
production deployment
- deploy-clowder.yml's production namespace subscribes to cloudigrade-stage-post-deploy-tests-success-channel with auto:true.
- The App SRE bot creates an app-interface merge request (like this). The diff should only update deploy-clowder.yml's production namespace's ref.
- Once required MR checks pass, the bot automatically merges the MR.
  - Although the bot may post a comment like "changes to saas file 'cloudigrade-clowder' require approval (/lgtm)" to the MR, the presence of the bot/automerge label appears to override that requirement.
- openshift-saas-deploy pipeline runs with a name like cloudigrade-clowder-insights-production-*.
- Success/failure message is posted to #cloudigrade-cicd-events in CoreOS Slack.
- Deployment rolls out to cloudigrade-prod.
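The publish/subscribe chain described above can be pictured with a small toy model. This is only a sketch for building intuition, not how app-interface is implemented: the two channel names come from the text above, while the Target class, the run() helper, and the step names are hypothetical.

```python
from dataclasses import dataclass, field

# Channel names are taken from the text above. Everything else here
# (Target, run, the step names) is hypothetical and for illustration only.
STAGE_DEPLOY_OK = "cloudigrade-stage-deploy-success-channel"
STAGE_TESTS_OK = "cloudigrade-stage-post-deploy-tests-success-channel"


@dataclass
class Target:
    name: str
    subscribe: list = field(default_factory=list)
    publish: list = field(default_factory=list)


# Listed in dependency order: each step subscribes only to earlier steps.
TARGETS = [
    Target("stage deploy", publish=[STAGE_DEPLOY_OK]),
    Target("stage smoke test", subscribe=[STAGE_DEPLOY_OK], publish=[STAGE_TESTS_OK]),
    Target("production deploy", subscribe=[STAGE_TESTS_OK]),
]


def run(targets, failing=()):
    """Return the names of the targets that succeeded, in order."""
    published = set()
    succeeded = []
    for target in targets:
        # A target is only triggered once every channel it subscribes to has fired.
        if not all(channel in published for channel in target.subscribe):
            continue
        # A failed step publishes nothing, so downstream steps never start.
        if target.name in failing:
            continue
        succeeded.append(target.name)
        published.update(target.publish)
    return succeeded


if __name__ == "__main__":
    print(run(TARGETS))
    print(run(TARGETS, failing={"stage smoke test"}))
```

In this model, a failing smoke test leaves the second channel unpublished, which is why a red smoke test run in stage means no promotion MR appears for production.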
What is our cadence for updating stage and prod?
Whenever code lands in the master branch, it is considered tested and stable, and a build_master job is kicked off that produces an image tagged with the hash of the master branch. If all goes normally, within about an hour that change should have made its way to production.
This could mean multiple times a day or maybe not even once a week. It all depends on when new code merges.
Troubleshooting failing PR checks
- Check recent cloudigrade-cloudigrade-pr-check jobs. A link to the specific job instance should have been posted to the GitHub PR.
- Note that cloudigrade-cloudigrade-pr-check runs only one job at a time, so multiple PRs may be waiting in a queue of jobs.
- If you are actively pushing more changes to your branch while cloudigrade-cloudigrade-pr-check is running for your PR on a previous version, please find and cancel any old jobs for your PR so as not to waste time.
- PR check jobs consume an ephemeral environment. Try checking availability locally with bonfire namespace list.
Troubleshooting automatic promotions, smoke tests, and deployments
- Check the #cloudigrade-cicd-events in CoreOS Slack for recent messages.
- If one of the steps has failed unexpectedly, find the relevant openshift-saas-deploy pipeline run.
  - Check for any interesting logs.
  - If it looks like a flaky environment issue, try restarting the pipeline run.
  - If logs mention a pod with a name like cloudigrade-smoke-*-iqe-*, try finding it in cloudigrade-stage before it disappears.
  - If the pod may be stuck in a bad way, determine if it needs to be forcibly destroyed.
- IQE smoke test results are sent to Ibutsu.
  - Set the active project at the top of the page to Insights QE.
  - Try searching test results with component=cloudmeter and env=stage with a smoke marker (can't filter on that in the search, though).
- If IQE smoke tests appear to be failing for reasons unrelated to our code changes, contact @parag in #cloudmeter-dev or ask for help in #forum-consoledot-qe Ansible Slack.
- Is stage not deploying after you merged code changes? Look at o-openshift-saas-deploy-cloudigrade-clowder recent runs and the pods associated with them. You might need to rerun one manually. If the deploy pods are OOM killed, you may need to bump their resources again.
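When scanning Slack messages or logs, it can help to map a pipeline run or pod name back to the step it belongs to. The sketch below does that with the name patterns mentioned on this page. The classify() helper and the labels are hypothetical, and the sample names in the usage lines are made up for illustration.

```python
from fnmatch import fnmatchcase

# Glob patterns copied from this page; the labels are informal descriptions
# and the classify() helper is a hypothetical convenience, not an existing tool.
PATTERNS = {
    "cloudigrade-clowder-insights-stage-*": "stage deployment pipeline run",
    "cloudigrade-test-insights-stage-*": "stage smoke test pipeline run",
    "cloudigrade-clowder-insights-production-*": "production deployment pipeline run",
    "cloudigrade-smoke-*-iqe-*": "IQE smoke test pod in cloudigrade-stage",
}


def classify(name):
    """Return a label for the first matching pattern, or None if nothing matches."""
    for pattern, label in PATTERNS.items():
        # fnmatchcase does case-sensitive shell-style matching on all platforms.
        if fnmatchcase(name, pattern):
            return label
    return None


if __name__ == "__main__":
    # Made-up names, only to show the shape of the lookup.
    for sample in ("cloudigrade-test-insights-stage-abc12", "cloudigrade-smoke-abc-iqe-xyz"):
        print(sample, "->", classify(sample))
```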
What if there is a problem deploying to stage or production we cannot resolve?
Contact @crc-devprod-team in the #forum-clouddot channel of the CoreOS Slack.
What if there is a problem with production?
Please see Escalation Procedures - cloud.redhat.com.