Release Process - cloudigrade/cloudigrade GitHub Wiki

This document explains the release process for cloudigrade. cloudigrade relies on app-interface to deploy our code to the ephemeral, stage, and prod environments.

Our release process largely implements Continuous Delivery with App-SRE. Please read that document and related docs in app-interface for a broader understanding of this process and the related technologies.

Clusters:

ephemeral is used for all ephemeral deployments, such as our pr_checks and local dev work.
cloudigrade-stage is our "stage" deployment in the OSDv4 crcs02ue1.urby.p1 cluster
cloudigrade-prod is our "prod" deployment in the OSDv4 crcp01ue1.o9m8.p1 cluster

Automatic release process

To kick off a release, simply merge new code into master. At that point, a new image will be built and uploaded to Quay, and it'll be deployed to stage.

After the stage deployment completes, IQE smoke tests will automatically run against the stage deployment.

After the IQE smoke tests pass in stage, the stage version is automatically promoted and released to prod.

Detailed explanation of automated release process

stage deployment

deploy-clowder.yml's stage namespace defines an upstream dependency on the cloudigrade-cloudigrade-gh-build-master job.
cloudigrade-cloudigrade-gh-build-master job is triggered by a GitHub webhook (configured according to this AppSRE Onboarding doc) that is managed by app-interface.
deploy-clowder.yml's stage namespace's ref is "master". This ref determines what version of clowdapp.yaml will be used to deploy to stage.
openshift-saas-deploy pipeline runs with a name like cloudigrade-clowder-insights-stage-*.
Success/failure message is posted to #cloudigrade-cicd-events in CoreOS Slack.
Deployment rolls out to cloudigrade-stage.

IQE smoke test in stage

stage-smoke.yaml defines a ClowdJobInvocation for running stage smoke tests.
- Note dynaconfEnvName:clowder_stage_smoke and marker:'smoke'.
- This requires iqe-cloudmeter-plugin's cloudmeter.default.yml to have a definition with the same name and for relevant tests to be marked with smoke.
deploy-clowder.yml's stage namespace publishes to cloudigrade-stage-deploy-success-channel.
test.yml
- defines a resource that uses stage-smoke.yaml as its template.
- subscribes to cloudigrade-stage-deploy-success-channel.
- publishes to cloudigrade-stage-post-deploy-tests-success-channel.
The test job triggers an openshift-saas-deploy pipeline run with a name like cloudigrade-test-insights-stage-*.
While that pipeline is running, a pod with a name like cloudigrade-smoke-*-iqe-* runs in cloudigrade-stage. The pod is short-lived: it exists only for the duration of the test and is deleted upon completion.
Success/failure message is posted to #cloudigrade-cicd-events in CoreOS Slack.

production deployment

deploy-clowder.yml's production namespace subscribes to cloudigrade-stage-post-deploy-tests-success-channel with auto:true.
The App SRE bot creates an app-interface merge request (like this). The diff should only update deploy-clowder.yml's production namespace's ref.
Once required MR checks pass, the bot automatically merges the MR.
- Although the bot may post a comment like "changes to saas file 'cloudigrade-clowder' require approval (/lgtm)" to the MR, the presence of the bot/automerge label appears to override that requirement.
openshift-saas-deploy pipeline runs with a name like cloudigrade-clowder-insights-production-*.
Success/failure message is posted to #cloudigrade-cicd-events in CoreOS Slack.
Deployment rolls out to cloudigrade-prod.
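
The chain of channels described in the three subsections above can be easier to follow as a small simulation. The following is a toy model only, not app-interface code: the two channel names come from the steps above, while PromotionBus, simulate_release, the step labels, and the ref "abc1234" are hypothetical names made up for illustration. It assumes the stage deployment itself succeeded.

```python
from collections import defaultdict

# Channel names are taken from the steps above. The class, function, and step
# labels are hypothetical and exist only for this illustration.
STAGE_DEPLOY_OK = "cloudigrade-stage-deploy-success-channel"
STAGE_TESTS_OK = "cloudigrade-stage-post-deploy-tests-success-channel"


class PromotionBus:
    """Toy publish/subscribe bus; this is not app-interface code."""

    def __init__(self):
        self._subscribers = defaultdict(list)
        self.history = []  # (step, ref, succeeded) in the order steps ran

    def subscribe(self, channel, step, publishes_to=None, succeeds=True):
        """Register a step that runs when something is published to channel."""
        self._subscribers[channel].append((step, publishes_to, succeeds))

    def publish(self, channel, ref):
        """Run every subscriber of channel; successful steps publish onward."""
        for step, publishes_to, succeeds in self._subscribers[channel]:
            self.history.append((step, ref, succeeds))
            if succeeds and publishes_to is not None:
                self.publish(publishes_to, ref)


def simulate_release(ref, smoke_tests_pass=True):
    """Trace one master commit through stage, smoke tests, and production."""
    bus = PromotionBus()
    # test.yml: subscribes to the stage deploy channel, publishes on success.
    bus.subscribe(STAGE_DEPLOY_OK, "smoke tests in stage",
                  publishes_to=STAGE_TESTS_OK, succeeds=smoke_tests_pass)
    # deploy-clowder.yml production namespace: subscribes with auto:true.
    bus.subscribe(STAGE_TESTS_OK, "deploy to production")
    # deploy-clowder.yml stage namespace: assume the stage deploy succeeded.
    bus.history.append(("deploy to stage", ref, True))
    bus.publish(STAGE_DEPLOY_OK, ref)
    return [step for step, _, _ in bus.history]


print(simulate_release("abc1234"))
print(simulate_release("abc1234", smoke_tests_pass=False))
```

The first call walks through all three steps. The second stops after the smoke tests, because a failed test run never publishes to the channel that the production namespace subscribes to.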

What is our cadence for updating stage and prod?

Code that lands in the master branch is considered tested and stable. Each merge kicks off a build_master job that produces an image tagged with the hash of the master branch. If all goes well, the change should reach production within about an hour.

This could mean multiple times a day or maybe not even once a week. It all depends on when new code merges.

Troubleshooting failing PR checks

Check recent cloudigrade-cloudigrade-pr-check jobs. A link to the specific job instance should have been posted to the GitHub PR.
Note that cloudigrade-cloudigrade-pr-check runs only one job at a time, so multiple PRs may be waiting in a queue of jobs.
If you are actively pushing more changes to your branch while cloudigrade-cloudigrade-pr-check is still running against a previous version of your PR, please find and cancel any old jobs for your PR so as not to waste time.
PR check jobs consume an ephemeral environment. Try checking availability locally with bonfire namespace list.
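
If you would rather run that availability check from a script, a minimal sketch follows. It only shells out to the bonfire namespace list command mentioned above and returns the raw text; the output format is not assumed, so nothing is parsed. The function name list_ephemeral_namespaces and the runner parameter are hypothetical and exist only for illustration.

```python
import shutil
import subprocess


def list_ephemeral_namespaces(runner=subprocess.run):
    """Run `bonfire namespace list` and return its raw text output.

    The output format is not assumed here, so nothing is parsed.
    The runner parameter is only a hook for substituting a fake in tests.
    """
    result = runner(
        ["bonfire", "namespace", "list"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        raise RuntimeError(
            f"bonfire exited with {result.returncode}: {result.stderr.strip()}"
        )
    return result.stdout


if __name__ == "__main__":
    if shutil.which("bonfire") is None:
        print("bonfire is not on PATH; install it or activate its environment first")
    else:
        print(list_ephemeral_namespaces())
```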

Troubleshooting automatic promotions, smoke tests, and deployments

Check the #cloudigrade-cicd-events channel in CoreOS Slack for recent messages.
If one of the steps has failed unexpectedly, find the relevant openshift-saas-deploy pipeline run.
- Check for any interesting logs.
- If it looks like a flaky environment issue, try restarting the pipeline run.
- If logs mention a pod with a name like cloudigrade-smoke-*-iqe-*, try finding it in cloudigrade-stage before it disappears.
- If the pod may be stuck in a bad way, determine if it needs to be forcibly destroyed.
IQE smoke test results are sent to Ibutsu.
- Set the active project at the top of the page to Insights QE.
- Try searching test results with component=cloudmeter and env=stage with a smoke marker (can't filter on that in the search, though).
If IQE smoke tests appear to be failing for reasons unrelated to our code changes, contact @parag in #cloudmeter-dev or ask for help in #forum-consoledot-qe in Ansible Slack.
Is stage not deploying after you merged code changes? Look at recent o-openshift-saas-deploy-cloudigrade-clowder runs and the pods associated with them. You might need to rerun one manually. If the deploy pods are OOM killed, you may need to bump their resources again.
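
When scanning a list of openshift-saas-deploy pipeline runs, it can help to map each run name back to the step it belongs to. The sketch below does that with the three name patterns given in the detailed explanation above. The function name classify_pipeline_run and the sample run names are hypothetical; the real suffixes after the final hyphen will differ.

```python
from fnmatch import fnmatchcase

# Patterns are the "name like ..." globs from the detailed explanation above.
RUN_NAME_PATTERNS = [
    ("cloudigrade-clowder-insights-stage-*", "stage deployment"),
    ("cloudigrade-test-insights-stage-*", "IQE smoke test in stage"),
    ("cloudigrade-clowder-insights-production-*", "production deployment"),
]


def classify_pipeline_run(run_name):
    """Return the release step a pipeline run name belongs to, or None."""
    for pattern, step in RUN_NAME_PATTERNS:
        if fnmatchcase(run_name, pattern):
            return step
    return None


# Hypothetical run names, for illustration only.
for name in [
    "cloudigrade-clowder-insights-stage-x1y2z",
    "cloudigrade-test-insights-stage-a9b8c",
    "cloudigrade-clowder-insights-production-q7w6e",
    "some-other-service-insights-stage-00000",
]:
    print(name, "->", classify_pipeline_run(name))
```

A name that matches none of the patterns returns None, which is a hint that the run belongs to a different service or that the naming has changed since this page was written.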

What if there is a problem deploying to stage or production we cannot resolve?

Contact @crc-devprod-team in the #forum-clouddot channel of the CoreOS Slack.

What if there is a problem with production?

Please see Escalation Procedures - cloud.redhat.com.