CI CD in Guzzle - ja-guzzle/guzzle_docs GitHub Wiki

Motivation

We want to allow users not just to build data pipelines faster using Guzzle's different job types (Ingestion, Processing, etc.) but also to effectively integrate changes from multiple streams, test and regression-test them, and then deploy them.

Background

  1. Guzzle provides tight integration with Git
  2. Guzzle bundles a test automation framework/library
  3. A typical data project has complex packaging, deployment and data loading requirements - we want to focus initially on creating a Guzzle package that can be promoted from one environment to another
  4. The DevOps stack, change management processes and operations processes vary from customer to customer - however, we want to prescribe one blueprint for a given stack, typically for Guzzle's cloud customers on Azure, with the assumption that parts of it should be repeatable for on-premise deployments

References

  1. This one talks about the challenges of CI/CD for data projects: https://medium.com/90seconds/continuous-integration-and-deployment-for-data-pipelines-at-90-seconds-53bf10521ea7
  2. We should look at how this is supported in other stacks, for example ADF
  3. We should look at inheriting relevant best practices from how CI/CD is done in the non-data world, for example for apps built using Java
  4. Some relevant discussion, though not a whole lot: https://www.reddit.com/r/devops/comments/88c5ec/cicd_for_etl_and_dw/
  5. This link is not relevant, as they are using Jenkins to orchestrate the jobs: https://techblog.livongo.com/jenkins-etl/ (ast
  6. A good read on the stack that Stitch uses. Stitch, like Fivetran, provides data integration and replication from cloud apps as a service: https://dzone.com/articles/the-tools-we-used-to-build-our-etl-pipeline-platfo
  7. https://www.concentra.co.uk/resources/articles/continuous-integration-in-data-warehouse-development/
  8. This is more of a marketing showcase, but it does discuss a real project: https://www.infosys.com/IT-services/validation-solutions/white-papers/Documents/seven-step-framework-CICD-ETL-testing.pdf

Coverage

Let's come up with a to-be process that can be followed as part of Guzzle deployments at medium and large enterprises that want to automate their deployments:

  1. Deploying incremental or full packages from Env A to Env B - as far as possible, the packages are generated in Git by a privileged user, with a tag created in the repo
  2. Test automation as part of CI pipelines
  3. Validation and checks as part of CD pipelines
  4. Support for a manual workflow for both CI and CD pipelines
  5. Ability to handle custom scripts, both those which are re-runnable (example: creating stored procedures) and those which are not re-runnable (like create/alter table, or a one-time data load into config and data tables).
  6. Cleanup/rollback scripts: to restore the Guzzle env, including all the "default" and "instance" values, to the old values; for custom scripts, both re-runnable and non-re-runnable, we can suggest a simple backup and restore of the DB (or an explicit script bundled by the development team)
  7. Generating the package and staging it for subsequent hand-off
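
The distinction between re-runnable and non-re-runnable custom scripts could be handled roughly as follows. This is only a minimal sketch, not Guzzle functionality: the `Script` type, the `scripts_to_run` and `deploy` functions, the file names, and the idea of keeping a set of already-applied script names per environment are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Script:
    """A custom script shipped in a package (hypothetical structure)."""
    name: str
    rerunnable: bool  # True: e.g. create stored procedure; False: e.g. alter table


def scripts_to_run(package, already_applied):
    """Select scripts for this deployment, preserving package order.

    Re-runnable scripts are always selected; one-time scripts are selected
    only if their name is not yet recorded for the target environment.
    """
    return [s for s in package if s.rerunnable or s.name not in already_applied]


def deploy(package, already_applied, execute):
    """Run the selected scripts and return the updated set of applied one-time scripts."""
    applied = set(already_applied)
    for script in scripts_to_run(package, applied):
        execute(script)  # caller-supplied: e.g. run the SQL against the target DB
        if not script.rerunnable:
            applied.add(script.name)
    return applied


# Usage: deploying the same package twice to one environment.
package = [
    Script("001_create_tables.sql", rerunnable=False),
    Script("002_stored_procedures.sql", rerunnable=True),
]
first_run, second_run = [], []
applied = deploy(package, set(), lambda s: first_run.append(s.name))
applied = deploy(package, applied, lambda s: second_run.append(s.name))
```

On the first deployment both scripts run; on the second only the re-runnable one does. In practice the applied set would need to be persisted per environment (for example in a config table), and a backup taken before `deploy` would cover the rollback case.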