CI CD in Guzzle - ja-guzzle/guzzle_docs GitHub Wiki
We want to allow users not just to build data pipelines faster using Guzzle's different job types (Ingestion, Processing, etc.), but also to effectively integrate changes from multiple streams, test and regress them, and then deploy them.
- Guzzle provides tight integration with Git
- Guzzle bundles a test automation framework/library
- A typical data project has complex packaging, deployment and data loading requirements. We want to focus initially on creating a Guzzle package that can be promoted from one environment to another
- DevOps stacks, change management processes and operations processes vary from customer to customer. However, we want to prescribe one blueprint for a given stack, typically for Guzzle's cloud customers on Azure, with the assumption that parts of it should be repeatable for on-premise deployments
- This one talks about the challenges of CI/CD for data projects: https://medium.com/90seconds/continuous-integration-and-deployment-for-data-pipelines-at-90-seconds-53bf10521ea7
- We should look at how this is supported for other stacks, such as ADF (Azure Data Factory)
- We should look at adopting relevant best practices from how CI/CD is done outside the data world, for example for applications built in Java.
- Some relevant discussion, though not a whole lot: https://www.reddit.com/r/devops/comments/88c5ec/cicd_for_etl_and_dw/
- This link is not applicable, as they are using Jenkins to orchestrate the jobs: https://techblog.livongo.com/jenkins-etl/ (ast
- A good read on the stack that Stitch uses. Stitch, like Fivetran, provides data integration and replication from cloud apps as a service: https://dzone.com/articles/the-tools-we-used-to-build-our-etl-pipeline-platfo
- https://www.concentra.co.uk/resources/articles/continuous-integration-in-data-warehouse-development/
- This is more of a marketing showcase, but it does discuss a real project: https://www.infosys.com/IT-services/validation-solutions/white-papers/Documents/seven-step-framework-CICD-ETL-testing.pdf
Let's come up with a to-be process that can be followed as part of Guzzle deployments at medium and large enterprises that want to automate deployment
- Handling deployment of incremental or full packages from Env A to Env B. As far as possible, packages are generated from Git by a privileged user, with a tag created in the repo
- Test automation as part of CI pipelines
- Validation and checks as part of CD pipeline
- Support for manual workflows for both CI and CD pipelines
- Ability to handle custom scripts, both those that are re-runnable (for example, creating stored procedures) and those that are not re-runnable (such as create/alter table, or one-time data loads into config and data tables).
- Cleanup/rollback scripts: to restore the Guzzle environment, with all its "default" and "instance" values, to the old values. For custom scripts, both re-runnable and non-re-runnable, we can suggest a simple backup and restore of the DB (or an explicit script bundled by the development team)
- Generating the package and staging it for subsequent hand-off
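
One way the re-runnable versus non-re-runnable script requirement above could be handled is with a small ledger of one-time scripts already applied in the target environment. The sketch below is only an illustration under assumptions: the function names (`plan_scripts`, `record_applied`), the tuple layout and the in-memory ledger are hypothetical and are not part of Guzzle; a real deployment would more likely keep the ledger in a table in the target database.

```python
import hashlib


def checksum(content: str) -> str:
    """SHA-256 of a script body, used to detect edits to already-applied scripts."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def plan_scripts(scripts, ledger):
    """Return the names of the scripts that should run in this deployment.

    scripts: list of (name, content, rerunnable) tuples, in execution order.
    ledger:  dict mapping one-time script name -> checksum recorded when it
             was applied in the target environment (hypothetical structure).
    """
    to_run = []
    for name, content, rerunnable in scripts:
        if rerunnable:
            # Re-runnable scripts (e.g. stored procedure definitions) run every time.
            to_run.append(name)
        elif name not in ledger:
            # One-time scripts (e.g. create/alter table) run only if never applied.
            to_run.append(name)
        elif ledger[name] != checksum(content):
            # An applied one-time script was edited; refuse rather than guess.
            raise ValueError(f"one-time script {name!r} changed after it was applied")
    return to_run


def record_applied(scripts, applied_names, ledger):
    """Return a new ledger that also records the one-time scripts just applied."""
    updated = dict(ledger)
    for name, content, rerunnable in scripts:
        if name in applied_names and not rerunnable:
            updated[name] = checksum(content)
    return updated


if __name__ == "__main__":
    package = [
        ("001_create_table.sql", "CREATE TABLE t (id INT);", False),
        ("proc_load.sql", "CREATE OR REPLACE PROCEDURE load_t ...", True),
    ]
    ledger = {}
    first = plan_scripts(package, ledger)
    ledger = record_applied(package, first, ledger)
    print("first deployment:", first)
    print("second deployment:", plan_scripts(package, ledger))
```

Raising an error when an applied one-time script has changed is a deliberate choice in this sketch: it pushes the development team to ship a new script (or an explicit rollback script) instead of silently re-running or skipping a modified one.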