Workflows
Cloud Workflows lets you orchestrate and automate Google Cloud and HTTP-based API services with serverless workflows.
There are many kinds of tools for implementing different types of workflows.
Google Cloud’s first general-purpose workflow orchestration tool was Cloud Composer. Based on Apache Airflow, Cloud Composer is great for data engineering pipelines like ETL orchestration, big data processing, or machine learning workflows, and integrates well with data products like BigQuery or Dataflow. Cloud Composer is a natural choice if your workflow needs to run a series of jobs in a data warehouse or big data cluster, and save results to a storage bucket.
If you want to process events or chain APIs in a serverless way, or have workloads that are bursty or latency-sensitive, it may be better to use Workflows.
https://cloud.google.com/blog/products/application-development/get-to-know-google-cloud-workflows
https://medium.com/google-cloud/gcp-cloud-workflows-orchestrate-in-declarative-way-3cfacda25028
Orchestration vs Choreography
Google Cloud provides services supporting both Orchestration and Choreography approaches.
Pub/Sub and Eventarc are both suited for choreography of event-driven services, whereas Workflows is suited for centrally orchestrated services.
Workflows is a service to orchestrate not only Google Cloud services, such as Cloud Functions and Cloud Run, but also external services. Should there be a central orchestrator controlling all interactions between services or should each service work independently and only interact through events? This is the central question in the Orchestration vs Choreography debate.
In Orchestration, a central service defines and controls the flow of communication between services. With centralization, it becomes easier to change and monitor the flow and apply consistent timeout and error policies.
In Choreography, each service registers for and emits events as needed. There’s usually a central event broker to pass messages around, but it does not define or direct the flow of communication. This keeps services truly independent, at the expense of a flow and policies that are harder to trace and manage.
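As a rough sketch of the orchestration style, the workflow below calls two services in sequence and feeds the first response into the second, keeping flow control, authentication, and error handling in one central place. The Cloud Run URLs and payload fields are placeholders.

```yaml
# Minimal orchestration sketch: one workflow controls the call order
# between two services. The service URLs and fields are placeholders.
main:
  steps:
    - callOrderService:
        call: http.post
        args:
          url: https://order-service-xyz-uc.a.run.app/orders   # placeholder
          auth:
            type: OIDC
          body:
            customer_id: "42"
        result: orderResponse
    - callShippingService:
        call: http.post
        args:
          url: https://shipping-service-xyz-uc.a.run.app/ship  # placeholder
          auth:
            type: OIDC
          body:
            order_id: ${orderResponse.body.id}
        result: shippingResponse
    - finish:
        return: ${shippingResponse.body}
```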
Connectors
Workflows publishes connectors that make it easier to access other Google Cloud products from within a workflow. Connectors let you call other Google Cloud APIs as workflow steps, helping you integrate your workflows with other Google Cloud products. For example, you can use connectors to publish Pub/Sub messages, read or write data in a Firestore database, or retrieve authentication keys from Secret Manager.
https://cloud.google.com/workflows/docs/reference/googleapis
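As an illustration, a connector call that publishes a Pub/Sub message might look like this sketch (project and topic names are placeholders):

```yaml
# Sketch: publish a message to a Pub/Sub topic through the connector.
# Replace my-project and my-topic with real names.
main:
  steps:
    - publishMessage:
        call: googleapis.pubsub.v1.projects.topics.publish
        args:
          topic: projects/my-project/topics/my-topic
          body:
            messages:
              - data: ${base64.encode(text.encode("hello from Workflows"))}
        result: publishResult
    - done:
        return: ${publishResult.messageIds}
```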
Workflows syntax cheatsheet
https://cloud.google.com/workflows/docs/reference/syntax/syntax-cheat-sheet
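A few of the core constructs from the cheat sheet (params, assign, call, switch, return) combined into one small workflow; the endpoint URL and the shape of its response are assumptions:

```yaml
# Small workflow exercising common syntax: params, variables, an HTTP
# call, a conditional jump, and return values.
main:
  params: [input]
  steps:
    - init:
        assign:
          - threshold: 10
    - fetchNumber:
        call: http.get
        args:
          url: https://example.com/api/number   # placeholder endpoint
        result: numberResult
    - check:
        switch:
          - condition: ${numberResult.body.value > threshold}
            next: returnBig
    - returnSmall:
        return: "small number"
    - returnBig:
        return: "big number"
```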
Connectors Samples
https://cloud.google.com/workflows/docs/connectors-samples
Connectors Workflows Samples on GitHub
https://github.com/GoogleCloudPlatform/workflows-samples/tree/main/src/connectors
Connectors Workflows demos
https://github.com/GoogleCloudPlatform/workflows-demos/tree/master/connector-compute
Callbacks
https://cloud.google.com/blog/topics/developers-practitioners/introducing-workflows-callbacks
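The basic callback pattern is to create a callback endpoint, hand its URL to an external system (here it is only logged), and then pause the execution until that endpoint is called or a timeout expires. A minimal sketch:

```yaml
# Sketch: create a callback endpoint, log its URL, then wait until an
# external system POSTs to it (or the one-hour timeout expires).
main:
  steps:
    - createCallback:
        call: events.create_callback_endpoint
        args:
          http_callback_method: "POST"
        result: callbackDetails
    - logCallbackUrl:
        call: sys.log
        args:
          text: ${"Waiting for callback at " + callbackDetails.url}
    - awaitCallback:
        call: events.await_callback
        args:
          callback: ${callbackDetails}
          timeout: 3600
        result: callbackRequest
    - processCallback:
        return: ${callbackRequest.http_request.body}
```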
Replicate data from BigQuery to Cloud SQL using Cloud Workflows
https://medium.com/google-cloud/replicate-data-from-bigquery-to-cloud-sql-2b23a08c52b1
Parallel Steps for Workflows
https://cloud.google.com/workflows/docs/execute-parallel-steps
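A minimal sketch of parallel branches: two hypothetical endpoints are fetched concurrently and their results merged into a shared map (the `shared` declaration is what allows both branches to write to `results`):

```yaml
# Sketch: fetch two placeholder endpoints concurrently and merge the
# results into a shared map.
main:
  steps:
    - init:
        assign:
          - results: {}
    - fetchInParallel:
        parallel:
          shared: [results]
          branches:
            - usersBranch:
                steps:
                  - getUsers:
                      call: http.get
                      args:
                        url: https://example.com/api/users   # placeholder
                      result: usersResp
                  - saveUsers:
                      assign:
                        - results.users: ${usersResp.body}
            - ordersBranch:
                steps:
                  - getOrders:
                      call: http.get
                      args:
                        url: https://example.com/api/orders  # placeholder
                      result: ordersResp
                  - saveOrders:
                      assign:
                        - results.orders: ${ordersResp.body}
    - returnAll:
        return: ${results}
```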
Workflows Tips
https://medium.com/google-cloud/workflows-tipsn-tricks-d7196eb5098d
Other tools related to workflows
Google Cloud Composer
Cloud Composer is a fully managed workflow orchestration service built on Apache Airflow.
Google Dataproc Workflow
The Dataproc WorkflowTemplates API provides a flexible and easy-to-use mechanism for managing and executing workflows. A Workflow Template is a reusable workflow configuration. It defines a graph of jobs with information on where to run those jobs.
Dataflow
Dataflow is a managed service for executing a wide variety of data processing patterns.
Kubeflow & AI Hub
Kubeflow Pipelines is a platform for building and deploying portable, scalable machine learning (ML) workflows based on containers. AI Hub has support for Kubeflow Pipelines.
Workflow Tools Comparison and Alternatives
A survey of pipeline & workflow tools.
Choosing the right Orchestrator for Google Cloud
Orchestration often refers to the automated configuration, coordination, and management of computer systems and services.
In the context of service-oriented architectures, orchestration can range from executing a single service at a specific time and day, to a more sophisticated approach of automating and monitoring multiple services over longer periods of time, with the ability to react and handle failures as they crop up. In the data engineering context, orchestration is central to coordinating the services and workflows that prepare, ingest, and transform data. It can go beyond data processing and also involve a workflow to train a machine learning (ML) model from the data.
Google Cloud Platform offers a number of tools and services for orchestration:
- Cloud Scheduler for schedule-driven single-service orchestration
- Workflows for complex multi-service orchestration
- Cloud Composer for orchestration of your data workloads
Saga pattern to deal with failures in Workflows
https://cloud.google.com/blog/topics/developers-practitioners/implementing-saga-pattern-workflows
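The core of the pattern is a try/except block whose except branch runs compensating calls before re-raising the error. A sketch with placeholder service URLs:

```yaml
# Sketch of a compensating transaction: if charging the payment fails,
# the hotel reservation made earlier is cancelled before the error is
# re-raised. Service URLs are placeholders.
main:
  steps:
    - bookTrip:
        try:
          steps:
            - reserveHotel:
                call: http.post
                args:
                  url: https://hotel-service.example.com/reserve
                  auth:
                    type: OIDC
                result: hotelResp
            - chargePayment:
                call: http.post
                args:
                  url: https://payment-service.example.com/charge
                  auth:
                    type: OIDC
                  body:
                    reservation_id: ${hotelResp.body.id}
                result: paymentResp
        except:
          as: e
          steps:
            - compensateHotel:
                call: http.post
                args:
                  url: https://hotel-service.example.com/cancel
                  auth:
                    type: OIDC
            - reraise:
                raise: ${e}
    - done:
        return: "trip booked"
```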
Send email using Sendgrid in Workflows
https://glaforge.appspot.com/article/sending-an-email-with-sendgrid-from-workflows
Loading JSON data from GCS in Workflows
https://glaforge.appspot.com/article/load-and-use-json-data-in-your-workflow-from-gcs
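The general idea, sketched below with placeholder bucket and object names, is to fetch the object through the Cloud Storage JSON API with `alt=media` and OAuth2 authentication; if the object is served as JSON, the parsed body can be used directly as a map in the workflow:

```yaml
# Sketch: read a small JSON object from Cloud Storage via the JSON API.
# Bucket and object names are placeholders; the object needs an
# application/json content type for the body to be parsed automatically.
main:
  steps:
    - readJson:
        call: http.get
        args:
          url: https://storage.googleapis.com/storage/v1/b/my-bucket/o/config.json?alt=media
          auth:
            type: OAuth2
        result: gcsResponse
    - useJson:
        return: ${gcsResponse.body}
```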
Using Secret Manager Connector for Workflows
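A sketch of reading a secret with the Secret Manager connector and decoding its base64 payload (project, secret, and version names are placeholders):

```yaml
# Sketch: fetch a secret with the Secret Manager connector and decode
# the base64 payload into a string.
main:
  steps:
    - getSecret:
        call: googleapis.secretmanager.v1.projects.secrets.versions.access
        args:
          name: projects/my-project/secrets/api-key/versions/latest
        result: secretResult
    - decodeSecret:
        assign:
          - apiKey: ${text.decode(base64.decode(secretResult.payload.data))}
    - useSecret:
        return: "secret loaded"   # avoid returning the secret value itself
```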
Data Orchestration using Workflows
Airflow vs. Luigi vs. Argo vs. MLFlow vs. Kubeflow
A comparison of Airflow, Luigi, Argo, MLflow, and Kubeflow.
Prefect
https://github.com/PrefectHQ/prefect
PyDag
Snakemake
https://github.com/snakemake/snakemake
Visual Editor
https://medium.com/@kolban1/gcp-workflows-visual-editor-9876fb1c823f
Bioinformatics workflow tools
https://github.com/danielecook/Awesome-Bioinformatics#workflow-managers
Pipelines
https://github.com/pditommaso/awesome-pipeline
Using Makefile for workflows
https://blog.sellorm.com/2018/06/02/first-steps-with-data-pipelines/
Make map-reduce pipeline
Reproducibility with Make
https://the-turing-way.netlify.app/reproducible-research/make.html
Make tutorial
https://github.com/kyclark/make-tutorial
Parallel
https://www.gnu.org/software/parallel/
Examples
Quickstarts
https://cloud.google.com/workflows/docs/create-workflow-console
Use Workflows with Cloud Run and Cloud Functions tutorial
https://cloud.google.com/workflows/docs/run/tutorial-cloud-run
Run a batch translation using the Cloud Translation connector
https://cloud.google.com/workflows/docs/tutorial-translation-connector
Use GCP Workflows to load data from GCS to BigQuery
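A sketch of this pattern using the BigQuery connector, which waits for the load job to finish before the workflow moves on (project, bucket, dataset, and table names are placeholders):

```yaml
# Sketch: start a BigQuery load job through the connector; the connector
# waits for the job to complete.
main:
  steps:
    - loadFromGcs:
        call: googleapis.bigquery.v2.jobs.insert
        args:
          projectId: my-project
          body:
            configuration:
              load:
                sourceUris:
                  - gs://my-bucket/exports/data.json
                destinationTable:
                  projectId: my-project
                  datasetId: my_dataset
                  tableId: my_table
                sourceFormat: NEWLINE_DELIMITED_JSON
                autodetect: true
                writeDisposition: WRITE_TRUNCATE
        result: loadJob
    - done:
        return: ${loadJob.status}
```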
Long-running jobs with Cloud Workflows
https://medium.com/google-cloud/long-running-job-with-cloud-workflows-38b57bea74a5
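A common approach is a poll-and-sleep loop: start the job, then periodically check its status until it reports completion. The job API, its URL, and the response fields below are assumptions:

```yaml
# Sketch of a poll-and-wait loop for a long-running job exposed over
# HTTP. The status endpoint and response shape are placeholders.
main:
  steps:
    - startJob:
        call: http.post
        args:
          url: https://example.com/api/jobs        # placeholder
          auth:
            type: OIDC
        result: startResp
    - checkStatus:
        call: http.get
        args:
          url: ${"https://example.com/api/jobs/" + startResp.body.id}
          auth:
            type: OIDC
        result: statusResp
    - decide:
        switch:
          - condition: ${statusResp.body.done == true}
            next: finished
    - wait:
        call: sys.sleep
        args:
          seconds: 30
        next: checkStatus
    - finished:
        return: ${statusResp.body}
```

Polling with sys.sleep keeps each individual HTTP call short, while the overall workflow execution can wait far longer than any single request timeout would allow.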
Reading and writing JSON files from a Workflow
ML Pipelines with Workflows
https://cloud.google.com/community/tutorials/ml-pipeline-with-workflows
Snapshot BigQuery dataset with Cloud Workflows
https://medium.com/google-cloud/bigquery-snapshot-dataset-with-cloud-workflow-5175eb8df00b
Large-scale bioinformatics in the cloud with GCP, Kubernetes, and Snakemake
A large metagenomics sequencing experiment: 96 10X Genomics linked-read libraries sequenced across 25 lanes on a HiSeq4000, processed in GCP.
Cromwell
https://medium.com/google-cloud/cromwell-hello-gcp-833c18df3caf
Analyzing Twitter sentiment
Store Workflows states in Firestore
https://medium.com/google-cloud/worklows-state-management-with-firestore-99237f08c5c5
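A sketch of writing execution state to a Firestore document keyed by the workflow execution ID, using the Firestore connector (the collection name is a placeholder):

```yaml
# Sketch: persist workflow state in a Firestore document keyed by the
# execution ID. The workflow-state collection name is a placeholder.
main:
  steps:
    - init:
        assign:
          - project: ${sys.get_env("GOOGLE_CLOUD_PROJECT_ID")}
          - executionId: ${sys.get_env("GOOGLE_CLOUD_WORKFLOW_EXECUTION_ID")}
    - saveState:
        call: googleapis.firestore.v1.projects.databases.documents.patch
        args:
          name: ${"projects/" + project + "/databases/(default)/documents/workflow-state/" + executionId}
          body:
            fields:
              status:
                stringValue: "RUNNING"
              updatedAt:
                timestampValue: ${time.format(sys.now())}
        result: savedDoc
    - done:
        return: ${savedDoc.name}
```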
Executing commands from Workflows
https://medium.com/google-cloud/executing-commands-gcloud-kubectl-from-workflows-ad6b85eaf39c
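One way to do this, roughly the pattern described in the article, is to submit a one-step Cloud Build build that runs the command inside the cloud-sdk image via the Cloud Build connector. A sketch, with the gcloud arguments as placeholders:

```yaml
# Sketch: run a gcloud command by submitting a one-step Cloud Build
# build from the workflow. The command arguments are placeholders.
main:
  steps:
    - runGcloud:
        call: googleapis.cloudbuild.v1.projects.builds.create
        args:
          projectId: ${sys.get_env("GOOGLE_CLOUD_PROJECT_ID")}
          body:
            steps:
              - name: gcr.io/google.com/cloudsdktool/cloud-sdk
                entrypoint: gcloud
                args: ["compute", "instances", "list"]
        result: build
    - done:
        return: ${build}
```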