Cloud Composer
Cloud Composer is a managed workflow orchestration service built on Apache Airflow that helps you create, schedule, monitor, and manage workflows.
Cloud Composer automation helps you create Airflow environments quickly and use Airflow-native tools, such as the powerful Airflow web interface and command-line tools.
Google runs this open source orchestration platform on top of a Google Kubernetes Engine (GKE) cluster.
This cluster manages the Airflow workers, and opens up a host of integration opportunities with other Google Cloud products.
The environment is a core concept in Cloud Composer.
You can create one or more Cloud Composer environments inside a project.
Environments are self-contained Airflow deployments based on Google Kubernetes Engine that can scale. You can scale Composer environments by adjusting the number of nodes or schedulers.
These environments work with Google Cloud services through connectors that are built into Airflow. You create Cloud Composer environments in supported regions, and the environments run within a Compute Engine zone.
For simple use cases, you can create one environment in one region. For complex use cases, you can create multiple environments within a single region or across multiple regions.
Airflow communicates with other Google Cloud products through the products' public APIs.
https://cloud.google.com/composer/docs/run-apache-airflow-dag
https://dsstream.com/differences-between-airflow-1-10-x-and-2-0/
https://medium.com/badal-io/airflow-2-development-environment-on-gcp-cloud-shell-5534b829e19a
https://cloud.google.com/composer/docs/composer-2/environment-architecture
Cloud Composer lets you author workflows with a Python API, schedule them to run automatically or start them manually, and monitor the execution of their tasks in real time through a graphical UI.
Cloud Composer offers built-in integration with BigQuery, Dataflow, Dataproc, Datastore, Cloud Storage, Pub/Sub, and AI Platform.
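A minimal sketch of authoring a DAG with the Python API and one of the built-in Google Cloud integrations (BigQuery). The DAG id, project, dataset, and table names below are placeholder assumptions, not values from this wiki's sources:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

# Placeholder query -- substitute your own project, dataset, and table.
QUERY = "SELECT COUNT(*) FROM `my-project.my_dataset.my_table`"

with DAG(
    dag_id="bq_count_rows",           # hypothetical DAG name
    schedule_interval="@daily",       # run once a day
    start_date=datetime(2022, 1, 1),
    catchup=False,                    # do not backfill past runs
) as dag:
    count_rows = BigQueryInsertJobOperator(
        task_id="count_rows",
        configuration={"query": {"query": QUERY, "useLegacySql": False}},
    )
```

Copying a file like this into the environment's DAGs bucket is enough for the scheduler to pick it up.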
During environment creation, Cloud Composer provides the following configuration options:
- Cloud Composer environment with a route-based GKE cluster (default)
- Private IP Cloud Composer environment
- Cloud Composer environment with a VPC-native GKE cluster using alias IP addresses
- Shared VPC
https://cloud.google.com/composer/docs/concepts/private-ip
https://cloud.google.com/composer/docs/how-to/using/writing-dags
https://www.youtube.com/watch?v=GeNFEtt-D4k
https://www.youtube.com/watch?v=RrKXZcKOz4A
https://mkuthan.github.io/blog/2022/03/15/gcp-cloud-composer-tuning/
https://cloud.google.com/composer/docs/tutorials
Airflow is a platform to programmatically author, schedule and monitor workflows.
Use Airflow to author workflows as Directed Acyclic Graphs (DAGs) of tasks. The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies.
Rich command line utilities make performing complex surgeries on DAGs a snap. The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.
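A minimal sketch of a DAG whose tasks have explicit dependencies, using the stock BashOperator (the DAG id and commands are placeholders); the scheduler runs each task only after its upstream tasks succeed:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="etl_demo",                # hypothetical DAG name
    schedule_interval="@daily",
    start_date=datetime(2022, 1, 1),
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    transform = BashOperator(task_id="transform", bash_command="echo transform")
    load = BashOperator(task_id="load", bash_command="echo load")

    # Bitshift syntax declares the dependency edges of the DAG.
    extract >> transform >> load
```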
Airflow is a framework with a microservice architecture. To deploy Airflow in a distributed setup, Cloud Composer provisions several Google Cloud components, which are collectively known as a Cloud Composer environment.
An example workflow: stop a VM, take snapshots, and start the VM again, using the Python API and Airflow (sketched below). Each step is implemented as an operator, and a collection of operators can be packaged into a plugin. A DAG composes the step operators into a pipeline. The DAGs and plugins are copied into Cloud Storage and run as Cloud Composer jobs.
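A minimal sketch of such a pipeline, assuming the Google provider's Compute Engine operators and placeholder project, zone, and instance names; the snapshot step shells out to gcloud rather than assuming a dedicated snapshot operator:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.google.cloud.operators.compute import (
    ComputeEngineStartInstanceOperator,
    ComputeEngineStopInstanceOperator,
)

PROJECT_ID = "my-project"    # placeholder: your project id
ZONE = "us-central1-a"       # placeholder: the instance's zone
INSTANCE = "my-vm"           # placeholder: the instance name

with DAG(
    dag_id="vm_snapshot_cycle",       # hypothetical DAG name
    schedule_interval=None,           # trigger manually
    start_date=datetime(2022, 1, 1),
    catchup=False,
) as dag:
    stop_vm = ComputeEngineStopInstanceOperator(
        task_id="stop_vm", project_id=PROJECT_ID, zone=ZONE, resource_id=INSTANCE
    )
    snapshot = BashOperator(
        task_id="snapshot_disk",
        bash_command=(
            f"gcloud compute disks snapshot {INSTANCE} "
            f"--zone={ZONE} --snapshot-names={INSTANCE}-snap"
        ),
    )
    start_vm = ComputeEngineStartInstanceOperator(
        task_id="start_vm", project_id=PROJECT_ID, zone=ZONE, resource_id=INSTANCE
    )

    stop_vm >> snapshot >> start_vm
```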
https://cloud.google.com/solutions/automating-infrastructure-using-cloud-composer
https://airflow.apache.org/docs/stable/concepts.html
https://www.youtube.com/watch?v=YWtfU0MQZ_4
http://michal.karzynski.pl/blog/2017/03/19/developing-workflows-with-apache-airflow/
An Operator is an atomic block of workflow logic, which performs a single action. Operators are written as Python classes (subclasses of BaseOperator), where the __init__ method configures settings for the task and a method named execute is called when the task instance runs.
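A minimal custom operator sketch along those lines (the class and its behavior are hypothetical, not from this wiki's sources):

```python
from airflow.models.baseoperator import BaseOperator


class GreetOperator(BaseOperator):
    """Hypothetical operator: performs one atomic action (logging a greeting)."""

    def __init__(self, name: str, **kwargs):
        super().__init__(**kwargs)  # BaseOperator handles task_id, retries, etc.
        self.name = name            # settings captured when the DAG file is parsed

    def execute(self, context):
        # Called when the task instance is executed by a worker.
        self.log.info("Hello, %s!", self.name)
        return self.name            # the return value is pushed to XCom
```

The returned value can then be pulled from XCom by downstream tasks, per the XCom docs linked below.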
https://airflow.apache.org/docs/stable/concepts.html#xcoms
https://medium.com/swlh/industrialization-of-a-ml-model-using-airflow-and-apache-beam-5a5338f20184
https://medium.com/flyr-labs-blog/why-were-switching-off-airflow-sort-of-780c4f58a660
https://itsvit.com/blog/google-cloud-composer-vs-astronomer-what-to-choose/
Note that creating a Composer environment will allocate VMs, create a Kubernetes cluster on them, and create a Cloud Storage bucket (named in a pattern like region-composer-env-name-number) where the DAGs and code are stored. Creating a Composer environment can take a while, and deleting the environment afterwards will clean up most of the resources, including the VMs and the Kubernetes cluster. But this will not delete the BigQuery data created as a result. Deleting the Composer environment also does not delete the auto-created GCS bucket; it has to be deleted manually.
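A minimal cleanup sketch for that leftover bucket, using the google-cloud-storage client library; the bucket name below is a hypothetical example of the auto-created naming pattern:

```python
from google.cloud import storage

client = storage.Client()
# Hypothetical auto-created name following the region-env-name-number pattern.
bucket = client.bucket("us-central1-my-env-1234abcd-bucket")

# The bucket must be empty before it can be deleted.
for blob in client.list_blobs(bucket):
    blob.delete()
bucket.delete()
```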
https://cloud.google.com/architecture/automating-infrastructure-using-cloud-composer
https://cloud.google.com/scheduler/docs/start-and-stop-compute-engine-instances-on-a-schedule
https://cloud.google.com/composer/docs/how-to/using/using-dataflow-template-operator
https://cloud.google.com/composer/docs/tutorials/health-check
https://cloud.google.com/architecture/cicd-pipeline-for-data-processing
https://cloud.google.com/dataproc/docs/tutorials/workflow-composer
https://medium.com/@kolban1/composer-invoking-long-running-services-4de2dfa5e33a
https://medium.com/google-cloud/composer-sendgrid-and-secrets-75e4b6e7581e
https://towardsdatascience.com/are-you-using-cloud-functions-for-event-based-processing-adb3ef35aba6
https://towardsdatascience.com/connect-airflow-worker-gcp-e79690f3ecea
https://medium.com/@jasperfast/google-cloud-composer-ci-cd-f469a09c9db8
Cloud Composer: Copying BigQuery Tables Across Different Locations