Data Fusion - bobbae/gcp GitHub Wiki
Data Fusion is a fully managed, cloud-native, scalable data integration platform. It is a GUI based data integration service for building and managing data pipelines.
Data Fusion is built using open source project CDAP, and this open core ensures data pipeline portability for users. CDAP’s broad integration with on-premises and public cloud platforms gives Cloud Data Fusion users the ability to break down silos and deliver insights that were previously inaccessible.
https://cloud.google.com/data-fusion
CDAP is an integrated, open source application development platform for the Hadoop ecosystem that provides developers with data and application abstractions to simplify and accelerate application development, address a broader range of real-time and batch use cases, and deploy applications into production while satisfying enterprise requirements.
https://github.com/cdapio/cdap
Under the hood, Data Fusion executes pipelines on a Dataproc cluster. Data Fusion automatically converts GUI based pipelines into Dataproc jobs for execution whenever a pipeline is executed. It supports two execution engine options: MapReduce and Apache Spark.
CDAP Example Apps
https://github.com/cdapio/cdap-apps
Modern, more secure data lakes on Google Cloud
Cloud Data Fusion helps users build scalable, distributed data lakes on Google Cloud by integrating data from siloed on-premises platforms. Customers can leverage the scale of the cloud to centralize data and drive more value out of their data as a result.
https://towardsdatascience.com/building-a-data-lake-on-gcp-with-cdap-6271c264f22e
The self-service capabilities of Cloud Data Fusion increase process visibility and lower the overall cost of operational support.
https://cloud.google.com/data-fusion/plugins
Choosing Dataflow, Dataprep, Dataproc or Data Fusion
https://wisdomplexus.com/blogs/dataproc-vs-dataflow-vs-dataprep/
Wrangler UI
https://datadice.medium.com/the-wrangler-user-interface-of-cloud-data-fusion-21e96cfe1b2d
Agile data warehouses with BigQuery
Cloud Data Fusion can help organizations better understand their customers by breaking down data silos and enabling development of agile, cloud-based data warehouse solutions in BigQuery. A trusted, unified view of customer engagement and behavior unlocks the ability to drive a better customer experience, which leads to higher retention and higher revenue per customer.
https://cloud.google.com/data-fusion/docs/tutorials/replicating-data/mysql-to-bigquery
Unified analytics environment
Many users today want to establish a unified analytics environment across a myriad of expensive, on-premises data marts. Employing a wide range of disconnected tools and stop-gap measures creates data quality and security challenges. Cloud Data Fusion’s vast variety of connectors, visual interfaces, and abstractions centered around business logic helps in lowering TCO, promoting self-service and standardization, and reducing repetitive work.
Data Fusion Features
https://cloud.google.com/data-fusion#section-10
UDD
https://medium.com/cdapio/a-definitive-guide-to-wrangler-user-defined-directive-udd-b1dc2d795ce6
Upload UDDs through REST API
https://jtaras.medium.com/cloud-data-fusion-upload-udds-through-the-rest-api-32d015de1e22
Data Fusion Secure Store
CCAI
Ephemeral Data Fusion using Terraform
Examples
Import CSV data into BigQuery using Data Fusion
https://codelabs.developers.google.com/codelabs/batch-csv-cdf-bq
End-to-end automated Analytics
Cloud Storage upload trigger Data Fusion ETL Pipelines
Config driven Pipeline
https://jtaras.medium.com/cloud-data-fusion-building-config-driven-pipelines-a40d23e9d011
Creating reusable pipeline
https://cloud.google.com/data-fusion/docs/tutorials/reusable-pipeline
Connect SAP data via Cloud Data Fusion
Google Cloud Cortex Framework
https://cloud.google.com/solutions/cortex
SAP SLT Replication plugin
https://cloud.google.com/data-fusion/docs/how-to/use-sap-slt-plugin
Reverse ETL
https://jtaras.medium.com/cloud-data-fusion-reverse-etl-from-bigquery-to-cloudsql-e8aa1e804ba5
Tutorials
https://cloud.google.com/data-fusion/docs/tutorials
Using Cloud Vision API from Data Fusion
Customize compute profiles at Runtime
https://jtaras.medium.com/cloud-data-fusion-customizing-compute-profiles-at-runtime-6420f9b185ba
Update deployed pipelines through REST API
https://jtaras.medium.com/cloud-data-fusion-update-deployed-pipelines-through-rest-api-2be0ad25e25f
Building job metadata pipelines
https://jtaras.medium.com/cloud-data-fusion-building-job-metadata-pipelines-356a5a0f6a83