Guzzle Overview - ja-guzzle/guzzle_docs GitHub Wiki

Table of Contents

Overview

Guzzle is a series of data integration and data warehouse frameworks, targeting Big Data platforms to start with but intended to be extended to traditional DWH and RDBMS systems, extracts into BI tools like Tableau and Qlik, and OLAP solutions like Analytical Services and Essbase.

Some of these frameworks have existed in JA for both big data and Oracle platforms. The plan is to rebuild the framework, factoring in the learnings and experience from past implementations, along with new technologies and complementary open source projects.

Guzzle positioning material

  1. Guzzle Overview - Summary deck
  2. Guzzle Overview - Detail deck

Guzzle and Spark

Spark is at the core of all five modules:

  1. Ingestion is Spark end to end: raw data from files/JDBC/APIs/queues is read directly into Spark and written from Spark to the sink (see the first sketch after this list). We will never support the native ingestion capabilities of Oracle, Teradata and the like. Even for sinks like Cassandra and Elasticsearch, we use only the native connectors which Spark provides.
  2. The ELT pattern is supported for the Processing module, where a SQL statement is generated (INSERT INTO ... SELECT or MERGE INTO ...) and submitted to the DB (see the second sketch after this list)
  3. Recon has push-down optimization to generate a summary query and send it to the DB
  4. Check constraints bring the raw data, as per the source SQL/table, into Spark and generate DQ metrics there. The plan is to support push-down for this in future for JDBC endpoints
  5. Housekeeping today only supports Hive and Delta, but when it eventually starts supporting JDBC it will do push-down (this is on the roadmap)
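
Since the ingestion module runs entirely in Spark, the core pattern is a Spark read from the source followed by a Spark write to the sink. The PySpark snippet below is a minimal sketch of that pattern, not Guzzle's actual API; the JDBC URL, table name and Delta path are hypothetical placeholders.

```python
# Minimal PySpark sketch of the ingestion pattern: read raw data from a
# JDBC source into Spark, then write it from Spark to the sink (Delta here).
# Connection details and paths are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ingestion-sketch").getOrCreate()

# Read the source table through Spark's built-in JDBC connector
raw_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://source-host:5432/sales")
    .option("dbtable", "public.orders")
    .option("user", "etl_user")
    .option("password", "***")
    .load()
)

# Write to the sink from Spark; here a Delta table, but the same pattern
# applies to sinks like Cassandra or Elasticsearch via their Spark connectors.
raw_df.write.format("delta").mode("append").save("/data/raw/orders")
```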

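For the Processing module's ELT pattern, the transformation runs where the data lives: a SQL statement is generated and submitted to the database rather than pulling the data through Spark. The sketch below illustrates the idea with a hypothetical template function and table names; it is not Guzzle's actual SQL generator.

```python
# Sketch of the ELT pattern: render an INSERT INTO ... SELECT statement from a
# simple mapping, then submit it to the database so the work is pushed down.
# Table and column names are hypothetical.

def build_insert_select(target: str, source_sql: str, columns: list[str]) -> str:
    """Render an INSERT INTO ... SELECT statement for the target table."""
    col_list = ", ".join(columns)
    return f"INSERT INTO {target} ({col_list})\n{source_sql}"

sql = build_insert_select(
    target="mart.daily_sales",
    source_sql=(
        "SELECT order_date, SUM(amount) AS total_amount\n"
        "FROM staging.orders\n"
        "GROUP BY order_date"
    ),
    columns=["order_date", "total_amount"],
)

print(sql)
# In practice the generated statement is submitted to the target database over
# its JDBC driver instead of being executed inside Spark.
```

Recon's push-down works along the same lines, except the generated statement is a summary query (counts, sums and the like) whose result is brought back for comparison.
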
Guzzle Goals (RACE OIL)

Key Goals

The key goal of Guzzle is to build frameworks that are:

  1. Robust: Caters to the resilience and robustness of building and deploying data integration solutions: fail-over and recovery, tight handling of control variables, exceptions, and traceability. It should cater to straight-through processing, make use of the cluster, and support multiple parallel pipelines.
  2. Accelerate: The whole purpose of building these frameworks is to accelerate the implementation of analytics marts and data lake projects. The framework should capture all the repetitive patterns required for a typical data integration project and provide a reasonable level of abstraction and configuration (not too low level, as in traditional ETL tools).
  3. Consistency: The frameworks should bring consistency and predictability to a deployment. Consistency also implies no redundancy: all the key patterns around data ingestion, transformation, logging and loading should be centralized.
  4. Extensible: Fully extensible in terms of the sources and targets supported, stages of the data flow, instrumentation/logging and reporting, metadata, and the context of data integration

Other Goals

  1. Open source: Build on top of existing open source projects as much as possible
  2. Integrated yet Modular: While the components are well integrated and complement each other, the framework should support development, enhancement and deployment of each module independently.
  3. Lightweight: The framework should be lightweight in terms of runtime overhead and the effort to deploy, configure and have it up and running.