# What is Guzzle
- Introduction
  - Key Drivers
- What Guzzle is not
- Guzzle Architecture Overview
- Common Services
  - Job Dependency
  - Runtime audit
  - Data Endpoints
  - System Parameters
  - User Parameters
  - File Upload Tools
  - Watermarks and Control information
  - Performance Monitoring
- Native (or Internal) Modules
  - Ingestion
  - Data Processing
  - Housekeeping
  - Constraint Check
  - Recon and Traceability
- External Modules
  - Gobblin
  - ETL/ELT Tools
  - Data Prep tools
## Introduction

There is a plethora of frameworks and data integration tools available for Big Data and traditional DWH use cases. The motivation for putting together Guzzle, a set of data integration frameworks, boils down to the key drivers below.

### Key Drivers

- To make Big Data technologies more accessible and to enable wider adoption for typical data warehouse and data integration use cases. The idea is to simplify the implementation of the data integration requirements of a data lake and to make it faster to deliver, easier to manage and easier to extend.
- To address the standard data integration and DWH patterns that are usually not available as part of native tools and standard frameworks.
## What Guzzle is not

Guzzle is not meant to compete with existing data integration and ETL tools, or with existing ETL frameworks such as Gobblin. While it provides native modules to support Ingestion, Data Processing and others, it also supports calling jobs and procedures from other frameworks and ETL tools.
## Guzzle Architecture Overview

Guzzle is built as a combination of foundation (or common) services, native modules and external modules, which come together to provide an integrated set of accelerators that achieve [Guzzle's goal of RACE OIL](Documentation/guzzle-overview#guzzle-goals-race-oil).
## Common Services

### Job Dependency

- Supports flexibility in defining dependencies between the different stages of the data flow (staging, foundation, access layer); a sketch follows this list
- Supports dependencies between individual jobs within an ETL stage
- Passes the appropriate context to allow concurrent loading of the same target dataset
- Tightly coupled with the data load and data ingestion frameworks
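A minimal sketch of how stage- and job-level dependencies and the job context could be expressed, assuming a plain dict-based layout and a stub `run_job` function; none of these names are Guzzle's actual API or configuration format.

```python
# Hypothetical illustration of stage/job dependencies and context passing.
from graphlib import TopologicalSorter  # Python 3.9+

# Stage-level dependencies: access depends on foundation, which depends on staging.
stage_deps = {
    "staging": set(),
    "foundation": {"staging"},
    "access": {"foundation"},
}

# Job-level dependencies within a stage.
job_deps = {
    "foundation": {
        "load_customer": set(),
        "load_account": set(),
        "load_txn": {"load_customer", "load_account"},
    }
}

def run_job(stage: str, job: str, context: dict) -> None:
    # A real scheduler would submit the job; this stub only shows the context
    # (system/country/business date) that lets concurrent loads of the same
    # target dataset be distinguished.
    print(f"running {stage}.{job} with context {context}")

context = {"system": "core_banking", "country": "SG", "business_date": "2023-10-31"}

for stage in TopologicalSorter(stage_deps).static_order():
    for job in TopologicalSorter(job_deps.get(stage, {})).static_order():
        run_job(stage, job, context)
```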
### Runtime audit

Runtime audit maintains granular logs of individual jobs and their intermediate steps (a possible record layout is sketched after the list below):

- Row counts of successful and exception records processed, plus start/end times
- Performance metrics such as CPU, memory and IO usage of the data processing jobs
- All audits are captured in the context of system, country, data loading stage (staging, foundation, etc.) and table, for ease of reporting
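As an illustration of that granularity, a hypothetical audit record might look like the following; the field names are assumptions for this sketch, not Guzzle's actual audit schema.

```python
# Hypothetical shape of a runtime audit record.
from dataclasses import dataclass, asdict
from datetime import datetime

@dataclass
class JobAuditRecord:
    system: str            # source system the job belongs to
    country: str
    stage: str             # staging, foundation, access, ...
    table: str             # target table being loaded
    job_name: str
    start_time: datetime
    end_time: datetime
    rows_processed: int
    rows_rejected: int     # exception records
    cpu_seconds: float     # resource usage captured for the run
    peak_memory_mb: float
    status: str            # SUCCEEDED / FAILED

record = JobAuditRecord(
    system="core_banking", country="SG", stage="foundation",
    table="fnd_txn", job_name="load_txn",
    start_time=datetime(2023, 10, 31, 1, 0), end_time=datetime(2023, 10, 31, 1, 7),
    rows_processed=1_250_000, rows_rejected=42,
    cpu_seconds=310.5, peak_memory_mb=2048.0, status="SUCCEEDED",
)
print(asdict(record))
```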
### Data Endpoints

Data endpoints are the abstraction for any data endpoint such as local files, HDFS, an RDBMS, etc.
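A minimal sketch of what such an endpoint abstraction could look like, assuming a simple kind/location/options structure; the names are illustrative only.

```python
# Hypothetical endpoint abstraction: a uniform way to describe where data
# lives (local file, HDFS path, RDBMS table).
from dataclasses import dataclass, field

@dataclass
class Endpoint:
    kind: str                      # "local_file", "hdfs", "jdbc", ...
    location: str                  # path or JDBC URL
    options: dict = field(default_factory=dict)

src = Endpoint(kind="jdbc",
               location="jdbc:oracle:thin:@dbhost:1521/CORE",
               options={"table": "CUSTOMER", "fetch_size": 10_000})
dst = Endpoint(kind="hdfs",
               location="hdfs://nn:8020/data/staging/customer",
               options={"format": "parquet"})
```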
### System Parameters

System parameters are global parameters that are determined at the time a job is invoked.
### User Parameters

User parameters are additional parameters supplied by the user.
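The two kinds of parameters could be combined at invocation time along the following lines; the parameter names and the merge rule are assumptions for illustration, not Guzzle's actual behaviour.

```python
# Hypothetical parameter resolution: system parameters are determined at
# invocation time, user parameters are supplied per job, and the merged map
# is what a job actually sees.
from datetime import date
import uuid

def resolve_parameters(user_params: dict) -> dict:
    system_params = {
        "business_date": date.today().isoformat(),  # derived at invocation
        "run_id": uuid.uuid4().hex,
        "environment": "dev",
    }
    # User parameters may add keys but, in this sketch, cannot override
    # system parameters.
    return {**user_params, **system_params}

params = resolve_parameters({"country": "SG", "source_system": "core_banking"})
print(params)
```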
### File Upload Tools

A generic tool to upload and stage files in HDFS or on a Unix file system.
### Watermarks and Control information

This service keeps track of control information, such as the last business date loaded for a given system and country.
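A minimal sketch of such a control table, assuming it is keyed by system and country and stores the last business date loaded; the table and column names are hypothetical, and sqlite3 merely stands in for whatever store holds the control data.

```python
# Hypothetical control/watermark table keyed by system and country.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE load_control (
        system       TEXT,
        country      TEXT,
        last_loaded  TEXT,      -- last business date loaded (ISO format)
        PRIMARY KEY (system, country)
    )
""")

def advance_watermark(system: str, country: str, business_date: str) -> None:
    # Record the latest business date loaded for this system/country.
    con.execute(
        "INSERT OR REPLACE INTO load_control (system, country, last_loaded) VALUES (?, ?, ?)",
        (system, country, business_date),
    )

advance_watermark("core_banking", "SG", "2023-10-31")
print(con.execute("SELECT * FROM load_control").fetchall())
```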
### Performance Monitoring

This service deals with monitoring granular resource usage on the cluster by the various jobs.
## Native (or Internal) Modules

These are a series of modules that achieve specific workflows and tasks for data integration. While they leverage the services and context of the common services, they are designed to be fairly independent and can be run standalone. Native modules are loosely coupled, and all context is passed to a module as a series of parameters (you can think of it as passing a hash map of key-value pairs), as sketched below.
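A minimal sketch of this calling convention, assuming a stub dispatcher; the module name and parameter keys are illustrative and not Guzzle's actual interface.

```python
# Hypothetical illustration of the loosely coupled contract described above:
# a native module receives all of its context as a plain key-value map.

def run_module(module: str, params: dict) -> None:
    # A real implementation would dispatch to the ingestion, processing,
    # housekeeping, etc. module; this stub only shows the calling convention.
    print(f"invoking {module} with {params}")

run_module("ingestion", {
    "source_system": "core_banking",
    "country": "SG",
    "business_date": "2023-10-31",
    "target_table": "stg_customer",
})
```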
### Ingestion

- Caters to ingesting data from files and relational databases in batch mode, and from Kafka in real-time mode
- Performs schema validation, control checks and file format checks
- Allows configuring the target partition scheme and the incremental extraction criteria
- Handles staleness for late-arriving files
- Supports end-of-day/month handling, and merge, truncate-insert and append modes on the target (a sample job definition is sketched after this list)
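Pulling those options together, an ingestion job definition could conceptually look like the dictionary below; the structure and key names are assumptions for illustration, not Guzzle's actual config format.

```python
# Hypothetical ingestion job definition.
ingestion_job = {
    "source": {
        "type": "file",                      # file | jdbc | kafka
        "path": "/landing/core_banking/customer_*.csv",
        "format": "csv",
    },
    "validation": {
        "schema_check": True,                # reject files whose layout drifts
        "control_file_check": True,          # compare row counts vs control file
    },
    "incremental": {
        "column": "last_updated_ts",         # extraction criteria for deltas
    },
    "target": {
        "table": "staging.stg_customer",
        "partition_by": ["business_date", "country"],
        "load_mode": "merge",                # merge | truncate_insert | append
    },
    "late_arrival": {
        "max_staleness_days": 2,             # how late a file may arrive
    },
}
```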
### Data Processing

- A generic data loading framework that allows defining transformation and loading rules using declarative config (see the sketch after this list)
- Data processing rules are defined as SQL
- Enforces consistent implementation of standards and design patterns
- Prevents rewriting of common ETL code and avoids the manual errors that come with it
- Allows performance and other relevant global parameters to be controlled centrally
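A minimal sketch of a declarative, SQL-based processing rule and a thin runner that turns it into an `INSERT ... SELECT`; the rule layout and parameter substitution shown here are assumptions, not Guzzle's actual implementation.

```python
# Hypothetical declarative processing rule: the transformation is config,
# and a small runner renders it into an executable statement.
processing_rule = {
    "target_table": "foundation.fnd_txn",
    "load_mode": "append",
    "sql": """
        SELECT t.txn_id,
               t.account_id,
               t.amount * fx.rate AS amount_base_ccy,
               '{business_date}'  AS business_date
        FROM   staging.stg_txn t
        JOIN   staging.stg_fx_rate fx ON fx.ccy = t.ccy
    """,
}

def render_statement(rule: dict, params: dict) -> str:
    # Substitute job parameters into the SQL and wrap it for the target table.
    select_sql = rule["sql"].format(**params)
    return f"INSERT INTO {rule['target_table']}\n{select_sql}"

print(render_statement(processing_rule, {"business_date": "2023-10-31"}))
```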
### Housekeeping

- A generic module to housekeep data
- Allows configuring housekeeping based on date columns as well as other criteria
- Allows configuring retention across multiple time periods (e.g. xxx rolling days, yy rolling month-ends), as sketched after this list
- Data falling outside the retention window can be purged or moved to an alternate location
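A minimal sketch of multi-grain retention, assuming daily partitions and a policy of keeping the last N rolling days plus the last M month-end partitions; the policy shape and function names are hypothetical.

```python
# Hypothetical retention calculation: everything returned here would be
# purged or moved to an alternate location.
from datetime import date, timedelta

def is_month_end(d: date) -> bool:
    return (d + timedelta(days=1)).month != d.month

def partitions_to_purge(partitions: list[date], rolling_days: int,
                        rolling_month_ends: int, as_of: date) -> list[date]:
    # Keep the most recent rolling_days daily partitions...
    keep = {d for d in partitions if (as_of - d).days < rolling_days}
    # ...plus the most recent rolling_month_ends month-end partitions.
    month_ends = sorted((d for d in partitions if is_month_end(d)), reverse=True)
    keep.update(month_ends[:rolling_month_ends])
    return sorted(d for d in partitions if d not in keep)

parts = [date(2023, 10, 1) + timedelta(days=i) for i in range(31)]
print(partitions_to_purge(parts, rolling_days=7, rolling_month_ends=3,
                          as_of=date(2023, 10, 31)))
```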
### Constraint Check

- Performs Data Quality (DQ) validation on specified columns and tables
- Logs the records and statistics that fail the constraint checks
- The validation rules apply to structured data and can currently be specified as SQL (see the sketch after this list)
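A minimal sketch of SQL-specified DQ rules, where each constraint is a failure predicate and failing rows are collected for logging; the table, column and rule names are assumptions, and sqlite3 stands in for the actual data store.

```python
# Hypothetical SQL-based constraint checks with failing-record capture.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE stg_customer (cust_id INTEGER, email TEXT, age INTEGER)")
con.executemany("INSERT INTO stg_customer VALUES (?, ?, ?)",
                [(1, "a@x.com", 34), (2, None, 29), (3, "c@x.com", -5)])

dq_rules = [
    {"name": "email_not_null", "table": "stg_customer", "predicate": "email IS NULL"},
    {"name": "age_non_negative", "table": "stg_customer", "predicate": "age < 0"},
]

for rule in dq_rules:
    failing = con.execute(
        f"SELECT * FROM {rule['table']} WHERE {rule['predicate']}").fetchall()
    # A real framework would persist the failing records and statistics;
    # here we just print the summary.
    print(rule["name"], "failed rows:", len(failing), failing)
```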
### Recon and Traceability

- A recon framework for technical reconciliation between source and target datasets
- Performs count, hash and sum checks (sketched after this list)
- Maintains a detailed list of records (PK values/row IDs) that have reconciliation gaps
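A minimal sketch of the three technical recon checks and the gap list, assuming in-memory records keyed by a primary key; the record layout and hashing scheme are illustrative assumptions.

```python
# Hypothetical count/sum/hash reconciliation between source and target.
import hashlib

source = [{"txn_id": 1, "amount": 100.0}, {"txn_id": 2, "amount": 250.0},
          {"txn_id": 3, "amount": 75.0}]
target = [{"txn_id": 1, "amount": 100.0}, {"txn_id": 2, "amount": 250.0}]

def row_hash(rows):
    # Order-independent hash over the rows' key fields.
    digests = sorted(hashlib.sha256(f"{r['txn_id']}|{r['amount']}".encode()).hexdigest()
                     for r in rows)
    return hashlib.sha256("".join(digests).encode()).hexdigest()

print("count check :", len(source) == len(target))
print("sum check   :", sum(r["amount"] for r in source) == sum(r["amount"] for r in target))
print("hash check  :", row_hash(source) == row_hash(target))

# Detailed gap list: primary keys present on one side only.
src_keys = {r["txn_id"] for r in source}
tgt_keys = {r["txn_id"] for r in target}
print("recon gaps  :", sorted(src_keys ^ tgt_keys))
```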
## External Modules

These are external frameworks and tools that are supported by Guzzle.
### ETL/ELT Tools

ETL or ELT tools like ODI and Informatica can be integrated with Guzzle.
### Data Prep tools

Data prep tools like Paxata, Dataiku, Trifacta and Datameer can be orchestrated and hooked in as external modules.