What is Guzzle

Introduction

Key Drivers

There is a plethora of frameworks and data integration tools available for Big Data and traditional DWH use cases. The motivation for putting together Guzzle, a set of data integration frameworks, boils down to the following:

  1. To make Big Data technologies more accessible and enable wider adoption for typical data warehouse and data integration use cases. The idea is to simplify the implementation of the data integration requirements of a Data Lake and make them faster to deliver, easier to manage, and easier to extend.
  2. To address the standard data integration and DWH patterns that are usually not available as part of the native tools and standard frameworks.

What Guzzle is not

Guzzle is not meant to compete with existing data integration and ETL tools, or with existing ETL frameworks like Gobblin. While it provides native modules to support ingestion, data processing and other tasks, it also supports calling jobs/procedures from other frameworks and ETL tools.

Guzzle Architecture Overview

Guzzle is built as a combination of foundation (or common) services, native modules and external modules, which come together to provide an integrated set of accelerators that achieve [Guzzle's goal of RACE OIL](Documentation/guzzle-overview#guzzle-goals-race-oil).

Common Services

Job Dependency

  1. Supports flexibility in defining dependencies between the different stages of the data flow (staging, foundation, access layer); see the sketch after this list
  2. Supports dependencies between individual jobs within ETL stages
  3. Passes appropriate context to allow concurrent loading of the same target dataset
  4. Tightly coupled with the Data Load and Data Ingestion frameworks
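
A hypothetical sketch of the idea (the job names and the use of Python's graphlib are assumptions for illustration, not Guzzle's actual API): dependencies across staging, foundation and access jobs are declared as data and resolved into an execution order.

```python
# Hypothetical illustration (not Guzzle's actual API): modelling job
# dependencies across data-flow stages and resolving an execution order.
from graphlib import TopologicalSorter  # Python 3.9+

# Each key is a job; the set holds the jobs it depends on.
# Stage prefixes (stg/fnd/acc) mirror the staging -> foundation -> access layers.
dependencies = {
    "stg_load_customers": set(),
    "stg_load_orders": set(),
    "fnd_build_customer_orders": {"stg_load_customers", "stg_load_orders"},
    "acc_customer_mart": {"fnd_build_customer_orders"},
}

# A dependency-aware scheduler can run jobs in this (or any valid) order,
# passing run context so the same target dataset can be loaded concurrently
# for different systems/countries.
print(list(TopologicalSorter(dependencies).static_order()))
```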

Runtime audit

  1. Maintains granular run-time logs of individual jobs and intermediate steps
  2. Captures row counts of successful and exception records processed, along with start/end times
  3. Captures performance metrics such as CPU, memory and I/O usage of the data processing jobs
  4. All audits are captured in the context of system, country, data loading stage (staging, foundation, etc.) and table, for ease of reporting; a sketch of such a record follows
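
The sketch below shows the kind of audit record this implies; the field names are assumptions for illustration, not Guzzle's actual audit schema.

```python
# Hypothetical sketch (not Guzzle's actual schema) of the kind of audit
# record the runtime audit service captures for each job or step.
from dataclasses import dataclass, asdict
from datetime import datetime

@dataclass
class JobAudit:
    system: str          # source system the job runs for
    country: str         # country/entity context
    stage: str           # data loading stage: staging, foundation, access ...
    table: str           # target table/dataset
    start_time: datetime
    end_time: datetime
    rows_success: int    # successfully processed records
    rows_exception: int  # records routed to exception handling
    cpu_seconds: float   # resource usage metrics
    memory_mb: float
    io_mb: float

audit = JobAudit("core_banking", "SG", "staging", "stg_customers",
                 datetime(2019, 1, 31, 1, 0), datetime(2019, 1, 31, 1, 7),
                 rows_success=120_430, rows_exception=12,
                 cpu_seconds=84.2, memory_mb=512.0, io_mb=950.0)
print(asdict(audit))
```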

Data Endpoints

This is the abstraction for data endpoints such as local files, HDFS, RDBMS, etc.; a rough illustration follows.
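
A minimal sketch, assuming a made-up registry layout (the endpoint names and fields are not Guzzle's actual configuration), of what the abstraction captures:

```python
# Hypothetical endpoint registry; names and fields are assumptions,
# not Guzzle's actual config.
endpoints = {
    "landing_files": {"type": "local_file", "path": "/data/landing"},
    "datalake": {"type": "hdfs", "path": "hdfs://nn:8020/datalake"},
    "core_db": {"type": "rdbms",
                "jdbc_url": "jdbc:oracle:thin:@host:1521/ORCL",
                "user": "etl_user"},
}

# Jobs refer to endpoints by name, so the physical location and credentials
# can change without touching the job definitions.
print(endpoints["datalake"]["path"])
```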

System Parameters

These are global parameters which are determined during the invocation of the jobs.

User Parameters

These are additional parameters defined by the user.

File Upload Tools

A generic tool to upload files and stage them in HDFS or on a Unix file system (a minimal sketch of such a step follows).
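
A minimal sketch of what such a staging step amounts to, assuming the standard Hadoop client is installed; the function name and paths are made up for illustration, and this is not Guzzle's actual tool.

```python
# Illustrative only: copy a landed file into an HDFS staging directory
# using the standard Hadoop CLI.
import subprocess

def stage_file(local_path: str, hdfs_dir: str) -> None:
    # -f overwrites an existing file of the same name in the target directory.
    subprocess.run(["hdfs", "dfs", "-put", "-f", local_path, hdfs_dir], check=True)

stage_file("/data/landing/customers_20190131.csv", "/datalake/staging/customers/")
```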

Watermarks and Control information

This deals with keeping the state of control information, such as the last business date loaded for a given system/country.
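
A hypothetical sketch of watermark state keyed by system and country; the structure and helper are assumptions used only to illustrate the idea.

```python
# Hypothetical watermark store: last business date successfully loaded
# per (system, country). Not Guzzle's actual control tables.
from datetime import date, timedelta

watermarks = {
    ("core_banking", "SG"): date(2019, 1, 30),
    ("core_banking", "MY"): date(2019, 1, 29),
}

def next_business_date(system: str, country: str) -> date:
    # An incremental job would resume from the day after the recorded watermark.
    return watermarks[(system, country)] + timedelta(days=1)

print(next_business_date("core_banking", "SG"))  # 2019-01-31
```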

Performance Monitoring

This deals with monitoring granular resource usage on the cluster by the various jobs.

Native (or Internal) Modules

These are a series of modules that achieve specific workflows/tasks for data integration. While they leverage the services and context from the common services, they are intended to be fairly independent and can be run standalone. Native modules are loosely coupled, and all context is passed to a module as a series of parameters (you can think of it as passing a hash-map of key-value pairs), as sketched below.
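
The sketch below illustrates this calling convention with a hypothetical module and made-up parameter names; it is not Guzzle's actual interface.

```python
# Illustrative only: a native module receiving all of its context as a
# plain key-value map, as described above.
def run_ingestion(context: dict) -> None:
    # The module reads everything it needs from the map; nothing is shared
    # implicitly with the framework, which keeps the module standalone.
    print(f"Ingesting {context['source']} into {context['target_table']} "
          f"for {context['system']}/{context['country']} "
          f"on {context['business_date']}")

run_ingestion({
    "system": "core_banking",
    "country": "SG",
    "business_date": "2019-01-31",
    "source": "/data/landing/customers_20190131.csv",
    "target_table": "stg.customers",
})
```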

Ingestion

  1. Caters to ingesting data from files and relational databases in batch mode, and from Kafka in real-time mode
  2. Performs schema validation, control checks and file format checks
  3. Allows configuring the target partition scheme and incremental extraction criteria
  4. Handles staleness for late-arriving files
  5. Supports end-of-day/month handling, and merge, truncate-insert and append modes on the target (see the example configuration below)
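
Purely to illustrate the knobs listed above, a hypothetical ingestion job definition might look like the following; the field names and layout are assumptions, not Guzzle's actual config syntax.

```python
# Hypothetical ingestion job definition covering source, validation,
# target partitioning/load mode and incremental extraction.
ingestion_job = {
    "source": {
        "type": "delimited_file",
        "path": "/data/landing/customers_{business_date}.csv",
        "delimiter": "|",
        "control_file": "/data/landing/customers_{business_date}.ctl",  # row-count check
    },
    "validation": {
        "schema_check": True,       # column names/types must match the target
        "file_format_check": True,  # reject malformed rows
    },
    "target": {
        "table": "stg.customers",
        "partition_by": ["business_date", "country"],
        "load_mode": "truncate_insert",  # or merge / append
    },
    "incremental": {"column": "last_updated_ts", "watermark": "{last_loaded_ts}"},
}
print(ingestion_job["target"]["load_mode"])
```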

Data Processing

  1. A generic data loading framework which allows defining transformation and loading rules using declarative config (example sketch below)
  2. Data processing rules are defined as SQL
  3. Enforces consistent implementation of standards and design patterns
  4. Prevents rewriting of common ETL code and avoids the manual errors that come with it
  5. Allows performance and other relevant global parameters to be controlled centrally
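
A minimal sketch of such a declarative step, assuming a made-up config layout in which the transformation rule itself is SQL; this is not Guzzle's actual syntax.

```python
# Hypothetical declarative data-processing step: the transformation rule
# is plain SQL, and the write semantics are configuration.
processing_step = {
    "target": "fnd.customer_orders",
    "write_mode": "merge",
    "merge_keys": ["customer_id", "order_id"],
    "sql": """
        SELECT c.customer_id, o.order_id, o.order_date, o.amount
        FROM   stg.customers c
        JOIN   stg.orders    o ON o.customer_id = c.customer_id
        WHERE  o.business_date = '{business_date}'
    """,
}

# A generic executor would substitute parameters and run the SQL, so every
# job follows the same pattern and global settings stay centralized.
print(processing_step["sql"].format(business_date="2019-01-31"))
```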

Housekeeping

  1. Generic module to housekeep the data
  2. Allows configuring housekeeping based on date columns as well as other criteria
  3. Allows configuring retention periods for multiple time grains (xxx rolling days, yy rolling month ends, etc.; see the sketch below)
  4. Data falling outside the retention window can be purged or moved to an alternate location
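
The following hypothetical retention policy and purge check illustrate the rolling-window idea only; the field names and logic are assumptions, not Guzzle's actual housekeeping config.

```python
# Hypothetical retention policy for a date-partitioned table.
from datetime import date, timedelta

policy = {
    "table": "stg.customers",
    "date_column": "business_date",
    "keep_rolling_days": 90,        # keep the last 90 daily partitions
    "keep_rolling_month_ends": 13,  # plus 13 month-end snapshots
    "action": "archive",            # or "purge"
}

def outside_daily_window(partition_date: date, today: date) -> bool:
    # Daily partitions older than the rolling-day window become candidates
    # for purge/archival (the month-end rule would be checked separately).
    return partition_date < today - timedelta(days=policy["keep_rolling_days"])

print(outside_daily_window(date(2018, 9, 1), today=date(2019, 1, 31)))  # True
```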

Constraint Check

  1. Performs Data Quality (DQ) validation on specified columns and tables
  2. Logs the records and statistics failing the constraint checks
  3. The validation rules apply to structured data and can currently be specified as SQL, as in the example below
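
A hedged example of DQ rules expressed as SQL, using made-up rule names and tables; how the rules are stored and executed here is an assumption for illustration.

```python
# Hypothetical DQ rules: each query returns the records violating a
# constraint, which would then be logged along with failure statistics.
dq_rules = [
    {"name": "customer_id_not_null",
     "sql": "SELECT * FROM stg.customers WHERE customer_id IS NULL"},
    {"name": "amount_non_negative",
     "sql": "SELECT * FROM stg.orders WHERE amount < 0"},
]

for rule in dq_rules:
    # A constraint-check runner would execute rule["sql"], count the failing
    # rows, and write both the rows and the counts to the audit tables.
    print(rule["name"], "->", rule["sql"])
```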

Recon and Traceability

  1. Recon framework for technical reconciliation between source and target datasets
  2. Performs count, hash and sum checks (illustrated below)
  3. Maintains a detailed list of records (PK values/row IDs) having reconciliation gaps
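
As an illustration, the count/sum/hash checks could be driven from query templates like these; the SQL shapes (including the HASH function) are engine-specific assumptions rather than Guzzle's actual queries.

```python
# Hypothetical recon checks run against both the source and target tables.
recon_checks = {
    "count": "SELECT COUNT(*) FROM {table}",
    "sum":   "SELECT SUM(amount) FROM {table}",
    # A hash over key columns; gaps can then be drilled down to PK level.
    "hash":  "SELECT SUM(ABS(HASH(customer_id, order_id, amount))) FROM {table}",
}

for check, template in recon_checks.items():
    source_sql = template.format(table="src.orders")
    target_sql = template.format(table="fnd.orders")
    # The framework would run both queries and record any mismatch, along
    # with the PK values/row IDs of the records that do not reconcile.
    print(check, "|", source_sql, "vs", target_sql)
```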

External Modules

These are external frameworks and tools that are supported by Guzzle.

Gobblin

ETL/ELT Tools

ETL or ELT tools like ODI and Informatica can be integrated with Guzzle.

Data Prep tools

Data prep tools like Paxata, Dataiku, Trifacta and Datameer can be orchestrated and hooked in as external modules.
