# Guzzle Best Practices
- Purpose
- Best Practices
- Context columns and Stages
- Organization and Naming
- GIT Integration
- Guzzle Environments
- Usage of Import feature in Guzzle
- Usage of Parameters in Guzzle jobs
- Organizing tables in databases and Schema for Hive/Delta
- Defining Logical and Physical end point and the grains
- Job Groups
- Delta vs Hive table
- Ingestion vs Processing
- Reporting Cache
- Usage of Data end-point Environment
- Git for Guzzle
This page captures Guzzle best practices. Guzzle is a powerful tool for building sophisticated data integration pipelines. This document provides guidance on using Guzzle effectively for real-world data projects, ensuring resilient data pipelines as well as faster innovation.
- Based on the project's current and future requirements, define the Guzzle context columns. More detail on Guzzle context params is captured in: https://github.com/ja-guzzle/docs/wikis/Documentation/Guzzle-Parameters
- It is recommended to keep context columns minimal and based on their applicability across different data pipelines - they become context params which are prompted for all jobs and appear in most of the report tables
- Stages should be defined if the Guzzle batch framework is planned to be used - now or in the future. The list of stages can be based on current and future requirements; recommended stages are: STG, FND, SNP, REP, CALC, OUT
- It is recommended to prefix context columns to distinguish them from business columns, for example: w_system, w_location
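As an illustration of the prefix convention above, a staging table might carry w_-prefixed context/framework columns alongside the business columns (the table and column names below are hypothetical, shown only to illustrate the convention):

```sql
-- Hypothetical staging table: business columns plus w_-prefixed context columns
CREATE TABLE stg.stg_orders (
  order_id      STRING,
  order_amount  DECIMAL(18,2),
  order_date    DATE,
  -- context/framework columns, prefixed with w_ to set them apart
  w_system      STRING,     -- source system code
  w_location    STRING,     -- country/location code
  w_batch_id    STRING,     -- batch identifier (illustrative)
  w_insert_ts   TIMESTAMP   -- load timestamp (illustrative)
) USING DELTA;
```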
- Have a consistent naming convention for Guzzle jobs, job groups, and end points. It is suggested to use layer prefixes such as stg_load_xxx, fnd_xxx
- Use tags to group the jobs by: layer in the data flow, business domain, and sub-domain. Job type can also be one of the tags
- A sequence number can be added to the job group name
- Enabling Git integration for non-prod environments is a must - so that all changes made to jobs are tracked automatically
- It is recommended to follow Gitflow as the Git workflow (https://datasift.github.io/gitflow/IntroducingGitFlow.html)
- Lock the "develop" and "master" branches
- Since a Guzzle environment is made up of specific items like context columns, two separate applications whose context columns do not align should go into separate Guzzle instances. Other reasons why two applications should have separate Guzzle instances are:
  - Security requirements - today Guzzle does not have granular job-level security
  - More control over Guzzle binaries and fixes
  - Manageability - start/stop of Guzzle services
- If non-production usage needs a dedicated environment for user testing (where Guzzle jobs are frozen) and a development environment where users make changes in Git branches and deploy them on "default" to run scheduled jobs / unit testing, then these have to be separated out.
- Guzzle allows you to define template jobs which can be used to inherit common configs
- Configs like standard filters to be applied on a given source data set, population of framework columns, and any other generic items should be part of these templates
- Defining separate jobs with a fixed source and target, while inheriting common configs from the master generic config, ensures the data jobs remain distinct and lineage gets captured correctly
- Job parameters should be used to avoid hardcoding of various job aspects - this allows making the jobs generic
- Usage of job params for source and target table/file names should be avoided if lineage needs to be captured
- Understand the precedence rules between params passed at the time of invocation, params specified at the job group level when adding jobs, params set at the environment mapping level, and context params passed when invoking a batch: https://github.com/ja-guzzle/docs/wikis/Documentation/Guzzle-Parameters
- Default the param values so that lineage is formed and a default job run (without any param override) during unit testing completes successfully
- The suggestion is to let job group params auto-map to job params
- Keep params as atomic as possible. For example, do not have a param named country_system with values like SG_ORDER; it is better to have two params and use a Groovy template to combine them when that is occasionally required. Essentially, understand the Groovy syntax and use it effectively to reuse the same params for multiple purposes - for example, concatenating the country and system params wherever a combined value is needed (see the sketch after this list)
- To achieve correct lineage in Atlas, a parameter set can be used to loop through the same job config with multiple param values: https://github.com/ja-guzzle/guzzle_common/issues/233
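To illustrate the atomic-param recommendation above, a processing SQL can reference the two separate params and concatenate them only where the combined value is actually needed (the param names country and system, the end point name and the table name are assumptions for this sketch):

```sql
-- Illustrative processing SQL: country and system stay atomic params;
-- they are combined only in the one place the composite value is needed.
SELECT
  order_id,
  order_amount,
  '${country}'            AS w_location,
  '${system}'             AS w_system,
  '${country}_${system}'  AS country_system   -- composite value built on the fly
FROM ${endpoint.lo_guzzle_stg}.stg_orders
WHERE w_location = '${country}'
  AND w_system   = '${system}';
```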
- It is recommended to keep the tables of different layers in their respective schemas/databases. This simplifies overall management and security
- When tables are referred to in pre-SQL, post-SQL and the SQLs used in processing/ingestion/DQ/recon, an explicit database name is expected as part of the SQL. Instead of hardcoding the database name in these SQLs, one can refer to the database name of the logical end point (after it is resolved to the physical end point) using:
  `${endpoint.lo_guzzle_hive}.<table_name>`
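For example, a processing SQL that reads from one layer and writes to another can pick up both database names from logical end points instead of hardcoding them (the end point names lo_guzzle_stg / lo_guzzle_fnd and the table names are illustrative):

```sql
-- Sketch: database names resolved from logical end points, so the same
-- config runs unchanged across environments.
INSERT INTO ${endpoint.lo_guzzle_fnd}.fnd_orders
SELECT s.order_id,
       s.order_amount,
       c.customer_name
FROM ${endpoint.lo_guzzle_stg}.stg_orders   s
JOIN ${endpoint.lo_guzzle_fnd}.fnd_customer c
  ON s.customer_id = c.customer_id;
```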
- Decide the logical end points carefully, as this is crucial to how Guzzle datasets are grouped in lineage. Ideally treat each schema/database name in Hive, or each schema in SQL Server, as a separate logical end point even though they belong to the same database and the user for the end point is the same. Likewise for files, have a separate end point for each container or directory which logically groups one data domain (for example, one system). Avoid using a top-level directory, as that forces you to put the sub-directory into the file name pattern, either hardcoded or via a param - and such a sub-directory might be environment specific, making the job config either not portable across environments or requiring an additional param to be provided at runtime
- Avoid relying on hardcoded schema prefixes as much as possible when specifying the source, target or SQL - this ensures the jobs are portable across environments; use the end point reference shown above instead
- A job group should perform the end-to-end task for each system for one stage of data loading (for example, a job group should be defined to copy data from files into STG_ tables for system xxx)
- Any pre-tasks such as copying files from a remote machine, calling a remote script to generate such files (if it is to be orchestrated by Guzzle), and any pre-processing of data in files should be done via external jobs that are part of the job group
- While Guzzle supports a resume feature, design the job groups and jobs such that the entire job group can be re-run on failure - which means ensuring the scripts do the required cleanup and regenerate the incremental data to be fed in (for scripts triggered on a remote system to generate data files), etc., as sketched below
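One way to make such a re-run safe is a pre-SQL cleanup that removes anything loaded by a previous, failed attempt of the same batch (the w_batch_id column, batch_id param and table name are assumptions for this sketch):

```sql
-- Hypothetical pre-SQL: clear rows from a failed run of the same batch
-- so that re-running the whole job group does not create duplicates.
DELETE FROM ${endpoint.lo_guzzle_stg}.stg_orders
WHERE w_batch_id = '${batch_id}';
```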
- When working with header-less files, it is best to put the column mapping in the source section instead of using the transformation section to map col1 -> first_name and so forth. This enables the following:
  - Allows sampling with proper column names
  - Ensures the transformation section is used for real transformations, like concatenating first name and last name (and not col1 and col2) - see the sketch after this list
  - When column mapping is not done in the source section, the source data frame will have both col1, col2, ... and first_name, last_name, ..., which results in the auto-created table having double the columns
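As a small sketch of the last two points, once col1/col2 are mapped to business names in the source section, the transformation only needs to hold the real business logic (the column and table names here are illustrative):

```sql
-- Illustrative transformation: operates on mapped business columns,
-- not on positional col1/col2 names.
SELECT concat(first_name, ' ', last_name) AS full_name,
       upper(country_code)                AS country_code
FROM stg_customer;
```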
- It is recommended to use Delta over Hive tables - Delta optimizes metadata handling by keeping it in JSON-based transaction logs
- Delta supports time travel and expectations (check constraints) - illustrated after this list
- Supports streaming as well as batch workloads
- Enforces schema on write
- Handles schema drift quite well
- Processing for Delta tables uses ELT (templates) and in turn uses the native MERGE and other implementations to move data across
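A couple of the Delta capabilities mentioned above in plain SQL (table, constraint and version values are illustrative):

```sql
-- Check constraint ("expectation") enforced on every write to the Delta table
ALTER TABLE fnd.fnd_orders
  ADD CONSTRAINT order_amount_positive CHECK (order_amount >= 0);

-- Time travel: query the table as of an earlier version or timestamp
SELECT * FROM fnd.fnd_orders VERSION AS OF 12;
SELECT * FROM fnd.fnd_orders TIMESTAMP AS OF '2024-01-01';
```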
- Ingestion should be used when the data movement is across heterogeneous databases. Example: from the Delta lake to SQL Warehouse or SQL Server
- Data movement within the same database can be done via processing
- Building CSV -> Parquet file -> external table in Hive and then loading to aggregates will not generate lineage in Atlas. Ideally one should do: CSV -> Delta or Hive tables -> aggregate
- For scenarios like taking incremental data from the lake (Hive, Delta, files on ADLS) and merging it into a dimension table in SQL Server (or any JDBC target), the following options are possible, in order of preference:
  - Ideally, keep the dimension table in the lake first, merge the data there, and then move/mirror the table to SQL Server
  - If the data volume is huge and the merge touches only a small number of records, the incremental data should be copied over to a staging table in SQL Server and the merge done there (see the sketch after this list)
  - Another option (which only works for selected scenarios) is to create an external table in SQL Server over the incremental data on the lake and then do the merge
- It is ideal to leverage the data processing capabilities of Databricks - hence it is better to generate aggregates in Databricks and copy them over (only the impacted partitions)
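A sketch of the second option in the list above: once the incremental rows are landed in a SQL Server staging table, a standard MERGE keeps the dimension in sync (all table and column names are hypothetical):

```sql
-- Hypothetical T-SQL merge from a staging table into the dimension table
MERGE INTO dbo.dim_customer AS tgt
USING stg.stg_customer_incr AS src
  ON tgt.customer_id = src.customer_id
WHEN MATCHED THEN
  UPDATE SET tgt.customer_name = src.customer_name,
             tgt.country       = src.country
WHEN NOT MATCHED THEN
  INSERT (customer_id, customer_name, country)
  VALUES (src.customer_id, src.customer_name, src.country);
```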
Often there will be a requirement to maintain a reporting cache in SQL Server, or in some cases in SQL Warehouse.
- The suggestion is to limit this data to only the required tables, and to use the features of the reporting tool to navigate to details when required
- Dimension tables should be copied fully (except dated ones); fact tables should be loaded incrementally
- Ensure the end-to-end batch flow covers loading from staging to foundation to aggregate to the reporting cache
- It is ideal to leverage the computation capabilities of Databricks when preparing this data
- Use appropriate Spark connector for better performance
- Perform recon to ensure the data ties back between the lake and the reporting cache (see the sketch below)
- Ensure there are no data type alignment issues - including numeric and date types
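A minimal recon sketch for the point above: run the same count/sum on both sides and have the recon job compare the results (end point, table and param names are assumptions):

```sql
-- Lake side (Delta/Hive)
SELECT count(*) AS row_cnt, sum(order_amount) AS total_amount
FROM ${endpoint.lo_guzzle_rep}.agg_daily_sales
WHERE order_date = '${business_date}';

-- Reporting cache side (SQL Server)
SELECT count(*) AS row_cnt, sum(order_amount) AS total_amount
FROM rep.agg_daily_sales
WHERE order_date = '${business_date}';
```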
- Hardcoding of physical environment details should be avoided
- The database name is required as a prefix for most of the modules (processing, recon, etc.) - to avoid hardcoding it, the name can be picked up from the end points
- Ensure the same logical end point is used so that correct lineage is formed for staging, FND and other layers - especially when the SQL used in processing refers to tables from multiple layers
- Ensure you are always working on the collab branch or a topic branch. Once the changes are finalized and merged into the collab branch, they can then be published to the default branch
- Changes done in "default" cannot be pushed to the branches