Rethink Batch Design in context of ADF - ja-guzzle/guzzle_docs GitHub Wiki

Background

A typical data integration project has to deal with the following concerns around batch and data integration. Below are the areas where we see what ADF offers contending with what Guzzle offers.

  1. **Connectivity** (any-to-any data movement) - specifically on-premise to cloud (crucial), plus other Azure sources such as O365
  2. Orchestration - dependency management: honoring context (dates, loops, catch-up); parallel runs and throttling; dependencies (including error handling); partial runs and resume (occasionally needed)
  3. **Scheduling** - event-based or time-triggered: an ADF trigger through a shell pipeline calling the Guzzle batch, or Control-M / Automic / AutoSys / crontab / Windows batch
  4. Runtime/audit monitoring - we provide a monitoring UI to see what is running
  5. Notification

The items in bold are candidates for ADF while the rest fall to Guzzle - assuming Guzzle manages orchestration and hence runtime audit and monitoring. Notification remains a shared topic: the schedules run in ADF, while the details of what actually ran remain with Guzzle.

Guzzle orchestration capabilities

  1. Considerable depth has been built into Guzzle's orchestration capabilities
  2. The job group construct gives a lot of flexibility: run jobs in parallel with a configurable degree of parallelism, and auto-generate dependencies from SQL or from explicitly specified source and target datasets
  3. Along with job groups, batches bring further capabilities: business dates, context parameters, stages, catch-up, and dependency management across batches
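The auto-dependency idea in point 2 can be sketched roughly: job B depends on job A when B reads a dataset that A writes. A minimal illustration, with invented job and dataset names - this is not the actual Guzzle config format:

```python
# Illustrative sketch of auto-generated dependencies: job B depends on
# job A when B reads a dataset that A writes. Job and dataset names are
# invented; the real Guzzle configuration differs.

jobs = {
    "load_customer": {"sources": ["landing.customer"], "targets": ["stg.customer"]},
    "conform_customer": {"sources": ["stg.customer"], "targets": ["fnd.customer"]},
    "churn_calc": {"sources": ["fnd.customer"], "targets": ["calc.churn"]},
}

def derive_dependencies(jobs):
    # Map each dataset to the job that writes it, then link readers to writers.
    writers = {t: name for name, job in jobs.items() for t in job["targets"]}
    return {
        name: sorted({writers[s] for s in job["sources"] if s in writers})
        for name, job in jobs.items()
    }
```

Here `derive_dependencies(jobs)` would report that `conform_customer` depends on `load_customer`, and `churn_calc` on `conform_customer`, purely from the declared datasets.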

Proposed Approach

Typical flow:

  1. On-premise -> landing -> SRI -> FND -> CALC/USECASE -> reporting cache -> PBIX files (if any)
  2. Custom jobs can exist at any of these stages (ADF pipelines to on-premise, Databricks notebooks, shell scripts, Java programs)

Scheduling using ADF

  • We have one master pipeline per scheduled trigger - think of it as Main() in C
  • It has a few activities that call the Guzzle API and simulate a synchronous call via a loop, waiting for completion. It initializes and triggers the Guzzle stages. Submitting returns a "run id", which is then used to poll status
  • Status is FAILED if any of the stages/contexts/dates failed; the "run id" can be used to resolve which stages ran
  • It calls Guzzle's Batch Run API and passes the context and stage list (all stages by default) (to be added in Guzzle)
  • It generates a notification on completion of the master pipeline
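The submit-then-poll loop described above can be sketched as follows. The function names, status strings, and overall API shape are assumptions for illustration; the stub at the bottom stands in for the real HTTP calls to Guzzle:

```python
import time

def run_stage_sync(submit, get_status, poll_interval=30, timeout=4 * 3600):
    """Submit a Guzzle batch run, then poll until a terminal status.

    `submit()` returns a run id; `get_status(run_id)` returns a status
    string such as "RUNNING", "SUCCEEDED" or "FAILED". These names and
    the shape of the API are assumptions, not Guzzle's actual contract.
    """
    run_id = submit()
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = get_status(run_id)
        if status in ("SUCCEEDED", "FAILED"):
            return run_id, status
        time.sleep(poll_interval)
    return run_id, "TIMEOUT"

# Stubbed example standing in for the real API calls:
statuses = iter(["RUNNING", "RUNNING", "SUCCEEDED"])
run_id, final = run_stage_sync(lambda: "run-001",
                               lambda rid: next(statuses),
                               poll_interval=0)
```

In ADF the same loop is built from an Until activity wrapping a Web activity plus a Wait activity; the sketch just makes the control flow explicit.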

Other items

  1. Running logs are important - decide whether this is Log4j-style behavior writing to blob, or the DB (e.g. when a normal notebook simply writes its output to a file in blob)
  2. One-time setup needs to include stages and context columns; this needs to be emphasized in the documentation

Why we need stages

  1. First, it is complex to manage table-level dependencies; it is better to manage dependencies at the stage level
  2. Second, you can't do straight-through processing, since you will have some FND tables and some STG tables in flight

Calling external jobs in Guzzle from job groups

  • All non-Guzzle-native jobs are handled through this. We support four endpoints, with UI changes to match: JDBC (for stored procedures), shell, ADF pipeline, and Databricks. The Databricks endpoint is similar to the local shell endpoint - no password or credentials; it simply runs from the compute environment, so the account used by the compute is leveraged to access notebooks
  • External jobs should support synchronous execution (start/stop/status)
  • Logs for these jobs go into the running log file - as much as the stub can capture. We add links to retrieve further logs from ADF/Databricks if someone wants to know more; we do not pull them into Guzzle
  • Killing jobs should be supported for all external job types, to the best of each endpoint's ability
  • Runtime audit: start/stop/status is captured by Guzzle, assuming Guzzle is orchestrating or calling the job
  • There is deeper audit (we get work-unit/monitoring (runtime audits) for ADF, similar to ADF sync)
  • Lineage support remains limited to what Guzzle supports for the endpoint (for example, Guzzle does not support lineage for cloud-app APIs - someone can land them as files in ADLS and link the lineage to those files in external jobs)
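One way to picture the external-endpoint contract above (start/status/kill, plus log links for ADF/Databricks) is as a small interface. Everything here is a hypothetical sketch - the class and method names are not Guzzle's actual API:

```python
from abc import ABC, abstractmethod
from typing import Optional

class ExternalJobEndpoint(ABC):
    """Hypothetical contract each endpoint type (JDBC stored procedure,
    shell, ADF pipeline, Databricks notebook) would implement so the
    orchestrator can run it synchronously, capture start/stop/status in
    the runtime audit, and kill it on request."""

    @abstractmethod
    def start(self, job_config: dict) -> str:
        """Submit the job; return an external run id."""

    @abstractmethod
    def status(self, run_id: str) -> str:
        """Return RUNNING, SUCCEEDED or FAILED."""

    @abstractmethod
    def kill(self, run_id: str) -> None:
        """Best-effort cancellation."""

    def log_link(self, run_id: str) -> Optional[str]:
        """Deep link into ADF/Databricks logs; the stub captures only
        summary logs locally. Default: no link."""
        return None

class DummyShellEndpoint(ExternalJobEndpoint):
    """Toy in-memory endpoint used here only to show the contract."""
    def __init__(self):
        self._runs = {}
    def start(self, job_config):
        run_id = f"run-{len(self._runs) + 1}"
        self._runs[run_id] = "SUCCEEDED"  # pretend it ran instantly
        return run_id
    def status(self, run_id):
        return self._runs[run_id]
    def kill(self, run_id):
        self._runs[run_id] = "FAILED"
```

A real ADF or Databricks implementation would back these methods with the respective REST APIs; the point is that all four endpoint types expose the same sync surface to the job group.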

Other pattern - ADF as the master orchestrator

  1. This is NOT recommended unless the project has little complexity and wants to use Guzzle only for selective purposes
  2. ADF invokes Guzzle job groups and manages the rest of the jobs in ADF
  3. ADF then handles everything listed in #Background

Other Topics

Why context in Guzzle

  1. Business date: only relevant when you have an end-of-business/end-of-day snapshot; downstream processing goes by days, ageing computation is done per date, etc.
  2. Rename "system" to "batch_name" and make it the only context column provided by default. One can extend it if need be. Jobs are called for a specific batch. We lose the classic hierarchical batch resolution we currently support
  3. Look at how the current construct of batches will fall into place
  4. Complex dependencies between stages are important - e.g. where you converge data from multiple systems, either joining them side by side or stacking them up and then aggregating
  5. Other uses of context columns - they are leveraged by:
  • All the audit tables
  • All the recon/DQ checks
  • All the housekeeping, which may very much leverage them

Job Group and Auto dependency

  1. Today, dependencies are scoped within a job group
  2. We always run with partial = false, so that a run goes as far as it can - either through auto-dependency or sequentially (depending on how jobs are run)
  3. The next stage won't run until the current one clears
  4. We can set stages to always run with resume=Y (by setting that parameter in the ADF API calls) so that a rerun picks up from the table/job that failed (not the record that failed :) )
  5. If someone really wants to force a run, they have to force-run it with resume=N
  6. Play with parallelism to run things faster
  7. Auto-dependencies in Guzzle are still in the testing phase
  8. The expectation is that someone will want to know everything a job group will run - which means we need to pre-create all the jobs that need to run, with their order (dependencies) - so persist the DAG into job_info so that we can plot the graph in the UI
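The resume=Y semantics above can be sketched as a minimal model: rerun every job that failed last time, plus everything downstream of it, while already-succeeded jobs with no failed upstream are skipped. Job names, statuses, and the edge structure below are illustrative, not actual Guzzle schema:

```python
# Sketch of resume=Y: rerun failed jobs and everything downstream of
# them; jobs that succeeded and have no failed upstream are skipped.
# All names are hypothetical.

def jobs_to_rerun(downstream, last_status):
    """downstream: {job: [jobs that depend on it]};
    last_status: {job: "SUCCEEDED" | "FAILED"}."""
    pending = [job for job, s in last_status.items() if s == "FAILED"]
    rerun = set()
    while pending:
        job = pending.pop()
        if job in rerun:
            continue
        rerun.add(job)
        pending.extend(downstream.get(job, []))  # failures propagate down
    return rerun

downstream = {"stg_customer": ["fnd_customer"], "fnd_customer": ["calc_churn"]}
last = {"stg_customer": "SUCCEEDED", "fnd_customer": "FAILED", "calc_churn": "FAILED"}
```

With these inputs, `stg_customer` is skipped and only `fnd_customer` and `calc_churn` rerun - which is the behavior point 4 describes. The same walk over the persisted DAG in job_info could drive the UI graph in point 8.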