Batch and Resume flag in Guzzle


Batch

What is Batch in Guzzle?

  1. A batch is an ordered run of jobs, grouped into stages, with one mandatory parameter, the business date, plus additional context parameters (or context columns)
  2. Batches in Guzzle allow you to define the data integration flow with strict sequencing of the following:
    • Business dates for which data is being processed, to prevent accidental runs for future or past dates
    • Stages in the data flow (assuming data is processed through multiple layers such as staging and foundation for a given source system), to prevent accidental runs of a later stage for a given business date until the previous stage has completed
  3. Additionally, Guzzle allows tighter controls such as stage pairs, where two stages must both succeed before the batch moves to the next business date
  4. Batches also allow dependencies to be defined on other batches, bringing inter-batch dependencies - for example, the ORDER system FND layer should run only when the CUST_MASTER FND layer has completed; however, the ORDER STG layer may continue without any issues
  5. An additional sophistication that batches bring is an explicit override to allow stages to run partially if there are failures (applicable where jobs are designed to catch up on data the next day)
  6. Batches also allow reruns of batches or stages for historical dates
    • You will have to mark the batch as "Rerun" if you want to re-initialize batches for historical dates (either for a rerun or to run dates which were missed)
    • Similarly, when running a particular stage for a historical date, "Rerun" must be enabled for that stage so that it can run for the historical date
    • These two Rerun settings are independent (one can be set without the other). There are scenarios where you want a particular stage in a batch to be rerun for a historical date without re-initializing or running the entire batch; in these cases the stage and batch status have to be updated in the Guzzle repository (see the sketch after this list) and the rerun triggered. In the batch config you just mark the stage as Rerun
  7. Resuming batches from the point where they failed (applies only to batches which are configured to run with partial=false)
  8. Batches have built-in capabilities to support catch-up runs, where all the business dates (or open batches) for which a given stage is outstanding (has not run) are processed in ascending order
  9. Batches provide the notion of context columns, starting from system_name (or batch_name) plus other columns like location and workspace, which can be used by jobs to organize or partition data or to apply specific logic
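For the manual repository update mentioned in point 6, here is a minimal sketch. It assumes the batch_control table and the system/location/business_date/batch_status columns that appear in the queries later on this page; where Guzzle stores the per-stage statuses is not shown here, so treat this as illustrative rather than the exact procedure.

```sql
-- Hypothetical sketch: reopen a single historical batch so that a stage can be re-triggered.
-- Column names mirror the batch_control queries shown later on this page; the literal
-- values are examples only. The corresponding stage status must also be reset wherever
-- Guzzle keeps it (not covered on this page).
UPDATE batch_control
SET batch_status = 'OPEN'
WHERE system = 'TEST1'
  AND location IS NULL
  AND business_date = CAST('2020-06-27 19:00:00' AS datetime);
```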

Why do we need batches?

  1. To have strict dependencies in data flows - whether they are stages within a data flow or cross-data-flow dependencies
  2. Flexibility to run batches in parallel from multiple systems, or even multiple stages from the same batch
  3. To catch up on historical dates whose runs are pending

Batch - General Rules

  1. Batch execution has two independent steps: Batch Init and Batch Stage run. These are completely independent, so never mix them up
  2. WARNING and SUCCESS are treated the same when deciding whether the next stage should run or not
  3. Rerun of batches and the Resume flag are independent concepts, though they may appear similar. The Rerun feature allows one to re-init batches for past periods even though the system has moved on to future days. The Resume feature allows reviving an existing failed stage / job group and continuing from the point where it failed
  4. A fresh stage_id is generated (as explained below) unless the run is re-triggered with Resume=Y
  5. The business date is the most crucial aspect of the batch, and it supports the timestamp data type. You can use either a plain business date by setting the time portion to 00:00:00, or a business date with a timestamp, as illustrated below
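As a small illustration of rule 5, both forms of business date land in the same timestamp column in the batch repository. The queries below are a sketch against the batch_control table and columns shown later on this page; the literal values are examples only.

```sql
-- Plain business date: keep the time portion as midnight.
SELECT * FROM batch_control
WHERE system = 'TEST1' AND business_date = CAST('2020-06-27 00:00:00' AS datetime);

-- Business date with a timestamp (for example, intraday batches).
SELECT * FROM batch_control
WHERE system = 'TEST1' AND business_date = CAST('2020-06-27 19:00:00' AS datetime);
```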

Batch Init

  1. Batch init needs a mandatory business date as a parameter, along with the context columns
  2. You can do the init with a specific business date, or with a range using from and to dates. "To Business Date" is always mandatory, while "From Date" is taken from the last available batch which is in a non-aborted state
  3. When doing a batch init for a range of values, the increments are done by date (there is no support for hours, minutes or seconds yet). The number of days used for the increment depends on a batch-level setting, and you can always override it at the time of doing the batch init
  4. Rerun of historical batches can have an undesired impact on the data, hence Guzzle enforces additional checks
  5. Batch Init will avoid creating a record for a business date + context columns if a batch already exists whose status is not ABORTED, unless one of the following exceptions applies:
    • it is the latest business date and the existing batch is in status WARNING or SUCCESS
    • it is a historical business date, batch-level rerun is marked as Y (True), and the given business date is missing or has status SUCCESS or WARNING
  6. When doing a batch init, Guzzle runs the query below. The intention is to identify whether the context has already moved to a future date; if it has, the init for the historical date is only allowed when rerun is enabled (see the two scenarios that follow)
SELECT COUNT(1) FROM batch_control WHERE system='TEST1' AND location is null AND business_date > CAST('2020-06-27 19:00:00' AS datetime) AND batch_status <> 'ABORTED'
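To see which rows trip this check, a variant of the same query (same table, same predicates, just returning the rows instead of a count) can be useful. This is a sketch; the literal values are examples only.

```sql
-- Sketch: list the future-dated, non-aborted batches that would block a historical init
-- for this context. Same predicates as the eligibility check above.
SELECT business_date, batch_status
FROM batch_control
WHERE system = 'TEST1'
  AND location IS NULL
  AND business_date > CAST('2020-06-27 19:00:00' AS datetime)
  AND batch_status <> 'ABORTED'
ORDER BY business_date;
```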
  7. When rerun = N: assuming there is an existing batch for 2020-06-27 20:00:00, the batch init for 2020-06-27 19:00:00 will fail
[INFO] 2020-06-28 04:07:37,624 com.justanalytics.guzzle.orchestration.service.BatchControlService - initializing batch record
[INFO] 2020-06-28 04:07:37,629 com.justanalytics.guzzle.orchestration.service.BatchControlService - checking eligibility criteria for new batch
[INFO] 2020-06-28 04:07:37,632 com.justanalytics.guzzle.orchestration.service.AbstractJdbcBatchControlService - checking is last batch record : 
SELECT COUNT(1) FROM batch_control WHERE system='TEST1' AND location is null AND business_date > CAST('2020-06-27 19:00:00' AS datetime) AND batch_status <> 'ABORTED'
[INFO] 2020-06-28 04:07:37,645 com.justanalytics.guzzle.orchestration.service.AbstractJdbcBatchControlService - is last record? : false
[INFO] 2020-06-28 04:07:37,646 com.justanalytics.guzzle.orchestration.service.BatchControlService - checking rerun batch is allowed ? : false
[INFO] 2020-06-28 04:07:37,646 com.justanalytics.guzzle.orchestration.service.BatchControlService - do not allowed to create new batch
[INFO] 2020-06-28 04:07:37,647 com.justanalytics.guzzle.orchestration.service.BatchControlService - batch in not initialized
[INFO] 2020-06-28 04:07:37,647 com.justanalytics.guzzle.orchestration.BatchInitializer$ - Orchestration application ended                                                                                                                             
  8. When rerun = Y: assuming there is an existing batch for 2020-06-27 20:00:00, the batch init for 2020-06-27 19:00:00 will succeed if there is no batch for 2020-06-27 19:00:00 yet
[INFO] 2020-06-28 04:09:00,761 com.justanalytics.guzzle.orchestration.service.AbstractJdbcBatchControlService - checking is last batch record : 
SELECT COUNT(1) FROM batch_control WHERE system='TEST1' AND location is null AND business_date > CAST('2020-06-27 19:00:00' AS datetime) AND batch_status <> 'ABORTED'
[INFO] 2020-06-28 04:09:00,768 com.justanalytics.guzzle.orchestration.service.AbstractJdbcBatchControlService - is last record? : false
[INFO] 2020-06-28 04:09:00,769 com.justanalytics.guzzle.orchestration.service.BatchControlService - checking rerun batch is allowed ? : true
[INFO] 2020-06-28 04:09:00,774 com.justanalytics.guzzle.orchestration.service.AbstractJdbcBatchControlService - fetching batch records for given business date : 2020-06-27 19:00:00
[INFO] 2020-06-28 04:09:00,775 com.justanalytics.guzzle.orchestration.service.AbstractJdbcBatchControlService - fetching batch record sql : select * from batch_control where system='TEST1' AND location is null AND business_date=CAST('2020-06-27 19:00:00' AS datetime)
[INFO] 2020-06-28 04:09:00,779 com.justanalytics.guzzle.orchestration.service.AbstractJdbcBatchControlService - 0 batch records found for given business date ( 2020-06-27 19:00:00 )
[INFO] 2020-06-28 04:09:00,780 com.justanalytics.guzzle.orchestration.service.BatchControlService - creating batch is being created for the first time
[INFO] 2020-06-28 04:09:00,783 com.justanalytics.guzzle.orchestration.service.BatchControlService - checking for batch restrictions
[INFO] 2020-06-28 04:09:00,784 com.justanalytics.guzzle.orchestration.service.BatchControlService - is batch restriction? : false
[INFO] 2020-06-28 04:09:00,785 com.justanalytics.guzzle.orchestration.service.JdbcBatchControlService - creating batch entry in batch_control table
[INFO] 2020-06-28 04:09:00,798 com.justanalytics.guzzle.orchestration.service.JdbcBatchControlService - batch id : 200628040900798745, batch status : OPEN, business date : 2020-06-27 19:00:00, stages : List(STG), context param : Map(system -> TEST1, location -> null)
[INFO] 2020-06-28 04:09:00,870 com.justanalytics.guzzle.orchestration.service.JdbcBatchControlService - batch entry created with batch_id : 200628040900798745

Running Batches (or running stages)

  1. The focus of a stage run is to check all the available batches for the given context
  2. The batch run fetches all the batches which are candidates for a run (OPEN or FAILED; SUCCESS, WARNING and ABORTED are all done and dusted)
  3. When a stage is defined with Partial=true, Guzzle will proceed to run all the job groups with partial=true. If a job group runs with partial=true, the jobs in that job group continue to run even if a previous job fails. At least one job in each job group tagged to the stage has to succeed for Guzzle to mark the stage as SUCCESS or WARNING
  4. The candidate stages within those batches are the ones in FAILED or NOT_STARTED status. Guzzle fetches all the batches for a given set of context columns using the simple query below and runs them. It is quite possible that it gets batches for dates D1, D3 and D4 which are NOT_STARTED/FAILED and runs them, while skipping D2 which might not be initialized or is already in SUCCESS or WARNING status (see the illustration after the query)
SELECT * FROM batch_control WHERE system='TEST1' AND location is null AND batch_status IN ('OPEN', 'FAILED') ORDER BY business_date
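To make point 4 concrete, here is a hypothetical illustration of the candidate selection; the data values are made up and only the query itself comes from above.

```sql
-- Suppose batch_control holds four business dates for TEST1:
--   D1 = 2020-06-24  batch_status OPEN     (stage NOT_STARTED)
--   D2 = 2020-06-25  batch_status SUCCESS  (already done)
--   D3 = 2020-06-26  batch_status FAILED   (stage FAILED)
--   D4 = 2020-06-27  batch_status OPEN     (stage NOT_STARTED)
-- The candidate query returns D1, D3 and D4 in ascending business_date order and skips D2:
SELECT * FROM batch_control
WHERE system='TEST1' AND location is null AND batch_status IN ('OPEN', 'FAILED')
ORDER BY business_date;
```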

Life cycle with one stage

| Sequence | Batch status | Stage status |
|---|---|---|
| On Init | OPEN | NOT_STARTED |
| When running | RUNNING | RUNNING |
| When failed | FAILED | FAILED |
| When success | SUCCESS | SUCCESS |


Life cycle with two stages without a stage pair

When a stage pair is not enabled, the batch keeps moving the first stage (in this case STG) forward irrespective of whether FND has run for that business date.

When triggering STG alone, Guzzle looks for OPEN/FAILED batches with STG either FAILED or NOT_STARTED, and runs all of them without waiting for FND:

| Sequence | Batch status | STG status | FND status |
|---|---|---|---|
| On Init | OPEN | NOT_STARTED | NOT_STARTED |
| When running | RUNNING | RUNNING | NOT_STARTED |
| When failed | FAILED | FAILED | NOT_STARTED |
| When success | OPEN | SUCCESS | NOT_STARTED |

When triggering FND alone, Guzzle looks for OPEN/FAILED batches with STG in SUCCESS or WARNING, and runs all of these batches in ascending order:

| Sequence | Batch status | STG status | FND status |
|---|---|---|---|
| When running | RUNNING | SUCCESS | RUNNING |
| When failed | FAILED | SUCCESS | FAILED |
| When success | SUCCESS | SUCCESS | SUCCESS |

When triggering STG and FND together, Guzzle tries to run STG and FND for each OPEN/FAILED batch, starting with the smallest business timestamp. If STG is already complete for the first business date, it starts with FND and then proceeds to the next business date:

| Sequence | Batch status | STG status | FND status |
|---|---|---|---|
| When running | RUNNING | RUNNING | NOT_STARTED |
| When running | RUNNING | SUCCESS | RUNNING |
| When success | SUCCESS | SUCCESS | SUCCESS |

**Note:** When you run STG and FND together, if FND fails the run will not proceed to STG of the next day (the invocation exits). Hence, if STG is expected to run for all days irrespective of FND having issues, trigger STG and FND separately.

Life cycle with two stages with a stage pair

  1. Stage pairs are meant to ensure Guzzle does not proceed to the next business date for the stages that form part of the stage pair until all of those stages for a given business date are SUCCESS or WARNING
  2. This is sometimes useful to ensure that STG does not proceed to the next day until FND has picked up that data. Common scenarios where this has been used are:
    • If STG is truncate-insert, we don't want the system to replace the existing staging data with the next day's until FND has picked it up
    • Or cases where the FND layer keeps only the latest copy of the data and there is a further SNP stage. Until SNP has run, FND should not proceed to the next business date
  3. The table below shows what happens when triggering STG and FND for TEST3, where FND being set to FAILED stopped the batch from proceeding:
| Batch Id | Business Date | System | Batch status | STG status | FND status |
|---|---|---|---|---|---|
| 200628092010687000 | 6/27/2020 22:00 | TEST3 | OPEN | NOT_STARTED | NOT_STARTED |
| 200628092003415000 | 6/27/2020 21:00 | TEST3 | RUNNING | RUNNING | NOT_STARTED |
| 200628092010687000 | 6/27/2020 22:00 | TEST3 | OPEN | NOT_STARTED | NOT_STARTED |
| 200628092003415000 | 6/27/2020 21:00 | TEST3 | RUNNING | SUCCESS | RUNNING |
| 200628092010687000 | 6/27/2020 22:00 | TEST3 | OPEN | NOT_STARTED | NOT_STARTED |
| 200628092003415000 | 6/27/2020 21:00 | TEST3 | FAILED | SUCCESS | FAILED |

When force-running STG for the remaining date, the stage will not run:

20/06/28 09:26:27 INFO Main$: started work on stage STG
20/06/28 09:26:27 INFO BatchControlService: stage is processed once and current status is 'SUCCESS', required statuses are : [NOT_STARTED, FAILED]
20/06/28 09:26:27 INFO Main$: should run stage validation resulted false for stage : STG
20/06/28 09:26:27 INFO Main$: processing batch record: BatchRecord(BatchID(200628092010687780,Map(system -> TEST3, location -> null),2020-06-27 22:00:00.0),OPEN,Map(STG -> StageInfo(None,NOT_STARTED), FND -> StageInfo(None,NOT_STARTED)))
20/06/28 09:26:27 INFO Main$: started work on stage STG
20/06/28 09:26:27 INFO BatchControlService: previous business date stage status is not valid
20/06/28 09:26:27 INFO Main$: should run stage validation resulted false for stage : STG
20/06/28 09:26:27 INFO JobHeartbeatThread: stopping job heartbeat thread

Later we fixed FND and reran the batch (with STG and FND) and it started clearing. You will notice that it clears FND for the first batch and then starts with STG of the next batch:

| Batch Id | Business Date | System | Batch status | STG status | FND status |
|---|---|---|---|---|---|
| 200628092010687000 | 6/27/2020 22:00 | TEST3 | OPEN | NOT_STARTED | NOT_STARTED |
| 200628092003415000 | 6/27/2020 21:00 | TEST3 | RUNNING | SUCCESS | RUNNING |
| 200628092010687000 | 6/27/2020 22:00 | TEST3 | RUNNING | RUNNING | NOT_STARTED |
| 200628092003415000 | 6/27/2020 21:00 | TEST3 | SUCCESS | SUCCESS | SUCCESS |
| 200628092010687000 | 6/27/2020 22:00 | TEST3 | RUNNING | SUCCESS | RUNNING |
| 200628092003415000 | 6/27/2020 21:00 | TEST3 | SUCCESS | SUCCESS | SUCCESS |
| 200628092010687000 | 6/27/2020 22:00 | TEST3 | SUCCESS | SUCCESS | SUCCESS |
| 200628092003415000 | 6/27/2020 21:00 | TEST3 | SUCCESS | SUCCESS | SUCCESS |
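If you want to follow this progression directly in the repository, a query like the sketch below (same batch_control table and columns as the earlier queries; TEST3 is the context used in this example) shows the batch-level statuses in business-date order. The per-stage statuses shown in the tables are kept elsewhere and are not covered by this query.

```sql
-- Sketch: inspect the batch-level status history for the TEST3 context.
SELECT business_date, batch_status
FROM batch_control
WHERE system = 'TEST3'
ORDER BY business_date;
```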

Batch tries to run as many stages as it can while honoring the dependencies

  1. We have a batch with three stages and no stage pairing. Two batches are initialized for the same business date with different timestamps. Run the batches with all three stages in one go, and assume the third stage has a failure. Guzzle does not stop there and goes on to the next day's stage 1, as there is no stage pairing and it can continue running

orchestration_TEST11_201027114229854645.log

  2. This is consistent with the scenario where a batch is in FAILED status and later one more batch is initialized. Later, when running all the stages, Guzzle does pick stage 1 for the next day's batch, as it tries to process the earliest business date first. Even if stage 3 fails again, it still moves on to the next business date and continues

Resuming the failed job groups and batches

  1. This feature allows resuming a job group or stage from the point where it failed
  2. Don't confuse this with re-running a batch, which reruns the FAILED stages or the stages which are NOT_STARTED from the beginning. Resume instead runs the FAILED stage from the job where it failed last. Hence guzzle.stage.resume / guzzle.job_group.resume allow a rerun from the point of failure, which is useful when rerunning the entire job group from the beginning is expensive
  3. This is achieved by passing the parameter guzzle.stage.resume=Y / guzzle.job_group.resume=Y (for a stage or a job group respectively). It is passed at runtime (there is no setup at design time), so you can decide to go with resume=Y whenever needed. Do take note that both the original and the subsequent triggers should be run with resume=Y
  4. It will use the same job group instance id and start running from the job which failed
  5. It applies when running the stages in a batch - the implementation sits with the job group. The batch cascades the same parameter to the job group, and the stage and job group use the same instance id. The same applies when a job group is re-triggered. Here is how the stages run when Resume=N.

And here is how they run when Resume=Y.

  6. The job group should always run with partial=N, as only job groups which stop on the first failure are candidates for resume (job groups with partial=Y are supposed to continue to the end and finish with status WARNING if there is any failure)
  7. If a job group has the same job added multiple times (with the same set of parameters), only one instance will be triggered (https://github.com/ja-guzzle/guzzle_common/-/issues/379). Ideally there should not be a situation where the same job is added multiple times with the same parameters; if that is required, it is better to create a clone of the job
  8. Even when running batch stages, the same parameter can be passed:
    1. The resume flag requires every parameter to match between the re-triggered and the original run
    2. It excludes standard parameters like the job instance id