Guzzle Scheduler - ja-guzzle/guzzle

User Story

https://github.com/ja-guzzle/guzzle_common/-/issues/205

Overview

The skip_allowed setting moves to the schedule level: an entire schedule cycle is skipped or allowed based on whether the previous cycle has completed. The behavior is the same whether the schedule is configured as parallel or serial; all runnables of the current cycle have to complete before the next cycle starts. Whether the current cycle is running is tracked purely from the actual running cycle inside the Spring Boot application (it should not use job info or other tables).
A new setting, allow_skip, is honored only for parallel: false. It decides whether a serial schedule proceeds to the next runnable if the current runnable fails. (The sample configuration below uses allow_skip for the schedule-level skip flag and partial_run for this continue-on-failure setting.) The status of whether the runnable should behave same
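The schedule-level skip decision described above can be sketched as follows. This is only an illustration in Python, not the Spring Boot implementation; the class name `CycleTracker` and its methods are hypothetical, and the flag is called `allow_skip` as in the sample configuration. It assumes that when skipping is not allowed, an overlapping cycle is simply started alongside the running one.

```python
import threading
from collections import Counter


class CycleTracker:
    """Hypothetical in-memory tracker of running cycles per schedule.

    State lives only in the process, mirroring the requirement that the
    scheduler should not consult job info or other tables.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._running = Counter()  # schedule name -> number of running cycles

    def try_start(self, schedule_name, allow_skip):
        """Return True if a newly due cycle should start, False if it is skipped."""
        with self._lock:
            if allow_skip and self._running[schedule_name] > 0:
                return False  # previous cycle still running: skip this one
            self._running[schedule_name] += 1
            return True

    def finish(self, schedule_name):
        """Mark one cycle of the schedule as complete (all runnables done)."""
        with self._lock:
            if self._running[schedule_name] > 0:
                self._running[schedule_name] -= 1
```

A caller would invoke `try_start` when a trigger fires and `finish` only after every runnable of that cycle has completed, whether the runnables ran in parallel or in sequence.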

Users can define multiple schedules in the $GUZZLE_HOME/conf/instance/schedules directory.

# The file resides in guzzle_home/conf/instance/schedule
# Schedules are refreshed file by file from the latest available version; a new schedule takes effect immediately and its runnables start triggering as per the new schedule.
# While schedules are refreshed, a cycle already running for a given schedule continues without any impact.
# A record is created for each schedule, and each runnable points to it as its parent instance id.
# For a stage, this can overwrite the batch id, which is the natural parent, and may impact the monitoring UI and any external reports built on Guzzle tables.

version: 1
schedule:
  #type: cron
  type: daily 
  #type: daysofweek
  parallel: true ## Whether the runnables run in sequence or in parallel (if parallel, all of them start at once unless a QR prevents it)
  allow_skip: true ## If a cycle (parallel or serial) is still running when the next cycle is due, whether to skip the next cycle or trigger another cycle in parallel
  partial_run: true ## Serial only: whether to continue after a failure in a previous runnable, or fail and stop this schedule instance. Can be omitted for parallel schedules. To be checked: should this setting be optional?
  properties:
    #daysofweek: 1,2,3
    trigger_time: 17:48
    #trigger_every: 1
    #cron: "0 */1 * * * *" 
parameters:
  system: CUSTOMER
  param2: value2 
  #business_date: ${guzzle.scheduler.orderdate} ## Timestamp; can be manipulated using Groovy
runnables:
  - id: runnable-id1 ## To check: what is the significance of this item?
    name: jg1
    type: job_group ## This is always at runnable level; change type to the new naming convention
    environment: test  ## This is always at runnable level
    spark_environment: local1 ## This is always at runnable level
    parameters:
      system:   ## Empty parameters will be prevented from the UI; the backend may allow them
      location: SG
      #odate: ${guzzle.scheduler.orderdate} ## Timestamp
      #sdate: ${guzzle.scheduler.systemdate} ## Timestamp
      src_table: table1
      tgt_table: table2
    spark_properties:  ## Only works for YARN; also at runnable level
      sp1: spv1
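The partial_run comment in the sample above (serial schedules only) can be sketched as follows. This is a minimal Python illustration under assumptions: the function name `run_serial` and the status strings `SUCCESS` and `FAILED` are hypothetical, and each runnable is modeled as a plain callable that raises on failure.

```python
def run_serial(runnables, partial_run):
    """Run (name, callable) pairs in order and return a list of (name, status).

    With partial_run=True a failure is recorded and the next runnable still
    runs; with partial_run=False the schedule instance stops at the first
    failure and later runnables are not attempted.
    """
    results = []
    for name, action in runnables:
        try:
            action()
            results.append((name, "SUCCESS"))
        except Exception:
            results.append((name, "FAILED"))
            if not partial_run:
                break  # fail and stop this schedule instance
    return results
```

For a parallel schedule this setting has no effect in the sketch, since all runnables are started together and none waits on the outcome of another.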

The name of the schedule file is the name of the schedule, e.g. create the file "daily_customers.yml" for a schedule named "daily_customers".

A schedule config file looks like the following:

version: 1
schedule:
  type: daily/daysofweek/cron
  parallel: true
  properties:
    ...
parameters:
  system: CUSTOMER
  param2: value2
  business_date: ${guzzle.scheduler.systemdate}
  custom_date: ${guzzle.scheduler.systemdate[0..3] + guzzle.scheduler.systemdate[5..6] + guzzle.scheduler.systemdate[8..9]}
runnables:
  - id: 383d9cd8-fc37-4f94-8f32-95cfcc63a62d
    name: job1
    type: job
    environment: dev
    spark_environment: local_spark
    quantity_resource: qr1
    concurrent: false
    parameters:
      location: SG
      param3: ${system}_${location}
    spark_properties:
      ...
  - id: 3fec2f9b-b3a0-4f51-b47b-191ed0e0e091
    name: job_group1
    type: job_group
    environment: test
    spark_environment: hdp_cluster
    quantity_resource: qr2
    parameters:
      location: IN
      param2: new_value2
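The way parameters combine in the example above can be sketched as follows. This is an assumption drawn from the sample, not confirmed behavior: runnable-level parameters are taken to override schedule-level ones (as `param2` suggests), and `${name}` placeholders are replaced from the merged set plus scheduler-provided values. The function name `resolve_parameters` and the sample date value are hypothetical. Expressions such as the `custom_date` slicing are evaluated by Groovy in the real system and are deliberately left untouched here.

```python
import re

# Matches simple placeholders such as ${system} or ${guzzle.scheduler.systemdate}.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z0-9_.]+)\}")


def resolve_parameters(schedule_params, runnable_params, builtins=None):
    """Merge schedule and runnable parameters and expand simple placeholders."""
    params = {**schedule_params, **runnable_params}  # runnable level wins
    lookup = {**(builtins or {}), **params}

    def substitute(value, depth=0):
        # Non-strings pass through; the depth limit guards against cycles.
        if not isinstance(value, str) or depth > 10:
            return value

        def replace(match):
            key = match.group(1)
            if key not in lookup:
                return match.group(0)  # unknown name: leave as written
            return str(substitute(lookup[key], depth + 1))

        return _PLACEHOLDER.sub(replace, value)

    return {name: substitute(value) for name, value in params.items()}


resolved = resolve_parameters(
    {"system": "CUSTOMER", "param2": "value2"},
    {"location": "SG", "param3": "${system}_${location}"},
)
print(resolved["param3"])
```

Under these assumptions the first runnable in the sample would receive `param3` built from the schedule-level `system` and its own `location`.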

To schedule daily at 01:00,13:00:

schedule:
  type: daily
  properties:
    trigger_time: 01:00,13:00

To schedule every 2 hours:

schedule:
  type: daily
  properties:
    trigger_every: 2

To schedule at 01:00 every Monday and Thursday:

schedule:
  type: daysofweek
  properties:
    daysofweek: 1,4
    trigger_time: 01:00
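The daily and daysofweek examples above can be read as the following rule for finding the next trigger time. This is a sketch under assumptions: the function name `next_trigger` is hypothetical, day numbers follow the example (1 = Monday, 4 = Thursday), and `trigger_every` and cron schedules are not covered.

```python
from datetime import datetime, time, timedelta


def next_trigger(now, trigger_time, daysofweek=None):
    """Return the first trigger datetime strictly after `now`.

    trigger_time: comma-separated HH:MM values, e.g. "01:00,13:00".
    daysofweek:   optional comma-separated day numbers, 1 = Monday.
    """
    times = sorted(time.fromisoformat(t.strip()) for t in trigger_time.split(","))
    allowed = None
    if daysofweek is not None:
        allowed = {int(d) for d in str(daysofweek).split(",")}

    # Today plus the next seven days always covers one full week.
    for offset in range(8):
        day = now.date() + timedelta(days=offset)
        if allowed is not None and day.isoweekday() not in allowed:
            continue
        for t in times:
            candidate = datetime.combine(day, t)
            if candidate > now:
                return candidate
    raise ValueError("daysofweek does not contain a valid day (1-7)")


# Monday 2024-01-01 02:00 with the Monday/Thursday schedule from the example.
print(next_trigger(datetime(2024, 1, 1, 2, 0), "01:00", "1,4"))
```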

To schedule with a custom cron expression, here at 9:30 am every Monday through Friday:

schedule:
  type: cron
  properties:
    cron: 0 30 9 ? * MON-FRI

$GUZZLE_HOME/conf/schedules/quantity_resource.yml is the config file that defines quantity resources:

version: 1
quantity_resources:
  qr1: 10
  qr2: 5

A few implementation notes:

Implement the scheduler features in the guzzle api project.
Spring's task scheduler can be used to schedule jobs at a given frequency. See https://www.baeldung.com/spring-task-scheduler for more information.
If the schedule of a job is updated through the UI, the change should take effect immediately by comparing the in-memory references of the schedules, using the id of the runnable item.
There should be a scheduled task, triggered at a regular interval (defined by application.syncSchedule.jobScheduler in application.yml of the api project), that reads the schedule files and the quantity_resource.yml file and updates the in-memory references of the schedules using the id of the runnable item. There should also be an API that triggers this sync immediately (for example, see the API /api/sync).
If the value of a QR is increased and runnable items are waiting on that QR, those items can start immediately if there is sufficient QR capacity.
If the value of a QR is decreased, new runnable items run according to the new QR capacity.
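The quantity resource behavior in the last two notes can be sketched as a resizable capacity pool. This is an illustration only: the class `QuantityResource` and its methods are hypothetical, each runnable item is assumed to consume one unit of capacity, and the sketch is not thread-safe.

```python
from collections import deque


class QuantityResource:
    """Hypothetical capacity pool with a FIFO queue of waiting runnable items."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.in_use = 0
        self.waiting = deque()

    def submit(self, runnable_id):
        """Start the item if a unit is free, otherwise queue it. True if started."""
        if self.in_use < self.capacity and not self.waiting:
            self.in_use += 1
            return True
        self.waiting.append(runnable_id)
        return False

    def release(self):
        """A running item finished; return the ids that can start now."""
        self.in_use = max(0, self.in_use - 1)
        return self._drain()

    def resize(self, new_capacity):
        """Apply a new capacity from quantity_resource.yml.

        An increase lets waiting items start immediately. A decrease does not
        interrupt running items; it only limits items that start afterwards.
        """
        self.capacity = new_capacity
        return self._drain()

    def _drain(self):
        started = []
        while self.waiting and self.in_use < self.capacity:
            started.append(self.waiting.popleft())
            self.in_use += 1
        return started
```

The periodic sync task would call `resize` for each QR whose value changed and then launch whatever ids the call returns.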

Old approach (not relevant)

If parallel=false at the schedule level and not all scheduled runnable items of the previous schedule trigger have finished, the current schedule trigger is ignored.

In the call we removed this requirement; concurrent running of the same runnable item now depends only on the allow_skip flag. This allows the following situation:

We have job j1, which takes 1:30 hours, and job j2, which takes 1 hour to complete. Both jobs are configured in one hourly schedule where parallel = false and allow_skip is true for both j1 and j2. Then:

00:00 schedule instance 1 - job j1 starts
01:00 schedule instance 2 - job j1 skips as the previous one is still running
01:00 schedule instance 2 - job j2 starts
01:30 schedule instance 1 - job j2 skips as job j2 from schedule instance 2 is still running

The earlier assumption was meant to avoid this situation. Please let us know what the behavior should be.
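The timeline above can be reproduced with a small simulation of per-runnable skipping. This is a sketch of the superseded behavior only; the function name `simulate` is hypothetical, times are minutes since 00:00, and a skipped job is assumed to let the instance move on to its next job at once.

```python
def simulate(durations, period, instances):
    """Simulate serial schedule instances with a per-runnable skip rule.

    durations: job name -> run time in minutes, in execution order.
    period:    minutes between schedule triggers.
    instances: number of schedule instances to trigger.
    Returns a list of (minute, instance, job, action) tuples.
    """
    jobs = list(durations)
    busy_until = {}  # job name -> minute at which its current run ends
    # Each event is (minute, instance number, index of the next job to run).
    events = [(i * period, i + 1, 0) for i in range(instances)]
    log = []
    while events:
        events.sort()
        minute, instance, index = events.pop(0)
        if index >= len(jobs):
            continue  # this instance has gone through all its jobs
        job = jobs[index]
        if busy_until.get(job, 0) > minute:
            log.append((minute, instance, job, "skip"))
            events.append((minute, instance, index + 1))
        else:
            log.append((minute, instance, job, "start"))
            busy_until[job] = minute + durations[job]
            events.append((busy_until[job], instance, index + 1))
    return log


for minute, instance, job, action in simulate({"j1": 90, "j2": 60}, 60, 2):
    print(f"{minute // 60:02d}:{minute % 60:02d} schedule instance {instance} - job {job} {action}")
```

With the j1 and j2 durations from the example, the simulation yields the same four events, including the skip of j2 in schedule instance 1 at 01:30.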
