JSON Configuration File - noetl/noetl GitHub Wiki

Proposed logic and format of the JSON configuration file to support forks that run steps in parallel.

The new layout of the JSON configuration file supports jobs, tasks, steps, and cursors. Take a pipeline, for example, to be a job. There may be multiple pipelines to run, each for a different project, so these are different jobs that execute in a certain sequence. Each pipeline job is made up of a sequence of tasks, and each task controls a group of steps. The steps in a task are ordered in one or more sequences. The starting step of each sequence is listed under the task's TASK_FORK, which indicates the sequences the task should execute in parallel and where each sequence begins. On the level below steps is the cursor, which is a sequence of values, such as a range of dates or a set of regions. The step runs once for each element in the cursor set.

The figure below shows the flow of execution for a single job.

Levels of Interaction:

  • Jobs: Jobs are the highest level in the hierarchy of executions in the program. Tasks, steps, and cursors are all building blocks to create jobs. Each job has a separate configuration file. Only one job can execute at a time, and a job contains a sequence of tasks.
  • Tasks: Tasks are the master control of steps. Each task starts with either a single step or a "fork", which is an array of steps to run in parallel, each being the starting point of its respective sequence of steps. Each task is a container for related steps, which may fork and merge within the task and are ultimately joined at the end of the task. The task has a "next" field that references the next task in the sequence for both the success and the failure of the current task. The next task is called only after all forked step sequences have completed and been merged. The first task is always "start" and the last task is always "exit".
  • Branches: Branches are an entity specific to NOETL3. They are a way of grouping steps and controlling forks and merges. A branch groups a sequence of steps and keeps track of its current step, its last step, and its step failures. The figure to the right shows the new structure of the green branch after a failure at Step 3. Step 3's nextFail is Step3a. Step3a and its next step (Step3b) are added to the branch. Step3a is set as the branch's current step, so when the branch runs again, it executes the failure portion of the branch before returning to the original steps in the branch.
  • Steps: Steps can fork other steps, merge steps, or simply reference the next step in the sequence. Each step has a command list to execute, and the success or failure of the step depends on the outcome of those commands. The last step in each sequence points to "exit".
    • Success/Failure Logic for Steps: (Figure 2 and Figure 3 demonstrate an example of the Success/Failure logic)
      • If a step is successful, the following step in the sequence under SUCCESS is executed. If the step is the last in its sequence, the SUCCESS field in the configuration file should be left empty, indicating that the sequence has successfully completed. The step sequence is marked as a success, and the program waits for the other sequences running in parallel to end.
      • Under FAILURE, each step has MAX_FAILURES and NEXT_STEP fields. If a step fails, it will continue calling itself while the number of failures < MAX_FAILURES.
        • Once the step has reached MAX_FAILURES failures, it is redirected to the NEXT_STEP under FAILURE.
          • NEXT_STEP can point to another step in the sequence (indicating that the failed step isn't integral to the completion of the step sequence in the task). In this case, the step sequence continues, and the sequence isn't marked as failed. However, if another step in the sequence fails and directs the program to the "failure step" (discussed in the bullet point below), NoETL should perform a traceback and, after the error has been corrected, restart the program at the first step that failed in the sequence.
          • NEXT_STEP can also direct the program to a "failure step". This "failure step" handles where to go in the case of a failure. It halts its sequence at the current step and marks the step as failed, and the program waits until the step sequences running in parallel have completed. The program is then directed to the appropriate "failure task", which, in turn, either directs the program to an exit or opens an error/bug correcting page.
        • One of the core aspects of NoETL's functionality is task management, especially in the case of a failure. Rather than having to run each task manually and start the entire program over each time a step fails, the goal is to take a snapshot of the configuration when the program hits a failure. NoETL marks both the task that failed and the individual step that caused the task to fail, so it knows where to pick up. When all the step sequences end, the program indicates that it has encountered a failure and brings up logs and other resources to help correct the error. When the program runs again, the "start" task directs the program to start at the failed task. The TASK_FORK under that task (the list of steps that start each step sequence) drops any successful sequences and contains, for each failed sequence, the step that failed.
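The success/failure rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not NoETL's implementation: each step is a flat dict with a hypothetical `run` callable, `SUCCESS` and `NEXT_STEP` are plain step names, an empty `NEXT_STEP` stands in for the "failure step", and WAITTIME is ignored.

```python
from typing import Callable, Dict, Optional


def run_step_with_retries(run: Callable[[], bool], max_failures: int = 1) -> bool:
    """Re-run a step until it succeeds or has failed max_failures times."""
    failures = 0
    while failures < max_failures:
        if run():
            return True
        failures += 1
    return False


def run_sequence(steps: Dict[str, dict], first_step: str) -> Optional[str]:
    """Follow SUCCESS and NEXT_STEP links starting at first_step.

    Returns None if the sequence completed, or the name of the first step
    that failed if the sequence ended in the failure step (assumed here to
    be an empty NEXT_STEP).
    """
    first_failed: Optional[str] = None
    current: Optional[str] = first_step
    while current:
        step = steps[current]
        if run_step_with_retries(step["run"], step.get("MAX_FAILURES", 1)):
            current = step.get("SUCCESS")  # empty SUCCESS ends the sequence
            continue
        if first_failed is None:
            first_failed = current  # remembered for the traceback
        current = step.get("NEXT_STEP")
        if not current:
            return first_failed  # sequence is marked as failed
    return None


# Hypothetical steps: "s2" fails but is routed around; "t1" and "t2" both fail.
steps = {
    "s1": {"run": lambda: True, "SUCCESS": "s2"},
    "s2": {"run": lambda: False, "MAX_FAILURES": 2, "NEXT_STEP": "s3"},
    "s3": {"run": lambda: True, "SUCCESS": ""},
    "t1": {"run": lambda: False, "NEXT_STEP": "t2"},
    "t2": {"run": lambda: False, "NEXT_STEP": ""},
}
print(run_sequence(steps, "s1"))
print(run_sequence(steps, "t1"))
```

In this sketch the first sequence still counts as a success because the failed step had somewhere to go, while the second reports `t1`, the first step that failed, as the restart point.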

Cursor: The cursor currently contains a range of dates of the data that needs to be processed for each step. The number of cursor dates processed in parallel is determined by the number of threads. The cursor can be any sequence, such as different regions.
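A minimal sketch of how a date cursor might be expanded and processed, assuming the range is given as an inclusive start/end pair of strings in the cursor's date format and the increment is a number followed by "y", "m", or "d". The function names `add_increment`, `expand_cursor`, and `run_for_cursor` are hypothetical and do not come from NoETL.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timedelta
from typing import Callable, List


def add_increment(value: datetime, increment: str) -> datetime:
    """Advance a date by an increment such as "1d", "2m", or "1y"."""
    amount, unit = int(increment[:-1]), increment[-1]
    if amount <= 0:
        raise ValueError("increment must be positive")
    if unit == "d":
        return value + timedelta(days=amount)
    if unit == "m":
        months = value.year * 12 + (value.month - 1) + amount
        # replace() raises ValueError if the day does not exist in the new month
        return value.replace(year=months // 12, month=months % 12 + 1)
    if unit == "y":
        return value.replace(year=value.year + amount)
    raise ValueError(f"unsupported increment unit: {unit!r}")


def expand_cursor(start: str, end: str, fmt: str, increment: str) -> List[str]:
    """List every cursor value from start to end (inclusive, by assumption)."""
    current = datetime.strptime(start, fmt)
    last = datetime.strptime(end, fmt)
    values = []
    while current <= last:
        values.append(current.strftime(fmt))
        current = add_increment(current, increment)
    return values


def run_for_cursor(values: List[str], run_one: Callable[[str], object], threads: int) -> list:
    """Run the step once per cursor value, at most `threads` at a time."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(run_one, values))


months = expand_cursor("201101", "201104", "%Y%m", "1m")
print(run_for_cursor(months, lambda month: f"processed {month}", threads=2))
```

A cursor over regions would skip the expansion step and pass the list of regions straight to `run_for_cursor`.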

In the example above, Figure 2 shows a job that runs Task 1, starting the sequences at stepA1, stepB1, and stepC1.

  • Sequence A is successful.
  • Sequence B encounters a failure at stepB2, but NEXT_STEP routes the sequence to stepB3, which is successful, so the program marks sequence B as a success.
  • Sequence C fails at stepC1, which routes the program to stepD1 in the case of a failure. But stepD1 fails too, and NEXT_STEP is the failure step, which waits for the other sequences to end. The program traces the failure back to the first failed step, stepC1, and marks that step as a failure, as well as marking Task 1 as failed.

The program halts, opens a bug-correcting window, and helps the user walk through the process of fixing any errors.
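The Figure 2 scenario can be approximated with a thread pool: every sequence runs in parallel, the task waits for all of them, and the task is failed if any sequence reports a failed step. This is a sketch under assumptions. `run_fork` and `task_failed` are hypothetical names, and each sequence is stubbed as a callable that returns its first failed step (or `None` on success) instead of running real commands.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Optional


def run_fork(sequences: Dict[str, Callable[[], Optional[str]]]) -> Dict[str, Optional[str]]:
    """Run all sequences in parallel and wait for every one of them to end."""
    with ThreadPoolExecutor(max_workers=max(1, len(sequences))) as pool:
        futures = {start: pool.submit(run) for start, run in sequences.items()}
        return {start: future.result() for start, future in futures.items()}


def task_failed(results: Dict[str, Optional[str]]) -> bool:
    """A task fails if any of its sequences ended with a failed step."""
    return any(failed is not None for failed in results.values())


# Stubbed outcomes of the three sequences in the example:
# A succeeds, B recovers from stepB2, C is traced back to stepC1.
results = run_fork({
    "stepA1": lambda: None,
    "stepB1": lambda: None,
    "stepC1": lambda: "stepC1",
})
print(results, task_failed(results))
```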

Figure 3 illustrates when the program starts again. The program resumes at the failed task and steps.

  • Task 1 failed, so the job starts at Task 1.
  • It only runs the sequences that failed, starting at the first failed step. Sequences A and B succeeded, and sequence C first failed at stepC1, so only stepC1 remains on the TASK_FORK step list. Sequence C completes this time, and the job can move on to execute the next task.
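The pruning of TASK_FORK on restart can be sketched as a simple filter. This assumes the previous run recorded, for each sequence's starting step, the first step that failed (or `None` if the sequence succeeded). The function name `remaining_fork` and the shape of that record are illustrative assumptions, not part of NoETL.

```python
from typing import Dict, List, Optional


def remaining_fork(task_fork: List[str], first_failed: Dict[str, Optional[str]]) -> List[str]:
    """Build the TASK_FORK for a re-run.

    Successful sequences are dropped. Each failed sequence is replaced by
    the first step that failed in it.
    """
    return [first_failed[start] for start in task_fork if first_failed.get(start)]


task_fork = ["stepA1", "stepB1", "stepC1"]
first_failed = {"stepA1": None, "stepB1": None, "stepC1": "stepC1"}
print(remaining_fork(task_fork, first_failed))
```

With the outcomes from the example, only `stepC1` is left in the list, so the re-run of Task 1 starts a single sequence.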

WORKFLOW.TASKS Details: Task and Step Relationship

TASKS: dictionary mapping task name to task details

  • DESC: task description
  • START: dictionary that maps the joining step to the steps that will be forked; if it just maps one step to one step, no steps will be forked
  • STEPS: dictionary of steps associated with the current task
    • DESC: step description
    • NEXT: instructs program what to do after the step completes or fails
      • SUCCESS: dictionary that maps the joining step to the steps that will be forked; if it just maps one step to one step, no steps will be forked
      • FAILURE: how to handle a failure
        • NEXT_STEP: where to go in the case of a step failure; if empty, step marked as failed
        • MAX_FAILURES: default = 1; maximum # of times the step can fail before stopping and being marked as failed; step will run again (if it failed) until it reaches the maximum number
        • WAITTIME: time to wait after a failure before re-running the step
    • CALL:
      • ACTION: runShell; function called when step runs
      • THREAD: number of cursor dates to run in parallel for the current step
      • CURSOR: range of dates of the data that needs to be processed
        • RANGE: list of the range(s) of dates
        • DATATYPE: date, integer
        • FORMAT: date format of cursor (for example, "%Y%m")
        • INCREMENT: increment for cursor range; append a "y", "m", or "d" for dates (year, month, day)
        • INHERIT: Mark True if step is to inherit the cursor range from its previous step (mark True for nextFail steps).
      • CMDLIST: list of commands to be executed; formatted as a list of lists, where each list is a command
  • NEXT: instructs program what to do after the current task
    • SUCCESS: next task to call after joining steps in fork
    • FAILURE: exit
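Putting the fields together, a configuration might look roughly like the Python dict below, serialized to JSON. Only the key names come from the list above. Every value is a hypothetical illustration: the step names, the "start:end" form of RANGE, the WAITTIME value, and the shape of START and SUCCESS are all assumptions.

```python
import json

config = {
    "WORKFLOW": {
        "TASKS": {
            "start": {
                "DESC": "load monthly data",
                # joining step -> steps forked in parallel (assumed shape)
                "START": {"join": ["stepA1", "stepB1"]},
                "STEPS": {
                    "stepA1": {
                        "DESC": "export one month of data",
                        "NEXT": {
                            "SUCCESS": {},  # empty: last step in its sequence
                            "FAILURE": {
                                "NEXT_STEP": "",  # empty: step marked as failed
                                "MAX_FAILURES": 2,
                                "WAITTIME": "5s",  # unit format is an assumption
                            },
                        },
                        "CALL": {
                            "ACTION": "runShell",
                            "THREAD": 4,
                            "CURSOR": {
                                "RANGE": ["201101:201112"],  # assumed range syntax
                                "DATATYPE": "date",
                                "FORMAT": "%Y%m",
                                "INCREMENT": "1m",
                                "INHERIT": False,
                            },
                            # list of lists, one inner list per command
                            "CMDLIST": [["echo", "exporting"]],
                        },
                    },
                },
                "NEXT": {"SUCCESS": "exit", "FAILURE": "exit"},
            },
        },
    },
}

text = json.dumps(config, indent=2)
print(text)
```

The second forked step, `stepB1`, is omitted for brevity. It would follow the same layout as `stepA1`.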