ESO Data Processing System Notes - casangi/RADPS GitHub Wiki

The ESO Data Processing System (EDPS) is a framework, implemented in Python, for running ESO's data processing pipelines.

References:

  • “Adaptive data reduction workflows for astronomy: The ESO Data Processing System (EDPS)”, Freudling, Zampieri, Coccato et al. 2024, A&A 681, A93
  • EDPS workflow design tutorial (more practical guide to how to create an EDPS workflow)

Terminology:

EDPS recipe - A description of a processing step (including algorithms, parameters, and methods) that can be executed independently. Each recipe specifies its required (main and associated) inputs and its outputs.

EDPS task - A specific instance of an executing recipe.

EDPS job - A task together with its inputs; roughly analogous to a Prefect “run”, but without the execution and its associated information (?)

Key feature not available in Prefect:

Adapts to different data reduction use cases (QA, production of science products) and automatically derives workflows for them.

Automatically derives processing workflows for different use cases from a single specification of a cascade of processing steps (advantage: no need to write and maintain a set of static workflows that must be modified whenever observing strategies, pipelines, or calibration plans change). Which steps are run and what is processed depends on the target selection.
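The cascade-plus-target idea can be sketched in a few lines of plain Python (task names and graph structure are invented for illustration; this is not EDPS code):

```python
# Sketch: derive the set of tasks to run from a single cascade
# specification plus a target selection (hypothetical task names).
cascade = {
    # task -> tasks it depends on
    "bias": [],
    "flat": ["bias"],
    "science": ["bias", "flat"],
    "qa_report": ["science"],
}

def derive_workflow(cascade, target):
    """Return all tasks needed (transitively) to produce `target`."""
    needed, stack = set(), [target]
    while stack:
        task = stack.pop()
        if task not in needed:
            needed.add(task)
            stack.extend(cascade[task])
    return needed

# Selecting "flat" as the target yields only the calibration subgraph:
# derive_workflow(cascade, "flat") -> {"bias", "flat"}
```

The point is that one cascade specification serves every target: no static per-use-case workflow files to maintain.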

Running different ‘recipes’ based on different circumstances is something the PL team has expressed would be useful.

Features available in Prefect

‘Smart re-runs’: if the same task is executed with the same set of parameters and input files, EDPS skips the processing and uses the previously saved result. Prefect can handle this with result caching.
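A minimal sketch of the smart re-run idea in plain Python (this is neither the EDPS nor the Prefect API; the function names are made up):

```python
# Skip a task when the same recipe has already been run with the same
# parameters and input files, and return the saved result instead.
import hashlib
import json

_cache = {}

def run_cached(recipe, params, input_files):
    """Re-use a saved result when (recipe, params, inputs) match a past run."""
    key = hashlib.sha256(
        json.dumps([recipe.__name__, params, sorted(input_files)],
                   sort_keys=True).encode()
    ).hexdigest()
    if key not in _cache:              # cache miss: actually process
        _cache[key] = recipe(params, input_files)
    return _cache[key]                 # cache hit: skip processing
```

In Prefect the same behavior is expressed as task result caching, with a cache key computed from the task's inputs.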

Automatically wait until all needed inputs for downstream tasks are present before executing them.
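The input-gating behavior can be sketched as a readiness loop (names and data layout invented for illustration):

```python
# Run each task only once all of its input products exist, the way
# EDPS/Prefect gate downstream tasks on their inputs.
def run_when_ready(tasks, available):
    """tasks: {name: (input_products, output_product)};
    available: set of products that already exist on disk."""
    done, order = set(available), []
    pending = dict(tasks)
    while pending:
        ready = [name for name, (inputs, _) in pending.items()
                 if all(i in done for i in inputs)]
        if not ready:
            raise RuntimeError("missing inputs for: %s" % sorted(pending))
        for name in ready:
            _, output = pending.pop(name)
            done.add(output)           # executing the task yields its product
            order.append(name)
    return order
```

In both systems this scheduling is derived from the declared inputs, not hand-coded per workflow.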

Interesting / worth thinking about:

Their tests cover:

  • The grouping and data association
  • Structure of the processing cascades

The goal is to verify that the generated workflow, the tasks triggered, and their inputs are as expected. No actual recipes are run.

In interactive mode, the tasks are executed in an order that is easily understood by the user and optimized for interaction at the necessary steps. The order that is most easily understood by the user (to understand the consequences of their interactions) is often not the most efficient for computing resources.

Other notes:

The EDPS server receives requests via a REST API. A request includes: the data location, a processing cascade specification, a target, and workflow parameters if needed. EDPS then derives and executes the data processing workflow, in a broader sense than Prefect does.
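A hypothetical request payload illustrating the pieces listed above (all field names are invented; consult the EDPS documentation for the real schema):

```python
# Shape of an EDPS-style processing request: data location, cascade
# specification, target, and optional parameters (field names invented).
import json

request = {
    "data_dir": "/data/raw/2024-01-15",       # data location
    "workflow": "demo0.demo0_wkf",            # processing cascade spec
    "target": "science",                      # task/product to produce
    "parameters": {"bias.method": "median"},  # optional workflow parameters
}
payload = json.dumps(request)
```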

Practical overview of the EDPS workflow

It consists of:

  • A list of tasks (the main workflow), e.g. instrumentname_wkf.py
  • A datasources file with the lists of inputs of the various tasks in the workflow, e.g. instrumentname_datasources.py
  • A file with the classification statements (e.g. instrumentname_classification.py), containing classification_rules objects. The classification rules may be in a separate file.
  • A file with rules: functions that allow classification and association of files (instrumentname_rules.py)
  • A file with the definitions of (FITS?) header keywords (instrumentname_keywords.py)
  • A YAML file with workflow and task parameters (instrumentname_parameters.yaml)
  • A file with subworkflows

Example: (based on the examples in EDPS workflow design tutorial)

Main workflow (demo0_wkf.py)

 from edps import task
 from .demo0_datasources import *

 #--- Processing tasks -------------------------------------------------------------------

 #- Task for processing raw biases
 bias_task = (task('bias')
              .with_recipe('run_bias')
              .with_main_input(raw_bias)
              .build())

 #- Task for processing raw flats
 flat_task = (task('flat')
              .with_recipe('run_flat')
              .with_main_input(raw_flat)
              .with_associated_input(bias_task)
              .build())

 #- Task for processing science exposures
 science_task = (task('object')
                 .with_recipe('run_science')
                 .with_main_input(raw_science)
                 .with_associated_input(raw_sky, min_ret=0)  # sky is an optional input
                 .with_associated_input(bias_task)
                 .with_associated_input(flat_task)
                 .with_associated_input(static_catalog)
                 .build())

demo0_datasources.py

 from edps import data_source

 # --- Raw types datasources --------------------------------------------------------------
 raw_bias = (data_source('BIAS')
             .build())

 raw_flat = (data_source('FLAT')
             .build())

 raw_science = (data_source('OBJECT')
                .build())

 raw_sky = (data_source('SKY')
            .build())

 # Catalogue of standard stars
 static_catalog = (data_source("catalog")
                   .build())

Discussion Notes

The paper raised the following questions:

  1. What are the requirements for managing workflows? What is the workflow lifecycle for our use cases and for “custom” workflows? How do we build, refactor, and reuse workflows? (There are currently 33 PL recipes.)
  2. For each of the use cases^*, what information is needed for each stage, where does the information come from, and when is it available? How do the answers to these questions affect workflow, stage, and domain library design (including interfaces/contracts between elements)? What does this information tell us about stage sequencing?
  3. Context design is an open question but using a service to make required information available is attractive.

The paper reinforces elements of the current design:

  1. Ensure domain functions are not coupled to infrastructure.
  2. We want a library of domain functions.
  3. We want a web api to launch processing.
  4. Cyclic graphs for processing and Algorithm Architecture are sufficient fundamental concepts.

Other issues raised:

  1. Brian Kirk is investigating how to apply ML for optimal task (stage?) invocation sequencing.
  2. If we were able to auto-sequence stages, we would not need a large set of recipes.
  3. The more stage information available up front, the fewer conditionals (lower complexity) in the codebase.
  4. All workflows will need to be defined to the level of detail provided in RADPS Memo 6 “Example RADPS Workflow Decomposition”.

^*Background: Use cases currently being developed: Standard Mode Data Reduction (automated), Interactive Workflow (human-assisted), Calibration, Commissioning and New Modes, Operations from Array Operator Perspective, Operations from Software Operations Perspective, Triggered Observations (target of opportunity).
