IBF Pipeline Code Structure

The following requirements ensure that a pipeline is modular, reusable, and adaptable to a specific use case / hazard type:

  1. Processing workflow: a pipeline should be built as an explicit ETL (Extract, Transform, Load) workflow, which enables modularity between retrieving, processing, and storing/sending data. In the IBF context, Transform is renamed to Forecast.

  2. Structure design: object-oriented programming should be adopted to structure a pipeline, organizing data and the methods that manipulate that data into classes.

Combining both, each process should be a dedicated class consisting of the functions that support that process. Visually, a generic data pipeline is depicted in Figure 1.

Figure 1: Generic IBF data pipeline diagram

TODO: Make this diagram easily editable by re-drawing it as a draw.io PNG (for example) and storing it inside this Wiki repo, instead of copy-pasting the diagram from an internal Miro board.

The recommended directory structure is as follows, particularly when using Poetry for app packaging. The main pipeline code lives in pipeline-name/pipeline-name, and the configuration is in config.yaml within the config directory.

pipeline-name 
β”œβ”€β”€ pyproject.toml 
β”œβ”€β”€ README.md 
β”œβ”€β”€ .env 
β”œβ”€β”€ config 
β”‚   └── config.yaml 
β”œβ”€β”€ pipeline-name 
β”‚   β”œβ”€β”€ __init__.py 
β”‚   β”œβ”€β”€ data.py 
β”‚   β”œβ”€β”€ extract.py 
β”‚   β”œβ”€β”€ forecast.py 
β”‚   β”œβ”€β”€ load.py 
β”‚   β”œβ”€β”€ pipeline.py 
β”‚   β”œβ”€β”€ scenarios.py 
β”‚   β”œβ”€β”€ secrets.py 
β”‚   └── settings.py 
└── tests 
    └── __init__.py 

Class Extract

Extract retrieves external raw data of hazard indicator(s) from one or multiple sources for the geographical area of interest.

The Extract class consists of multiple methods/functions that serve the data extraction. These methods can vary depending on the data type and format of the hazard indicator(s). Regardless of which methods it embodies, a main method extract.get_data() encapsulates the sub-extraction processes.

Note that a hazard event should be clearly defined from the beginning, in this Extract step as well as in the following ones. See API for pipelines · rodekruis/IBF-system Wiki for the definition of a hazard event per hazard type.

The Load class can be instantiated within this class to pull data (such as administrative divisions) from, or send extracted data to, the relevant IBF storage locations. The Data class is also called to structure the data model throughout the extraction processes.

Example: In the IBF river flood pipeline, Extract calculates GloFAS river discharge per administrative area per GloFAS station. The administrative area data is stored in a database, from which Load pulls it into the pipeline for the calculation.
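
A minimal sketch of what such a class could look like is shown below. Only extract.get_data() is prescribed by this wiki; the constructor arguments and helper methods are hypothetical illustrations, not the actual IBF implementation.

```python
# Minimal sketch of an Extract class. Only get_data() is prescribed by this
# wiki; the constructor arguments and helper methods are hypothetical.


class Extract:
    def __init__(self, settings, secrets, data):
        self.settings = settings  # country/hazard configuration (class Settings)
        self.secrets = secrets    # credentials for the data sources (class Secrets)
        self.data = data          # shared pipeline data model (class Data)

    def get_data(self):
        """Main method: encapsulates all sub-extraction processes."""
        raw = self._download_raw_data()
        self._aggregate_per_admin_area(raw)
        return self.data

    def _download_raw_data(self):
        """Retrieve raw hazard indicator data from the external source(s)."""
        ...

    def _aggregate_per_admin_area(self, raw):
        """Aggregate the raw data per administrative area of interest."""
        ...
```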

Class Forecast (Transform)

For an impact-based forecasting pipeline, Transform is renamed to Forecast to match the purpose of an IBF data pipeline. Forecast should perform the following:

  • check whether the extracted hazard indicator data exceeds given thresholds

  • identify the probability and severity level of the hazard

  • calculate the extent of the hazard

  • calculate the exposure and vulnerability to the hazard

Similarly to Extract, the Forecast class consists of multiple methods/functions that serve the transformation, with a main method forecast.compute_forecast() encapsulating the sub-transformation processes (a sketch follows after the example below). Consider the hazard event definition from Extract above when structuring the analysis in this step.

The Load class can also be instantiated within this class to pull data from, or send data to, the relevant IBF storage locations. The Data class is also called to structure the data model throughout the transformation processes.

Example: In the IBF river flood pipeline, Forecast checks whether the extracted GloFAS river discharge per administrative area per GloFAS station exceeds its threshold, identifies the flood extent scenario, and calculates exposure and vulnerability. The threshold(s), exposure, and vulnerability data are stored in the IBF database and IBF data storage, from which Load pulls them into the pipeline for the calculation.
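
A minimal sketch, mirroring the bullet points above. Only compute_forecast() is prescribed by this wiki; the helper methods are hypothetical.

```python
# Minimal sketch of a Forecast class. Only compute_forecast() is prescribed
# by this wiki; the helper methods are hypothetical.


class Forecast:
    def __init__(self, settings, data):
        self.settings = settings  # thresholds, lead times, etc.
        self.data = data          # shared pipeline data model

    def compute_forecast(self):
        """Main method: encapsulates all sub-transformation processes."""
        self._check_thresholds()       # does the indicator exceed its threshold?
        self._identify_severity()      # probability and severity level
        self._compute_hazard_extent()  # e.g. flood extent
        self._compute_impact()         # exposure and vulnerability
        return self.data

    def _check_thresholds(self): ...

    def _identify_severity(self): ...

    def _compute_hazard_extent(self): ...

    def _compute_impact(self): ...
```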

Class Load

The Load class is in charge of connecting a pipeline to the various IBF databases and data storages (supporting resources). It includes methods that can be called to download and upload data during the processing steps. Since IBF data pipelines share the same set of supporting resources, this class can be reused.

This also includes connections with different API services, including sending forecast data to the IBF portal via the IBF API service.

Consider the hazard event definition from Extract and Forecast above when structuring the upload in this step. A complete API call comprises a set of forecast data of one hazard event. An upload to the IBF API service has specific requirements for every hazard event scenario (see Class Scenarios); even the no-event scenario has some requirements. Read more about the data upload requirements in API for pipelines · rodekruis/IBF-system Wiki and consult the IBF SW developers for more details.
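
A minimal sketch, assuming the requests library; the method names, settings attribute, and endpoint handling are hypothetical illustrations.

```python
# Minimal sketch of a Load class. The method names, the settings attribute
# and the use of the requests library are hypothetical illustrations.
import requests


class Load:
    def __init__(self, settings, secrets):
        self.settings = settings
        self.secrets = secrets

    def get_admin_areas(self, country: str):
        """Download administrative divisions from IBF storage."""
        ...

    def send_to_ibf_api(self, forecast_data: dict, endpoint: str):
        """Upload one set of forecast data of one hazard event to the IBF API."""
        token = self._login()  # authenticate with credentials from Secrets
        response = requests.post(
            f"{self.settings.ibf_api_url}/{endpoint}",  # hypothetical base URL setting
            json=forecast_data,
            headers={"Authorization": f"Bearer {token}"},
            timeout=60,
        )
        response.raise_for_status()

    def _login(self) -> str:
        """Log in to the IBF API service and return an access token."""
        ...
```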

Class Pipeline

Pipeline is the base class of a data pipeline, describing how it runs. It manages the order of the processes, as well as how the processes' inputs and outputs are linked to each other. It calls the components mentioned above (Extract, Forecast, Load) to perform the analysis.

A main method pipeline.run_pipeline() should implement these requirements. At the same time, it should offer flexible control over each pipeline step through method arguments, as sketched below.
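
A minimal sketch, reusing the classes sketched elsewhere on this page; the boolean arguments are hypothetical but show how method arguments can give flexible control over each step.

```python
# Minimal sketch of a Pipeline class, reusing Extract, Forecast, Load and
# PipelineDataSet as sketched elsewhere on this page. The boolean arguments
# are hypothetical.


class Pipeline:
    def __init__(self, settings, secrets):
        self.data = PipelineDataSet()  # shared data model (see class Data)
        self.load = Load(settings, secrets)
        self.extract = Extract(settings, secrets, self.data)
        self.forecast = Forecast(settings, self.data)

    def run_pipeline(self, extract: bool = True, forecast: bool = True,
                     send: bool = True):
        """Run the pipeline steps in order; each step can be switched off."""
        if extract:
            self.extract.get_data()
        if forecast:
            self.forecast.compute_forecast()
        if send:
            self.load.send_to_ibf_api(self.data, endpoint="...")
```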

Class Scenarios

The Scenarios class models hazard scenarios by generating mock hazard indicator data for specified events (e.g., at trigger or alert levels). It uses the provided Pipeline to retrieve the necessary data, simulate hazard values, and apply these values to stations or administrative levels, based on configurable settings and scenario events. The list of hazard-specific scenarios should be discussed and agreed upon with the SW developers.

The class should cover the relevant hazard-specific scenarios in the IBF system (IBF portal).

Example: In the IBF river flood pipeline, scenarios include (a sketch of how these could be encoded follows the list):

  • nothing: no detected event

  • trigger-on-lead-time: triggered event on a specified lead time

  • trigger-after-lead-time: triggered event after a specified lead time

  • trigger-before-lead-time: triggered event before a specified lead time

  • trigger-multiple-on-lead-time: two or more stations triggered simultaneously on a specified lead time

  • alert: low/medium event detected (no specific lead time)

  • alert-multiple: multiple simultaneous low/medium events

  • trigger-and-alert: a triggered event and a low/medium event simultaneously

  • trigger-and-alert-multiple: a triggered event and multiple low/medium events simultaneously

  • trigger-multiple-and-alert-multiple: multiple triggered events and multiple low/medium events simultaneously
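
A minimal sketch of how such scenarios could be encoded; the event fields and station codes are hypothetical illustrations.

```python
# Minimal sketch of a Scenarios class. The scenario encoding, event fields
# and station codes are hypothetical illustrations.


class Scenarios:
    # scenario name -> list of mock events to inject into the pipeline
    SCENARIOS = {
        "nothing": [],
        "trigger-on-lead-time": [
            {"station": "G0001", "type": "trigger", "lead_time": 5},
        ],
        "trigger-and-alert": [
            {"station": "G0001", "type": "trigger", "lead_time": 5},
            {"station": "G0002", "type": "alert"},
        ],
    }

    def __init__(self, pipeline):
        self.pipeline = pipeline  # used to retrieve the real data to mock over

    def get_scenario_data(self, scenario: str):
        """Simulate hazard values for the requested scenario events."""
        for event in self.SCENARIOS[scenario]:
            ...  # set mock discharge values above/below the trigger threshold
```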

Class Data

Data is where the data model of the pipeline is defined. There are two levels of data in this class: DataUnit and DataSet. DataUnit is dedicated to the two standard kinds of data in an IBF data pipeline: administrative areas and hazard forecasts. On top of that, we define DataSet, i.e. a collection of data units, and PipelineDataSet, i.e. a collection of data sets. The class also includes methods that can be called to get or upsert data within a data unit during processing.

Example: For the flood pipeline, the base class is either AdminDataUnit (administrative divisions) or StationDataUnit (GloFAS β€œstations”, a.k.a. β€œreporting points”). For each, we then define a class for river discharge data, a class for forecast data (trigger yes/no, etc.), and a class for trigger thresholds, which inherit from the base class. PipelineDataSet wraps all data units through every step of the pipeline.
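
A minimal sketch of this data model using dataclasses; the field and method names are hypothetical illustrations based on the flood pipeline description above.

```python
# Minimal sketch of the flood pipeline data model described above; field and
# method names are hypothetical illustrations.
from dataclasses import dataclass, field


@dataclass
class StationDataUnit:
    """Base data unit keyed on a GloFAS station (reporting point)."""
    station_code: str


@dataclass
class RiverDischargeDataUnit(StationDataUnit):
    """River discharge ensemble forecast per station and lead time."""
    lead_time: int = 0
    discharge_ensemble: list = field(default_factory=list)


@dataclass
class DataSet:
    """Collection of data units of one type."""
    units: list = field(default_factory=list)

    def get_data_unit(self, station_code: str):
        """Get the data unit of one station."""
        return next(u for u in self.units if u.station_code == station_code)

    def upsert_data_unit(self, unit):
        """Insert the data unit, replacing any existing one for that station."""
        self.units = [u for u in self.units if u.station_code != unit.station_code]
        self.units.append(unit)


@dataclass
class PipelineDataSet:
    """Wraps all data sets through every step of the pipeline."""
    river_discharge: DataSet = field(default_factory=DataSet)
    forecast: DataSet = field(default_factory=DataSet)
    thresholds: DataSet = field(default_factory=DataSet)
```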

Classes Settings and Secrets

Settings and Secrets are responsible for loading the settings of a pipeline run from the configuration file (see Configuration section) and the customized secrets file (see Secrets section). Since IBF data pipelines share the same set of supporting resources, the Settings and Secrets classes can be reused across all pipelines.
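
A minimal sketch, assuming the configuration is parsed with PyYAML and the secrets are loaded from a .env file with python-dotenv; both libraries and the method names are assumptions, not the actual IBF implementation.

```python
# Minimal sketch of Settings and Secrets. The use of PyYAML and
# python-dotenv, and the method names, are hypothetical illustrations.
import os

import yaml                     # PyYAML
from dotenv import load_dotenv  # python-dotenv


class Settings:
    def __init__(self, config_path: str = "config/config.yaml"):
        with open(config_path) as f:
            self._config = yaml.safe_load(f)

    def get_setting(self, name: str):
        return self._config[name]


class Secrets:
    def __init__(self, env_path: str = ".env"):
        load_dotenv(env_path)  # load credentials into environment variables

    def get_secret(self, name: str) -> str:
        return os.environ[name]  # fail loudly if a required secret is missing
```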

Detailed configuration and secrets

The configuration is a data serialisation file (.yaml) located in the config directory. The file contains customized settings, such as country-specific settings (lead time, alert level definitions with thresholds and probabilities, administrative division levels of interest, etc.). It also specifies data source URLs, names of storage locations, etc. This configuration should be the only place in a data pipeline where customized settings are stored.
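
A hypothetical example of what such a config.yaml could contain; the keys and values below are illustrative only, not the actual schema.

```yaml
# Hypothetical example config.yaml; keys and values are illustrative only.
countries:
  - name: UGA
    lead-times: [0, 1, 2, 3, 4, 5, 6, 7]   # days
    trigger-probability: 0.6               # alert level definition
    admin-levels: [1, 2]
data-sources:
  glofas-url: https://example.org/glofas   # placeholder URL
storage:
  blob-container: ibf-pipeline-data        # hypothetical storage name
```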

Secrets consist of credentials for the supporting resources and the IBF portal. They can be stored as environment variables in a .env or similar file. Check the Secrets class to see which secrets file formats are supported. The secrets file should never be included in public commits.

Contact the IBF SW developers or the AA data specialist to obtain the necessary secrets for development and testing.