Process for ingesting prediction inputs in Semaphore: Overview and Pitfalls - conrad-blucher-institute/semaphore GitHub Wiki

What is input ingestion?

A model run in Semaphore consists of three steps: obtaining and formatting the model's input parameters, invoking TensorFlow (or another engine) with a given weight file (e.g., an h5 file for TensorFlow) and the formatted inputs, and recording the output of the run. Most models require some inputs to be prediction data. Each prediction has a reference time (or prediction time), a lead time, and a verification time.

  • Reference time / prediction time: the date and time at which the prediction is made.
  • Lead time: the interval between the reference time and the time for which the prediction is made. For example, a lead time of 6 hours means that the prediction is for 6 hours after the reference time.
  • Verification time: the time for which the prediction is made. It is so called because it is the time at which we can verify the prediction (when we reach this time, we can take a measurement and check whether the prediction was correct).

If we know any two of these time variables, we can compute the third: Verification Time = Reference Time + Lead Time.
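
As a minimal sketch of this relation, the Python snippet below computes one time variable from the other two using the standard library. The date and the 6 hour lead time are illustrative values chosen for the example, not taken from a real Semaphore run, and the variable names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Illustrative values only; they do not come from a real Semaphore run.
reference_time = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
lead_time = timedelta(hours=6)

# Verification Time = Reference Time + Lead Time
verification_time = reference_time + lead_time

# Either of the other two follows by rearranging the same relation.
recovered_lead_time = verification_time - reference_time
recovered_reference_time = verification_time - lead_time

print(verification_time.isoformat())  # 2024-01-01T18:00:00+00:00
```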

How does it work?

Semaphore gets most of its prediction input data from external data sources such as LightHouse or NDFD. Some models use predictions from other models, which are considered outputs in Semaphore, but for now we will keep this simple. Each model specifies the inputs it needs in its associated dspec configuration file. This involves specifying the location for the data, the data source, the variable type (i.e., the data series, such as predicted air temperature or wind speed measurement), and the date and time range and interval needed (for example, predictions every hour from 1 to 12 hours lead time). The configuration in the dspec file looks something like this:

{
    "_name": "Predicted Air Temp",
    "location": "SBirdIsland",
    "source": "NDFD_API",
    "series": "pAirTemp",
    "unit": "celsius",
    "interval": 3600,
    "range": [ 12, 1 ],
    "outKey": "south-bird-island_predicted_air-temp_12",
    "dataIntegrityCall": {
        "call": "PandasInterpolation",
        "args": {
            "limit": "21600",
            "method": "linear",
            "limit_area": "inside"
        }
    }
}
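
To make the "interval" and "range" fields concrete, here is a hedged sketch of how an entry like the one above could be expanded into the verification times it asks for. It assumes that "interval" is in seconds and that the two "range" values are inclusive counts of interval steps from the reference time, so that [ 12, 1 ] with an interval of 3600 means hourly predictions from 1 to 12 hours lead time. That reading matches the example in the text but is an assumption, and the function name `expected_verification_times` is hypothetical, not part of Semaphore.

```python
from datetime import datetime, timedelta, timezone

# Only the two fields used here, copied from the dspec entry above.
spec = {"interval": 3600, "range": [12, 1]}


def expected_verification_times(spec, reference_time):
    """Expand interval/range into a list of verification times.

    Assumption (not confirmed by the Semaphore source): "range" holds
    inclusive step counts and "interval" is the step size in seconds.
    """
    step = timedelta(seconds=spec["interval"])
    first, last = sorted(spec["range"])  # accept either ordering
    return [reference_time + n * step for n in range(first, last + 1)]


# Illustrative reference time, not taken from a real run.
reference_time = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
times = expected_verification_times(spec, reference_time)

for t in times:
    print(t.isoformat())
```

Under these assumptions the entry requests 12 values, one per hour, with the first an hour after the reference time and the last 12 hours after it.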

The process by which this configuration is used to get and ingest data is somewhat complicated but goes something like this:

When are calls to external sources triggered and what happens to the results?

Potential Issues