DSPEC - conrad-blucher-institute/semaphore GitHub Wiki

DSPEC Guide

DSPEC stands for Data SPECification. It fully describes the data pipeline for a given model. Any information about a model (excluding architecture, weights, and structure) should be contained within that model's DSPEC. Semaphore is designed to ingest DSPECs to run a model. Over the course of Semaphore's development there has been more than one specification. Because these specifications are backwards compatible, only the latest is detailed here.

DSPEC structure

A DSPEC is a JSON file divided into the following sections:

Metadata
Timing Info
Output Info
Dependent Series
Post Process Call
Vector Order

Metadata

The metadata section is at the beginning of a DSPEC. It is used to detail information about the file itself, what the model is and where to find the H5 file.

dspecVersion: The version of the DSPEC format this file uses. Semaphore uses it to select the correct parser for the file.
modelName: The name of the model this DSPEC is about.
modelVersion: The version of this model, in major.minor.bug format.
author: This is not guaranteed to be the author of the model, but it is the point of contact for the model at the time of the DSPEC's creation.
modelFileName: This is the path to the H5 file.
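
As a minimal sketch of the fields above, the following Python snippet checks a metadata section for missing keys. The field names come from this guide; the example values and the assumption that all five fields are required are illustrative only, and `missing_metadata_keys` is a hypothetical helper, not part of Semaphore.

```python
# Field names are taken from the list above; treating all of them as
# required is an assumption made for this sketch.
METADATA_KEYS = ("dspecVersion", "modelName", "modelVersion", "author", "modelFileName")


def missing_metadata_keys(metadata: dict) -> list:
    """Return the documented metadata keys that are absent, in list order."""
    return [key for key in METADATA_KEYS if key not in metadata]


# All values below are invented for illustration.
example_metadata = {
    "dspecVersion": "2.0",
    "modelName": "exampleModel",
    "modelVersion": "1.0.0",
    "author": "Jane Doe",
    "modelFileName": "./models/exampleModel.H5",
}

print(missing_metadata_keys(example_metadata))  # prints []
```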

Timing Info

Timing info is used by the scheduler to determine when/if the model should be run.

active: (bool) A kill switch for the model. It tells the scheduler whether or not to schedule it.
offset: (int) An offset in seconds from the interval at which to run the model. For example, it can be used to run a model 20 minutes past the hour. This gives dependent APIs time to update before Semaphore calls on them.
interval: (int) An interval in seconds to determine how often to run the model. (Ex. 3600 would mean to run the model every hour)
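
One way to read interval and offset together is sketched below. It assumes run times fall on multiples of interval seconds since the Unix epoch, shifted by offset; the actual scheduler may align runs differently, and `next_run_time` is a hypothetical helper.

```python
from datetime import datetime, timezone


def next_run_time(now: datetime, interval: int, offset: int) -> datetime:
    """Return the first run time strictly after `now`.

    Assumption: runs occur at (k * interval + offset) seconds since the
    Unix epoch for integer k. `now` must be timezone-aware.
    """
    now_s = int(now.timestamp())
    # Most recent scheduled run at or before `now`.
    last_run = ((now_s - offset) // interval) * interval + offset
    return datetime.fromtimestamp(last_run + interval, tz=timezone.utc)


# Hourly model that runs 20 minutes past the hour (offset = 1200 seconds).
now = datetime(2024, 1, 1, 12, 10, tzinfo=timezone.utc)
print(next_run_time(now, interval=3600, offset=1200))  # 2024-01-01 12:20:00+00:00
```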

Output Info

This is used by Semaphore to handle the output of a given model.

outputMethod: The key of an Output Handler class that will post-process and package a prediction.
leadtime: The lead time in seconds for the model's prediction. (NOTE: This is also used by the scheduler to choose which model is run first, prioritizing models with longer lead times.)
series: A name for the prediction. (Ex. pWaterTmp) (NOTE: Series should start with p if they are predictions or d if they are actuals.)
location: A location keyword for this prediction.
unit: The unit this prediction is stored in. It should always be metric.
datum: (optional) The datum of the data. Only required for data that has a datum. (NA is used otherwise)
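
The note on leadtime says models with longer lead times are prioritized. A rough sketch of that ordering, assuming each model's output info is available as a dictionary, could look like this. Only pWaterTmp appears in this guide; the other series names and all lead times are made up, and `order_by_leadtime` is a hypothetical helper.

```python
def order_by_leadtime(output_infos: list) -> list:
    """Sort output info dictionaries so the longest lead time comes first."""
    return sorted(output_infos, key=lambda info: info["leadtime"], reverse=True)


# Illustrative output info entries; values are invented.
output_infos = [
    {"series": "pWaterTmp", "leadtime": 3600},
    {"series": "pWaterLvl", "leadtime": 86400},
    {"series": "pAirTmp", "leadtime": 43200},
]

print([info["series"] for info in order_by_leadtime(output_infos)])
# prints ['pWaterLvl', 'pAirTmp', 'pWaterTmp']
```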

Dependent Series

The dependent series section is an array with one entry per input series. It tells Semaphore what raw data needs to be found and where to find it.

_name: (optional) (metadata) A name for the series. It only makes the DSPEC easier to read; it is not used for anything.
location: A location keyword for this series.
source: The data source keyword for the data ingestion source.
series: The name of the series to be ingested. (NOTE: Series should start with p if they are predictions or d if they are actuals.)
unit: The unit this series is stored in. It should always be metric.
datum: (optional) The datum of the data. Only required for data that has a datum. (NA is used otherwise)
interval: (int) The time in seconds between each data point within the requested time range. (Ex. hourly data = 3600, six-minute data = 360)
range: [int, int] The range determines the time window for which data is requested. When Semaphore is called, it takes the reference time, interval, and range, builds a toTime and a fromTime, and requests the data between them:
- toTime = refTime + (range[0] * interval)
- fromTime = refTime + (range[1] * interval)
outKey: This is a key for this series. It will be used by post processing and vector order to select this data.
dataIntegrityCall: (optional) The information required to invoke a data integrity class. Any rules for data cleaning are written here. (See the Data Integrity Call section below)
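
The toTime and fromTime formulas above can be sketched in Python as follows. The formulas are taken from this guide; the function name `request_window` and the example range value are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def request_window(ref_time: datetime, interval: int, range_: list) -> tuple:
    """Return (from_time, to_time) using the formulas from this guide:

    toTime   = refTime + (range[0] * interval)
    fromTime = refTime + (range[1] * interval)
    """
    to_time = ref_time + timedelta(seconds=range_[0] * interval)
    from_time = ref_time + timedelta(seconds=range_[1] * interval)
    return from_time, to_time


# Hourly data, with an invented range of [0, -24] (the previous 24 hours).
ref_time = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
from_time, to_time = request_window(ref_time, interval=3600, range_=[0, -24])
print(from_time)  # 2024-01-01 12:00:00+00:00
print(to_time)    # 2024-01-02 12:00:00+00:00
```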

Post Process Call (optional)

A post process call invokes a post process class, which takes dependent series, alters them, and generates new series. Post processing is optional, but the section is still required in the DSPEC; if unused, it is an empty list. The required information depends on which post process class is invoked, but a call generally looks like this:

call: The keyword reference for a post process class. (This is the name of the class file)
args: Arguments to pass to the class.
- <--- (Key : value) Key value pairs of argument names and values. (Ex. "offst": -20)
- <--- (key_name : dependantSeries_outKey) Key value pairs mapping data as inputs to the class. (Ex. "targetDirection_inKey":"VK_WDIR_25")
- <--- (key_name : outKey) Key value pairs assigning a key to the result of the post process class. The class won't edit the input data; instead it always exports the altered data as its own series. Here you give that output its own key so it can be referenced by other post process classes or in vectorOrder. (Ex. "x_comp_outKey": "x_VK_WNDCMP_25")
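
To illustrate the inKey/outKey convention described above, here is a minimal sketch of a made-up post process step. The class keyword ExampleScale, its arguments, the series keys, and the in-memory `series_store` dictionary are all hypothetical; real post process classes define their own arguments.

```python
# Hypothetical post process call, shaped like the description above.
call_spec = {
    "call": "ExampleScale",                # invented class keyword
    "args": {
        "factor": 2.0,                     # plain argument
        "source_inKey": "WL_01",           # maps an existing series as input
        "scaled_outKey": "scaled_WL_01",   # key for the newly generated series
    },
}


def run_example_scale(series_store: dict, args: dict) -> dict:
    """Return a new store with the scaled series added under its outKey.

    The input series is left untouched, mirroring the rule that post
    processing exports altered data as its own series.
    """
    scaled = [value * args["factor"] for value in series_store[args["source_inKey"]]]
    new_store = dict(series_store)
    new_store[args["scaled_outKey"]] = scaled
    return new_store


series_store = {"WL_01": [1.0, 1.5, 2.0]}
result = run_example_scale(series_store, call_spec["args"])
print(result["scaled_WL_01"])  # prints [2.0, 3.0, 4.0]
print(result["WL_01"])         # prints [1.0, 1.5, 2.0]
```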

Vector Order

The vector order section will construct the input vector to be fed into the model. This means referencing dependent or post processed data, casting that data to the correct type, and indexing the right amount of that data.

key: The out key of the dependent series data or post process data.
dType: The data type to cast the data to. Before casting, the data is always a string. (Ex. float)
indexes: (optional) [int, int] Selects a range of data from the series. If not provided, the whole series is appended to the input vector.
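
The vector order steps (look up by key, index, cast, append) can be sketched as below. This assumes indexes behaves like a half-open Python slice and that dType names map to Python types as shown; neither detail is confirmed by this guide, and `build_input_vector`, `DTYPES`, and the series keys are hypothetical.

```python
# Assumed mapping from dType names to Python casts.
DTYPES = {"float": float, "int": int}


def build_input_vector(vector_order: list, series_store: dict) -> list:
    """Build a flat input vector from series stored as lists of strings."""
    vector = []
    for entry in vector_order:
        values = series_store[entry["key"]]
        if "indexes" in entry:
            start, end = entry["indexes"]
            values = values[start:end]  # assumption: half-open slice
        cast = DTYPES[entry["dType"]]
        vector.extend(cast(value) for value in values)
    return vector


# Invented keys and values; data arrives as strings before casting.
series_store = {"WL_01": ["1.5", "2.5", "3.5"], "AT_01": ["4"]}
vector_order = [
    {"key": "WL_01", "dType": "float", "indexes": [0, 2]},
    {"key": "AT_01", "dType": "float"},
]
print(build_input_vector(vector_order, series_store))  # prints [1.5, 2.5, 4.0]
```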

Data Integrity Call

A data integrity call chooses which data integrity class to invoke and supplies any arguments it needs. The required information depends on which data integrity class is invoked, but a call generally looks like this:

call: The keyword reference for a data integrity class. (This is the name of the class file)
args: Arguments to pass to the class.
- <--- (Key : value) Key value pairs of argument names and values. (Ex. "method": "linear")
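
This guide does not define what "method": "linear" does. One plausible reading is linear interpolation over missing values, sketched below purely as an illustration. `fill_gaps_linear` is a hypothetical function, and representing missing points as None is an assumption.

```python
def fill_gaps_linear(values: list) -> list:
    """Fill interior None gaps by linear interpolation between known neighbors.

    Leading and trailing gaps are left as None because they have only one
    known neighbor.
    """
    filled = list(values)
    known = [i for i, value in enumerate(filled) if value is not None]
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            fraction = (i - left) / span
            filled[i] = filled[left] + fraction * (filled[right] - filled[left])
    return filled


print(fill_gaps_linear([1.0, None, 3.0]))  # prints [1.0, 2.0, 3.0]
```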