Grub Overview - conrad-blucher-institute/semaphore GitHub Wiki
The name Grub is a play on words between 'grab' and 'grib' (the NWS data file format). Grub was first created as a set of utility classes for working with grib2 files in general, such as downloading them and extracting data from them.
Grub is a component of the Semaphore platform. Its purpose is to insulate Semaphore Core from the unreliability and variability of external data sources such as NDFD, Tides & Currents, and Lighthouse, among others.
# What problem does Grub address?
Semaphore Core was designed and implemented to run AI models, and both its design and its implementation focus on the following:
- Gather and organize input parameters for consumption by a model
- Execute the model
- Store the generated prediction
Two implications stem from that original focus. First, we implicitly assumed that reliable sources of input data would be available on demand. Second, we designed Semaphore to focus on a single model at a time, independently of any other model it is running.
These two implications have led to two significant issues in Semaphore:
- Data sources vary widely in when and how they make data available, and in what format. In addition, some data sources are not very reliable, which requires hacks and workarounds in Semaphore.
- Semaphore does not look at input data requirements across models, so it makes repeated calls to data sources to get the same data, thereby increasing the load on some of the external data source APIs it uses (e.g., the NDFD API) and making them less stable.
Addressing these issues properly within Semaphore would take away resources that could be dedicated to supporting more complex predictions, and it would significantly increase the complexity of Semaphore, making it harder to onboard new students and maintain the platform.
The goal of Grub is to address these 2 issues by doing the following things:
- Insulate users from the complexity of getting and manipulating data and updates from various data sources by (1) keeping the logic of when and how to obtain that data internal to Grub and (2) presenting the latest available data in a standardized format regardless of the data source.
- Adopt a data source-centric view of the data instead of a model-centric view of the data.
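The difference between the two views can be illustrated with a small sketch. The model names, series names, and the `MODEL_INPUTS` mapping below are hypothetical and only serve the illustration; they are not taken from Semaphore or Grub. The sketch counts how many requests would be made if each model fetched its own inputs, compared with fetching each distinct source and series once.

```python
from collections import Counter

# Hypothetical input requirements: model name -> list of (source, series) pairs.
MODEL_INPUTS = {
    "model_a": [("NDFD", "wind_speed"), ("TidesAndCurrents", "water_level")],
    "model_b": [("NDFD", "wind_speed"), ("NDFD", "air_temperature")],
    "model_c": [("NDFD", "wind_speed"), ("TidesAndCurrents", "water_level")],
}


def model_centric_requests(model_inputs):
    """One request per model per input, so shared inputs are fetched repeatedly."""
    return [req for inputs in model_inputs.values() for req in inputs]


def source_centric_requests(model_inputs):
    """One request per distinct (source, series) pair, shared by all models."""
    return sorted({req for inputs in model_inputs.values() for req in inputs})


per_model = model_centric_requests(MODEL_INPUTS)
per_source = source_centric_requests(MODEL_INPUTS)

print(len(per_model), "requests in the model-centric view")
print(len(per_source), "requests in the data source-centric view")
# Requests per source in the model-centric view.
print(Counter(source for source, _ in per_model))
```

In this made-up configuration the same NDFD wind series is requested once per model in the model-centric view, whereas the data source-centric view requests it a single time and shares the result.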
Grub's Responsibility
- Provide a standardized and stable public interface to get weather data from various data sources
- Hide the difficulties and complexity of getting data from these data sources, such as publishing schedules, unavailable lead times, variability of formats, etc.
- Make it simple to follow best practices for using weather data. For example, always present prediction data for various lead times as a unit identified by reference time.
- Always provide the latest available prediction data. Grub is not responsible for keeping historical prediction data, but it should always present the most recent prediction data available. This simplifies both Grub's design and its public interface, and guarantees to Grub users that the data is always as fresh as possible.
- (Possibly - future) Support getting predictions for a location regardless of data source (so in this case, Grub would get the most recent predictions for a location across all available data sources).
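As a rough sketch of two of the responsibilities above (presenting predictions as a unit identified by reference time, and keeping only the latest predictions), the following assumes hypothetical names `PredictionSet` and `LatestPredictionStore`. It is not Grub's actual interface, only one possible shape for it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class PredictionSet:
    """All lead times of one prediction run, identified by its reference time."""
    source: str
    series: str
    reference_time: datetime
    values: dict  # lead time (timedelta) -> predicted value


class LatestPredictionStore:
    """Keeps only the most recent PredictionSet per (source, series)."""

    def __init__(self):
        self._latest = {}

    def publish(self, prediction):
        """Store the prediction if it is newer; return True when it was stored."""
        key = (prediction.source, prediction.series)
        current = self._latest.get(key)
        if current is None or prediction.reference_time > current.reference_time:
            self._latest[key] = prediction
            return True
        return False  # older or same run: no history is kept

    def get_latest(self, source, series):
        """Return the latest PredictionSet, or None if nothing is available."""
        return self._latest.get((source, series))


store = LatestPredictionStore()
t0 = datetime(2024, 1, 1, 0, tzinfo=timezone.utc)
old = PredictionSet("NDFD", "wind_speed", t0,
                    {timedelta(hours=1): 5.0, timedelta(hours=2): 6.5})
new = PredictionSet("NDFD", "wind_speed", t0 + timedelta(hours=6),
                    {timedelta(hours=1): 4.0})

store.publish(old)
store.publish(new)
replayed = store.publish(old)  # an older run does not replace the newer one

latest = store.get_latest("NDFD", "wind_speed")
print(latest.reference_time.isoformat(), replayed)
```

Because the store never holds more than one run per series, callers always receive the freshest data, and anything that needs history has to keep it elsewhere.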
High Level Design
Future enhancement
In the future, Grub could make it possible for systems like Semaphore to trigger model runs based on the availability of new data instead of on a schedule. In this scenario, scheduling would be left entirely to Grub, which would use it to refresh its data.
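One way such data-driven triggering might look is a simple subscribe-and-notify arrangement. Everything in this sketch is an assumption made for illustration, including the `DataAvailabilityNotifier` class and its method names; no such interface is described for Grub today.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


class DataAvailabilityNotifier:
    """Calls subscribers when a data source publishes a newer reference time."""

    def __init__(self):
        self._subscribers = defaultdict(list)
        self._last_seen = {}

    def subscribe(self, source, callback):
        """Register callback(source, reference_time) for one data source."""
        self._subscribers[source].append(callback)

    def data_refreshed(self, source, reference_time):
        """Notify subscribers if the data is new; return how many were notified."""
        previous = self._last_seen.get(source)
        if previous is not None and reference_time <= previous:
            return 0  # nothing newer than what subscribers already saw
        self._last_seen[source] = reference_time
        for callback in self._subscribers[source]:
            callback(source, reference_time)
        return len(self._subscribers[source])


# Hypothetical model runs are recorded in a list instead of executing a model.
runs = []
notifier = DataAvailabilityNotifier()
notifier.subscribe("NDFD", lambda source, ref: runs.append(("model_a", source, ref)))
notifier.subscribe("NDFD", lambda source, ref: runs.append(("model_b", source, ref)))

t0 = datetime(2024, 1, 1, 0, tzinfo=timezone.utc)
notifier.data_refreshed("NDFD", t0)                       # new data: both models run
notifier.data_refreshed("NDFD", t0)                       # same data: no runs
notifier.data_refreshed("NDFD", t0 + timedelta(hours=6))  # newer data: both run again
print(len(runs), "model runs triggered")
```

Here a refresh that brings no newer reference time triggers nothing, so models only run when there is genuinely new input to consume.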