Pulsar Machine Learning
From Manning's MEAP Apache Pulsar in Action
ML Requirements
Our main focus lies on regression, on forecasting as a special case of regression with time-lagged features, and on anomaly detection. Furthermore, we have to deal with a high number of individual data points. Imagine a building management system with many temperature, humidity and light sensors:
In this case model inference has to run independently for each data point (anomaly detection) or for small groups of data points (multivariate regression). Regardless of whether our goal is a forecast or anomaly detection, we need a couple of sensor readings, a short-term history, to do it.
While Pulsar is state aware and supports windowing, Pulsar functions operate on a topic. Identifying each data point, for example a temperature sensor, with its own topic would result in thousands of topics. While Pulsar supports this, it is inefficient. Furthermore, sensors might be installed or uninstalled, and we don't want to add or remove topics whenever that happens.
Hence we need to isolate data contribution, i.e. feature engineering, from inference.
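The point above can be sketched in a few lines: instead of one Pulsar topic per sensor, a single function can keep a short per-sensor history keyed by a sensor ID carried in the message. The class below is a minimal, broker-free sketch; the window length of 20 and the key format are assumptions, not part of the original design.

```python
from collections import defaultdict, deque

class SensorHistory:
    """Keep the last N readings per sensor arriving on ONE shared topic,
    instead of dedicating a Pulsar topic to every sensor."""

    def __init__(self, length=20):  # assumed short-term history length
        self._buf = defaultdict(lambda: deque(maxlen=length))

    def add(self, sensor_id, value):
        # sensors appearing or disappearing need no topic changes:
        # an unknown sensor_id simply creates a new bounded buffer
        self._buf[sensor_id].append(value)

    def ready(self, sensor_id):
        # enough readings collected to run inference for this sensor?
        buf = self._buf[sensor_id]
        return len(buf) == buf.maxlen

    def features(self, sensor_id):
        return list(self._buf[sensor_id])
```

A new sensor only costs one bounded buffer in memory, which is why install/uninstall churn stays cheap compared to creating and deleting topics.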
Isolating feature production and consumption: Feature stores
According to this article, feature stores are repositories that allow teams to share, discover, and use a highly curated set of features for their machine learning problems. They isolate data preparation from modeling, training and inference.
A list of feature stores has been compiled here, an example built on Cassandra is described here. A longer blog about what feature stores are good for can be found here
The following figure shows how consumers, mostly model training and inference, and data producers interact with the feature store:
from Eugene Yan's article Feature Stores - A Hierarchy of Needs
Feast as feature store
Since Cassandra is notoriously hard to manage in a production environment, we looked for alternative columnar stores to implement a feature store. The following article Building a Gigascale ML Feature Store with Redis, Binary Serialization, String Hashing, and Compression makes a strong argument for Redis.
I also looked for integration with Kubeflow for model training and for OpenShift support, and got attracted by Feast, which is backed by Redis.
This figure taken from the Feast web page summarizes the functionality
The Feast online feature store will be filled by Pulsar functions. Data can be passed through Deequ to check for expected data types, non-NULL values and similar constraints to ensure data quality.
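As a rough sketch of that ingestion path: a Deequ-style check (re-expressed here in plain Python, since Deequ itself is a Spark library) gates rows before they are written to the Feast online store. The field names, value bounds, and the `sensor_stats` feature view are assumptions for illustration, not part of an actual repo.

```python
def check_quality(rows):
    """Deequ-style sanity checks before features reach the online store:
    expected fields present, no NULL readings, plausible value range.
    Field names and bounds are assumptions for this sketch."""
    for row in rows:
        assert {"sensor_id", "event_timestamp", "temperature"} <= row.keys()
        assert row["temperature"] is not None, "NULL temperature reading"
        assert -40.0 <= row["temperature"] <= 85.0, "reading out of range"
    return rows

def write_features(rows, repo_path="."):
    # Hypothetical wiring: assumes a Feast repo under repo_path with a
    # registered 'sensor_stats' feature view. A Pulsar function would call
    # this with the rows decoded from its input topic.
    import pandas as pd
    from feast import FeatureStore

    store = FeatureStore(repo_path=repo_path)
    store.write_to_online_store("sensor_stats", pd.DataFrame(check_quality(rows)))
```

Rejecting bad rows before the store (rather than at inference time) keeps every downstream consumer of the feature view on the same, already-validated data.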
Pulsar producer functions operate on the feature vectors taken from Feast to
- run model inference and
- fill the historic feature store
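The inference side might look like the sketch below: fetch the feature vector from the Feast online store, then score it. The `sensor_stats` feature view, the `recent_readings` feature and the z-score model are illustrative assumptions; only the `get_online_features` call shape follows the Feast Python SDK.

```python
def anomaly_score(history, latest):
    """z-score of the latest reading against the short-term history
    (a deliberately simple stand-in for a real model)."""
    mean = sum(history) / len(history)
    var = sum((x - mean) ** 2 for x in history) / len(history)
    std = var ** 0.5 or 1.0  # guard against a constant history
    return abs(latest - mean) / std

def score_from_feast(sensor_id, latest, repo_path="."):
    # Hypothetical: assumes a 'sensor_stats' feature view exposing a
    # 'recent_readings' feature that holds the short-term history.
    from feast import FeatureStore

    store = FeatureStore(repo_path=repo_path)
    resp = store.get_online_features(
        features=["sensor_stats:recent_readings"],
        entity_rows=[{"sensor_id": sensor_id}],
    ).to_dict()
    return anomaly_score(resp["recent_readings"][0], latest)
```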
ML without feature store - sliding windows
Sliding windows provide an alternative to feature stores with a one-to-one relationship between input/output topics and data sources. The Hands-On section Pulsar ML with sliding windows describes an example.
A very good explanation of sliding windows is found in the blog entry Event Processing Design Patterns with Pulsar Functions.
The window is filled until the specified condition is met, for example when the function has received #window-length-count events, and then the process() method gets control. This allows the function to collect enough data before computing an anomaly score over the entire window of time-series data. The result of the anomaly scoring, a window of time-series data matching the length of the input window, is returned to the output topic.
Note that a merging step is required to reassemble the anomaly score segments into a proper time series.
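The two steps just described, per-window scoring in process() and downstream merging, can be sketched without the Pulsar runtime. This is a minimal illustration: the z-score model is a placeholder, and the merge assumes non-overlapping, in-order windows (the tumbling case); overlapping windows would additionally need de-duplication or averaging.

```python
def window_scores(window):
    """What process() could compute once the window condition is met:
    one anomaly score per element, as distance from the window mean
    in units of the window's standard deviation (placeholder model)."""
    mean = sum(window) / len(window)
    std = (sum((x - mean) ** 2 for x in window) / len(window)) ** 0.5 or 1.0
    return [abs(x - mean) / std for x in window]

def merge_segments(segments):
    """Reassemble the per-window score segments from the output topic
    into one time series; assumes non-overlapping, in-order windows."""
    return [score for segment in segments for score in segment]
```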
Online Learning
For a smaller number of input topics, online learning provides an alternative. Packages like Scikit-Multiflow and its successor River-ML focus on online classification, regression and forecasting*, while Yahoo's Vowpal Wabbit has been designed for reinforcement learning.
*Online forecasting with ARIMA models became popular in the last 10 years. See Online ARIMA Algorithms for Time Series Prediction and Online Forecasting and Anomaly Detection Based on the ARIMA Model
For more background see also Machine Learning for Data Streams with Practical Examples in MOA
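To make the online-learning idea concrete, here is a dependency-free sketch of the learn-one/predict-one loop that packages like River and Scikit-Multiflow implement (their real APIs are richer; the class below is a plain SGD linear regression written only for illustration).

```python
class OnlineLinearRegression:
    """Minimal learn_one/predict_one loop in the style of River:
    the model is updated one event at a time, so no training set
    needs to be stored alongside the stream."""

    def __init__(self, n_features, lr=0.01):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict_one(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x)) + self.b

    def learn_one(self, x, y):
        # stochastic gradient step on the squared error of this one event
        err = self.predict_one(x) - y
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * err
```

Because each event is consumed and discarded, a single Pulsar function per topic can train and serve the model at once, which is what keeps the topic count low in this setup.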
Pangio-com provided a couple of Pulsar functions, one of them with an online classification example; look for "Naive Bayes" in their GitHub repo.
I've tried Pangio-com's Naive Bayes example and finally got it working; details are found on the Hands-On page here.
Online learning easily allows us to get by with a limited number of topics.
Here is how online and offline learning with Pulsar Functions might come together:
ToDo: figure to explain how to save ~100 events in the context for anomaly detection. In this scenario we'd turn the context into a feature store for unsupervised, non-trainable learning.
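In the absence of that figure, the scenario can at least be sketched in code: a Pulsar Python function that keeps the last ~100 events per sensor in its context state, effectively using the context as a small feature store for unsupervised, non-trainable anomaly scoring. The `ContextStub` stands in for the real function context's put_state/get_state so the sketch runs without a broker; the message layout and the z-score model are assumptions.

```python
import json

WINDOW = 100  # ~100 events kept per sensor in function state

class ContextStub:
    """Stands in for the Pulsar function context's state API
    (put_state/get_state) so the sketch runs without a broker."""

    def __init__(self):
        self._state = {}

    def put_state(self, key, value):
        self._state[key] = value

    def get_state(self, key):
        return self._state.get(key)

def process(input_msg, context):
    """Append the reading to the per-sensor history held in state and
    emit an anomaly score once enough events have been collected."""
    event = json.loads(input_msg)  # assumed {"sensor_id": ..., "value": ...}
    key = "hist-" + event["sensor_id"]
    hist = json.loads(context.get_state(key) or "[]")
    hist = (hist + [event["value"]])[-WINDOW:]
    context.put_state(key, json.dumps(hist))
    if len(hist) < WINDOW:
        return None  # still filling the window
    mean = sum(hist) / len(hist)
    std = (sum((x - mean) ** 2 for x in hist) / len(hist)) ** 0.5 or 1.0
    return abs(event["value"] - mean) / std
```

Nothing is ever trained here: the state only holds the short-term history, which is exactly the "context as feature store" idea the ToDo above refers to.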