Pulsar Machine Learning
From Manning's MEAP Apache Pulsar in Action
ML Requirements
Our main focus lies on regression, on forecasting as a special case of regression with time-lagged features, and on anomaly detection. Furthermore, we have to deal with a high number of individual data points. Imagine a building management system with many temperature, humidity and light sensors:
In this case model inference has to run independently for each data point (anomaly detection) or for small groups of data points (multivariate regression). Regardless of whether our goal is a forecast or anomaly detection, we need a couple of sensor readings, a short-term history, to do it.
While Pulsar is state aware and supports windowing, Pulsar functions operate on a topic. Identifying each data point, for example a temperature sensor, with its own topic would result in thousands of topics. While Pulsar supports this, it is inefficient. Furthermore, sensors might be installed or uninstalled, and we don't want to add or remove topics whenever that happens.
Hence we need to isolate data contribution, i.e. feature engineering, from inference.
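The point above can be sketched in a few lines: instead of one Pulsar topic per sensor, a single function can keep a short per-sensor history keyed by a sensor ID carried in the message. The class below is a minimal, broker-free sketch; the window length of 20 and the key format are assumptions, not part of the original design.

```python
from collections import defaultdict, deque

class SensorHistory:
    """Keep the last N readings per sensor arriving on ONE shared topic,
    instead of dedicating a Pulsar topic to every sensor."""

    def __init__(self, length=20):  # assumed short-term history length
        self._buf = defaultdict(lambda: deque(maxlen=length))

    def add(self, sensor_id, value):
        # sensors appearing or disappearing need no topic changes:
        # an unknown sensor_id simply creates a new bounded buffer
        self._buf[sensor_id].append(value)

    def ready(self, sensor_id):
        # enough readings collected to run inference for this sensor?
        buf = self._buf[sensor_id]
        return len(buf) == buf.maxlen

    def features(self, sensor_id):
        return list(self._buf[sensor_id])
```

A new sensor only costs one bounded buffer in memory, which is why install/uninstall churn stays cheap compared to creating and deleting topics.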
Isolating feature production and consumption: Feature stores
According to this article, feature stores are repositories that allow teams to share, discover, and use a highly curated set of features for their machine learning problems. They isolate data preparation from modeling, training and inference.
A list of feature stores has been compiled here, an example built on Cassandra is described here. A longer blog about what feature stores are good for can be found here
The following figure shows how consumers, mostly model training and inference, and data producers interact with the feature store:
from Eugene Yan's article Feature Stores - A Hierarchy of Needs
Feast as feature store
Since Cassandra is notoriously hard to manage in a production environment, we looked for alternative columnar stores to implement a feature store. The following article Building a Gigascale ML Feature Store with Redis, Binary Serialization, String Hashing, and Compression makes a strong argument for Redis.
I also looked for integration with Kubeflow for model training and for OpenShift support, and got attracted by Feast, which is backed by Redis.
This figure taken from the Feast web page summarizes the functionality
The Feast online feature store will be filled by Pulsar functions. Data can be passed through Deequ to check for expected data types, non-NULL values and similar constraints to ensure data quality.
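As a rough sketch of that ingestion path: a Deequ-style check (re-expressed here in plain Python, since Deequ itself is a Spark library) gates rows before they are written to the Feast online store. The field names, value bounds, and the `sensor_stats` feature view are assumptions for illustration, not part of an actual repo.

```python
def check_quality(rows):
    """Deequ-style sanity checks before features reach the online store:
    expected fields present, no NULL readings, plausible value range.
    Field names and bounds are assumptions for this sketch."""
    for row in rows:
        assert {"sensor_id", "event_timestamp", "temperature"} <= row.keys()
        assert row["temperature"] is not None, "NULL temperature reading"
        assert -40.0 <= row["temperature"] <= 85.0, "reading out of range"
    return rows

def write_features(rows, repo_path="."):
    # Hypothetical wiring: assumes a Feast repo under repo_path with a
    # registered 'sensor_stats' feature view. A Pulsar function would call
    # this with the rows decoded from its input topic.
    import pandas as pd
    from feast import FeatureStore

    store = FeatureStore(repo_path=repo_path)
    store.write_to_online_store("sensor_stats", pd.DataFrame(check_quality(rows)))
```

Rejecting bad rows before the store (rather than at inference time) keeps every downstream consumer of the feature view on the same, already-validated data.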
Pulsar producer functions operate on the feature vectors taken from Feast to
- run model inference and
- fill the historic feature store
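The inference side might look like the sketch below: fetch the feature vector from the Feast online store, then score it. The `sensor_stats` feature view, the `recent_readings` feature and the z-score model are illustrative assumptions; only the `get_online_features` call shape follows the Feast Python SDK.

```python
def anomaly_score(history, latest):
    """z-score of the latest reading against the short-term history
    (a deliberately simple stand-in for a real model)."""
    mean = sum(history) / len(history)
    var = sum((x - mean) ** 2 for x in history) / len(history)
    std = var ** 0.5 or 1.0  # guard against a constant history
    return abs(latest - mean) / std

def score_from_feast(sensor_id, latest, repo_path="."):
    # Hypothetical: assumes a 'sensor_stats' feature view exposing a
    # 'recent_readings' feature that holds the short-term history.
    from feast import FeatureStore

    store = FeatureStore(repo_path=repo_path)
    resp = store.get_online_features(
        features=["sensor_stats:recent_readings"],
        entity_rows=[{"sensor_id": sensor_id}],
    ).to_dict()
    return anomaly_score(resp["recent_readings"][0], latest)
```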
ML without feature store - sliding windows
Sliding windows provide an alternative to feature stores with a one-to-one relationship between input/output topics and data sources. The Hands-On section Pulsar ML with sliding windows describes an example.
A very good explanation of sliding windows is found in the blog entry Event Processing Design Patterns with Pulsar Functions.
The window is filled until the specified condition is met, for example when the function has received #window-length-count events, and then the process() method gets control. This allows the function to collect enough data before computing an anomaly score over the entire window of time-series data. The result of the anomaly scoring, a window of time-series data matching the length of the input window, is returned to the output topic.
Note that a merging step is required to reassemble the anomaly score segments into a proper time series.
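The two steps just described, per-window scoring in process() and downstream merging, can be sketched without the Pulsar runtime. This is a minimal illustration: the z-score model is a placeholder, and the merge assumes non-overlapping, in-order windows (the tumbling case); overlapping windows would additionally need de-duplication or averaging.

```python
def window_scores(window):
    """What process() could compute once the window condition is met:
    one anomaly score per element, as distance from the window mean
    in units of the window's standard deviation (placeholder model)."""
    mean = sum(window) / len(window)
    std = (sum((x - mean) ** 2 for x in window) / len(window)) ** 0.5 or 1.0
    return [abs(x - mean) / std for x in window]

def merge_segments(segments):
    """Reassemble the per-window score segments from the output topic
    into one time series; assumes non-overlapping, in-order windows."""
    return [score for segment in segments for score in segment]
```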
Online Learning
For a smaller number of input topics, online learning provides an alternative. Packages like Scikit-Multiflow and its successor River-ML focus on online classification, regression and forecasting*, while Yahoo's Vowpal Wabbit has been designed for reinforcement learning.
*Online forecasting with ARIMA models became popular in the last 10 years. See Online ARIMA Algorithms for Time Series Prediction and Online Forecasting and Anomaly Detection Based on the ARIMA Model
For more background see also Machine Learning for Data Streams with Practical Examples in MOA
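To make the online-learning idea concrete, here is a dependency-free sketch of the learn-one/predict-one loop that packages like River and Scikit-Multiflow implement (their real APIs are richer; the class below is a plain SGD linear regression written only for illustration).

```python
class OnlineLinearRegression:
    """Minimal learn_one/predict_one loop in the style of River:
    the model is updated one event at a time, so no training set
    needs to be stored alongside the stream."""

    def __init__(self, n_features, lr=0.01):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict_one(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x)) + self.b

    def learn_one(self, x, y):
        # stochastic gradient step on the squared error of this one event
        err = self.predict_one(x) - y
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * err
```

Because each event is consumed and discarded, a single Pulsar function per topic can train and serve the model at once, which is what keeps the topic count low in this setup.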
Pangio-com provided a couple of Pulsar functions, one of them with an online classification example; look for "Naive Bayes" in their GitHub repo.
I've tried Pangio-com's Naive Bayes example and finally got it working; details are found on the Hands-On page here.
Online learning easily allows us to get by with a limited number of topics.
Here is how online and offline learning with Pulsar Functions might come together:
ToDo: figure to explain how to save ~100 events in the context for anomaly detection. In this scenario we'd turn the context into a feature store for unsupervised, non-trainable learning.
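In the absence of that figure, the scenario can at least be sketched in code: a Pulsar Python function that keeps the last ~100 events per sensor in its context state, effectively using the context as a small feature store for unsupervised, non-trainable anomaly scoring. The `ContextStub` stands in for the real function context's put_state/get_state so the sketch runs without a broker; the message layout and the z-score model are assumptions.

```python
import json

WINDOW = 100  # ~100 events kept per sensor in function state

class ContextStub:
    """Stands in for the Pulsar function context's state API
    (put_state/get_state) so the sketch runs without a broker."""

    def __init__(self):
        self._state = {}

    def put_state(self, key, value):
        self._state[key] = value

    def get_state(self, key):
        return self._state.get(key)

def process(input_msg, context):
    """Append the reading to the per-sensor history held in state and
    emit an anomaly score once enough events have been collected."""
    event = json.loads(input_msg)  # assumed {"sensor_id": ..., "value": ...}
    key = "hist-" + event["sensor_id"]
    hist = json.loads(context.get_state(key) or "[]")
    hist = (hist + [event["value"]])[-WINDOW:]
    context.put_state(key, json.dumps(hist))
    if len(hist) < WINDOW:
        return None  # still filling the window
    mean = sum(hist) / len(hist)
    std = (sum((x - mean) ** 2 for x in hist) / len(hist)) ** 0.5 or 1.0
    return abs(event["value"] - mean) / std
```

Nothing is ever trained here: the state only holds the short-term history, which is exactly the "context as feature store" idea the ToDo above refers to.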