More about anomaly detection: handling "flatlines"

I'm trying to catch flatlines: areas with a significantly lower rate of change.

Flatlines, spectral analysis, spectral residuals and training a baby CNN

Starting with the Spectral Residual approach

The basic concept of this approach is to remove the boring part of an image in terms of the frequency spectrum: the log-spectra of most natural images can be shown to be quite similar

average spectrum

so the idea is to apply a Fast Fourier Transform, compute an averaged log-spectrum (by applying a convolution kernel to the transformed image), and subtract it from the original log-spectrum to obtain the spectral residual. Applying the inverse FFT then yields the saliency map of the original image. Please see

Saliency Detection: A Spectral Residual Approach

for the full paper.

This approach carries over to time series data: computing the spectral residual and applying the inverse FFT establishes the saliency map

Time series saliency map

so it can be shown that the saliency map detects the vertical-drop anomaly.
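
For illustration, here is a minimal sketch of the spectral residual computation for a 1-D series; the kernel width and the toy signal are illustrative choices of mine, not the reference implementation:

```python
import numpy as np

def saliency_map(x, q=3):
    """Spectral residual saliency map for a 1-D series (sketch).

    q is the width of the averaging kernel applied to the log-spectrum;
    3 is an arbitrary illustrative choice.
    """
    X = np.fft.fft(x)
    amplitude = np.abs(X)
    phase = np.angle(X)

    log_amp = np.log(amplitude + 1e-8)            # avoid log(0)
    avg_log_amp = np.convolve(log_amp, np.ones(q) / q, mode='same')
    spectral_residual = log_amp - avg_log_amp     # drop the "boring", average part

    # inverse FFT of the residual amplitude, keeping the original phase
    return np.abs(np.fft.ifft(np.exp(spectral_residual + 1j * phase)))

# toy example: a sine with a short dip injected around sample 250
t = np.linspace(0, 10, 500)
x = np.sin(2 * np.pi * t)
x[250:260] -= 2.0
s = saliency_map(x)
print(np.argsort(s)[-5:])    # the highest-saliency samples should cluster around the dip
```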

Digging deeper into Spectral Analysis to detect flatlines

The starting point is that flatlines should show up as areas with exceptionally low activity over the whole spectrum. However, the sensitivity of the spectral analysis depends on the window size, i.e. the length of the slices. Long slices tend to blur features: this is clearly visible in the following example of a flatline of 4, 8 or 12 data points and how the various window sizes fare in detecting this anomaly

Flatline spectrum

Computing the signal energy collapses the two-dimensional spectrogram into a one-dimensional function, to which typical anomaly detection methods (like z-scoring) can be applied

Flatline signal energy
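
As a sketch of that computation, assuming scipy's spectrogram (the slice length and the toy signal are illustrative, not the values used in the notebook):

```python
import numpy as np
from scipy.signal import spectrogram

def energy_zscores(x, fs=1.0, nperseg=8):
    """Signal energy per spectrogram slice, z-scored (sketch).

    nperseg is the slice length; as noted above, longer slices blur
    short flatlines, so this is the knob to experiment with.
    """
    f, t, Sxx = spectrogram(x, fs=fs, nperseg=nperseg)
    energy = Sxx.sum(axis=0)                             # collapse the frequency axis
    z = (energy - energy.mean()) / (energy.std() + 1e-12)
    return t, z

# toy series with a 12-point flatline injected at sample 500
rng = np.random.default_rng(42)
x = np.sin(np.linspace(0, 20 * np.pi, 1000)) + 0.2 * rng.standard_normal(1000)
x[500:512] = x[500]
t, z = energy_zscores(x)
print(t[int(np.argmin(z))])    # the lowest-energy slice should fall inside the flatline
```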

For the elliptic envelope we need more than one dimension, so I tend to compute two signal energies: one after applying a low-pass filter and one after applying a high-pass filter.
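
A minimal sketch of that two-dimensional feature, assuming scipy's Butterworth filters and scikit-learn's EllipticEnvelope; cut-off frequency, window length and contamination are illustrative values, not tuned parameters:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt
from sklearn.covariance import EllipticEnvelope

def band_energies(x, fs=1.0, cutoff=0.1, win=32):
    """Two features per window: low-pass and high-pass signal energy (sketch)."""
    sos_lo = butter(4, cutoff, btype='lowpass', fs=fs, output='sos')
    sos_hi = butter(4, cutoff, btype='highpass', fs=fs, output='sos')
    lo, hi = sosfiltfilt(sos_lo, x), sosfiltfilt(sos_hi, x)

    n = len(x) // win
    return np.column_stack([
        (lo[:n * win].reshape(n, win) ** 2).sum(axis=1),   # low-frequency energy
        (hi[:n * win].reshape(n, win) ** 2).sum(axis=1),   # high-frequency energy
    ])

rng = np.random.default_rng(0)
x = np.sin(np.linspace(0, 20 * np.pi, 1000)) + 0.2 * rng.standard_normal(1000)
x[512:576] = x[512]                               # inject a flatline (windows 16 and 17)

feats = band_energies(x)
labels = EllipticEnvelope(contamination=0.1).fit_predict(feats)
print(np.where(labels == -1)[0])                  # flagged windows; the flatline should be among them
```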

The convolutional neural network approach

Suffice it to say that this CNN approach is a very quick and very dirty one.

The first CNN tackles prediction, although it's just a hack: no proper data scaling prior to training, no scoring, no systematic accuracy checking, just a few random tests. The second CNN addresses flatline detection; it's as hacky as the first one, with utterly imbalanced data. The only remarkable part is that Dropout (with a rate of 0.25) makes model performance significantly worse, to the point of non-convergence.

This is the model I ended up with for binary classification

binary classification model

Note that I gave up on rendering it as SVG - see below.
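
For orientation, here is a minimal Keras sketch of such a 1-D CNN binary classifier; the layer sizes and window length are assumptions, not the exact architecture from the figure:

```python
from tensorflow import keras
from tensorflow.keras import layers

window_len = 32          # samples per input window (assumption)

model = keras.Sequential([
    layers.Conv1D(16, kernel_size=3, activation='relu', input_shape=(window_len, 1)),
    layers.MaxPooling1D(pool_size=2),
    layers.Conv1D(32, kernel_size=3, activation='relu'),
    layers.GlobalMaxPooling1D(),
    layers.Dropout(0.25),              # Dropout(0.25) actually hurt convergence in my runs
    layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()                        # printing the graph never worked for me, summary() does
```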

Visualizing and exploring the model

After expanding the notebook to save the Keras model as an HDF5 file, Netron allows for rendering and introspection

Netron
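
Saving the model for Netron boils down to one line (the file name is arbitrary; model refers to the sketch above):

```python
from tensorflow import keras

# 'model' is the Keras model from the sketch above; the file name is arbitrary
model.save('flatline_cnn.h5')                       # .h5 extension -> HDF5 format
reloaded = keras.models.load_model('flatline_cnn.h5')
```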

Although built for earth-science data, Panoply is well suited to display the model architecture

Panoply1

and specific weights

Panoply2

Admittedly I couldn't make too much sense out of the conv1d weight tensor plot.

Side comments:

  • It took me quite some time to get the dimensions right for the input. You have to reshape your numpy arrays and that's the hardest part - 90% of my Google searches were about how to get this right, and printing .shape became my dear friend (see the reshaping sketch after this list).
  • Keras is very high-level and it's very easy to define the model to your liking. It's equally easy to mistype and get it utterly wrong: seemingly innocuous typos can take hours to get straightened out. I managed to slip the wrong array of labels into the classification model - floats instead of {0,1} binary labels - which took some head-scratching.
  • Printing the model graph never worked for me, it only rendered the first half of the model. In the end I resorted to model.summary().
  • I still need to score the results (no F1 score, no accuracy yet)
  • I haven't done any systematic hyperparameter tuning - see also last link below.
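
Here is the reshaping sketch mentioned above: toy numpy arrays brought into the (samples, timesteps, features) shape that Conv1D expects (the sizes are arbitrary):

```python
import numpy as np

# Toy data: 200 windows of 32 samples each, plus one binary label per window.
n_windows, window_len = 200, 32
raw = np.random.rand(n_windows, window_len)        # shape (200, 32)
labels = np.random.randint(0, 2, size=n_windows)   # 0/1 labels, not floats!

# Conv1D expects (samples, timesteps, features) - add the trailing feature axis.
X = raw.reshape(n_windows, window_len, 1)          # shape (200, 32, 1)
print(raw.shape, X.shape, labels.shape)            # .shape is your friend

# A single window for prediction still needs the leading batch axis:
one = raw[0].reshape(1, window_len, 1)             # shape (1, 32, 1)
```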

In the end I managed to nudge the CNN to convergence and got the following result:

CNN flatline detector

I'd recommend extending this approach to multi-class anomalies with more balanced data, i.e. equally many examples for each class. Maybe Microsoft's approach of computing the saliency map first and then applying a trained CNN makes sense; see here for more information.

Here are a couple of external web pages on convolutional neural networks for time series data:

  • While the company went downhill, their guides on TensorFlow and Keras are still quite valuable

3 example CNNs on TensorFlow, including a 1D CNN example: https://missinglink.ai/guides/tensorflow/building-convolutional-neural-networks-tensorflow-three-examples/

  • More on 2D CNNs at the conceptual level (this one is in German): 2D CNNs concepts

  • An example of a multi-class classification CNN for detecting heartbeat anomalies

Multi-class CNN for heartbeat anomaly detection; arXiv article on heartbeat anomaly detection

  • The ubiquitous Jason Brownlee on CNNs, with code examples to get your input data properly reshaped

Jason Brownlee on CNN for binary classification

  • ... and as an example of binary classification.

Jason Brownlee on CNN for binary classification

  • On tuning

https://stats.stackexchange.com/questions/352036/what-should-i-do-when-my-neural-network-doesnt-learn

  • and this one on a very systematic approach to Keras hyperparameter tuning. It describes how to permute through hyperparameter choices and how to visualize the results. I haven't tried this approach yet but will definitely do so before I move any of our CNNs to production.

Hyperparameter optimization with Keras

Some more pages on the background

  • My personal favorite: this tutorial starts with computational graphs for tensors and then dives into affine 2D transformations. It then introduces the perceptron as a binary classifier and shows how to turn finding a decision boundary into a perceptron with a proper weight matrix and bias vector (and a sigmoid as decision function). The next chapter introduces loss functions: finding "good weights" for the perceptron is posed as maximizing the likelihood of the data with respect to the weights (maximum likelihood estimation), or equivalently minimizing the cross-entropy loss. The next two chapters deal with the minimization process: they start with gradient descent with respect to the weight matrix and bias vector and extend this approach to multi-layer perceptrons. (The key formulas are sketched after this list.)

Deep learning from scratch

  • Almost as good as the first reference, again according to my very personal taste.

Mathematical background of deep learning
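
To make the "maximum likelihood vs. cross-entropy" equivalence from the first reference concrete, here are the usual formulas for a binary perceptron (my notation, not the tutorial's):

```latex
% Perceptron with sigmoid output for binary labels y_i \in \{0, 1\}
\hat{y}_i = \sigma(W x_i + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}

% Maximizing the likelihood \prod_i \hat{y}_i^{y_i} (1 - \hat{y}_i)^{1 - y_i}
% over W and b is equivalent to minimizing the cross-entropy loss
L(W, b) = -\sum_i \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]

% Gradient descent on the weight matrix and bias vector with learning rate \eta
W \leftarrow W - \eta \, \frac{\partial L}{\partial W}, \qquad
b \leftarrow b - \eta \, \frac{\partial L}{\partial b}
```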