More about anomaly detection: handling "flatlines" - sedgewickmm18/mmfunctions GitHub Wiki
I'm trying to catch flatlines, i.e. areas with a significantly lower rate of change.
Flatlines, spectral analysis, spectral residuals and training a baby CNN
Starting with the Spectral Residual approach
The basic concept of this approach is to remove the boring part of an image in terms of its frequency spectrum: the log-spectra of most natural pictures can be shown to be quite similar.
So the idea is to apply a Fast Fourier Transform, compute an averaged log-spectrum (by applying a convolution kernel to the transformed image), and subtract it from the original log-spectrum to obtain the spectral residual. Applying the inverse FFT then yields the saliency map of the original image. Please see here
Saliency Detection: A Spectral Residual Approach
for the full paper.
This approach carries over to time series data: computing the spectral residual and applying the inverse FFT establishes the saliency map,
and it can be shown that the saliency map detects the vertical drop anomaly.
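Here is a minimal sketch of that computation for a 1-D series, using only numpy; this is not the exact implementation in mmfunctions, and the kernel size and toy signal are arbitrary choices:

```python
import numpy as np

def saliency_map(series, kernel_size=3):
    """Spectral residual: FFT -> log amplitude -> subtract local average -> inverse FFT."""
    fft = np.fft.fft(series)
    amplitude = np.abs(fft)
    log_amplitude = np.log(amplitude + 1e-8)           # avoid log(0)

    # averaged log-spectrum via a simple moving-average convolution kernel
    kernel = np.ones(kernel_size) / kernel_size
    avg_log_amplitude = np.convolve(log_amplitude, kernel, mode='same')

    spectral_residual = log_amplitude - avg_log_amplitude

    # recombine the residual amplitude with the original phase and transform back
    phase = np.angle(fft)
    saliency = np.abs(np.fft.ifft(np.exp(spectral_residual + 1j * phase)))
    return saliency

# toy example: a sine wave with a sudden vertical drop
t = np.linspace(0, 10, 500)
signal = np.sin(2 * np.pi * t)
signal[250:] -= 2.0                                     # vertical drop anomaly
print(saliency_map(signal).argmax())                    # should peak near index 250
```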
Digging deeper into Spectral Analysis to detect flatlines
The starting point is that flatlines should show up as areas with exceptionally low activity over the whole spectrum. However, the sensitivity of the spectral analysis depends on the window size, i.e. the length of the slices. Long slices tend to blur features: this can be seen clearly in the following example of a flatline with 4, 8 or 12 data points and how various window sizes fare in detecting this anomaly.
Computing the signal energy collapses the two-dimensional spectrogram into a one-dimensional function, to which typical anomaly detection methods (like z-scoring) can be applied.
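A minimal sketch of that reduction, assuming scipy's `spectrogram`; varying `window_size` (the `nperseg` slice length) reproduces the window-size comparison from above:

```python
import numpy as np
from scipy.signal import spectrogram

def energy_zscore(series, fs=1.0, window_size=8):
    # 2-D spectrogram: rows are frequency bins, columns are time slices
    freqs, times, Sxx = spectrogram(series, fs=fs, nperseg=window_size)
    energy = Sxx.sum(axis=0)                            # sum over frequencies -> 1-D energy per slice
    z = (energy - energy.mean()) / (energy.std() + 1e-12)
    return times, z

# slices with strongly negative z-scores are flatline candidates
```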
For elliptic envelope we need more than one dimension, so I tend to split the spectrum: I compute signal energies after applying a low-pass filter and, separately, after applying a high-pass filter.
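And a sketch of that multi-dimensional variant, assuming scipy and scikit-learn; the band cutoff and contamination rate are arbitrary assumptions:

```python
import numpy as np
from scipy.signal import spectrogram
from sklearn.covariance import EllipticEnvelope

def band_energy_outliers(series, fs=1.0, window_size=8, cutoff=0.25, contamination=0.05):
    freqs, times, Sxx = spectrogram(series, fs=fs, nperseg=window_size)
    low = Sxx[freqs <= cutoff * fs / 2].sum(axis=0)     # energy in the low-frequency band
    high = Sxx[freqs > cutoff * fs / 2].sum(axis=0)     # energy in the high-frequency band
    features = np.column_stack([low, high])             # 2-D points: one per time slice
    labels = EllipticEnvelope(contamination=contamination).fit_predict(features)
    return times, labels                                # -1 marks outlier slices
```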
The convolutional neural network approach
Suffice it to say that this CNN approach is very quick and very dirty.
The first CNN tackles prediction, although it's just a hack: no proper data scaling prior to training, no scoring, no systematic accuracy checking, just a few random tests. The second CNN addresses flatline detection: it's as hacky as the first one, with utterly imbalanced data. The only remarkable part is that Dropout (of a quarter of the units) makes model performance significantly worse, to the point of non-convergence.
This is the model I ended up with for binary classification
Note that I gave up on rendering it as an SVG - see below.
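Since the rendered graph is missing here, the following is a hypothetical Conv1D binary classifier in the same spirit; the layer sizes and window length are assumptions, not the exact model from the notebook:

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense

window_length = 32   # samples per input window (assumption)

model = Sequential([
    Conv1D(16, kernel_size=3, activation='relu', input_shape=(window_length, 1)),
    MaxPooling1D(pool_size=2),
    Conv1D(32, kernel_size=3, activation='relu'),
    Flatten(),
    Dense(16, activation='relu'),
    Dense(1, activation='sigmoid'),      # binary flatline / no-flatline output
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()
```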
Visualizing and exploring the model
After extending the notebook to save the Keras model as an HDF5 file, Netron
allows for rendering and introspection.
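For reference, a tiny sketch of that saving step, assuming the trained Keras model object from the sketch above is named `model`; the file name is an arbitrary choice:

```python
# saving in HDF5 format produces a file that Netron can open directly
model.save('flatline_cnn.h5')
```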
Although built for earth science data, Panoply is well suited to display the model architecture and specific weights.
Admittedly I couldn't make too much sense out of the conv1d weight tensor plot.
Side comments:
- It took me quite some time to get the dimensions right for the input. You have to reshape your numpy arrays, and that's the hardest part: 90% of my Google searches were about how to get this right, and printing `.shape` became my dear friend (see the reshaping sketch after this list).
- Keras is very high-level and it's very easy to define the model to your liking. It's equally easy to mistype and get it utterly wrong: seemingly innocuous typos can take hours to get straightened out. I managed to slip the wrong array for labels into the classification model, with floats instead of {0,1} binary labels - that took some scratching of the head.
- Printing the model graph never worked for me; it only rendered the first half of the model. In the end I resorted to `model.summary()`.
- I still need to score the results (no F1, no accuracy yet).
- I haven't done any systematic hyperparameter tuning - see also the last link below.
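Here is a small sketch of the reshaping mentioned in the first side comment, with made-up dimensions; Conv1D expects input of shape `(samples, timesteps, channels)`:

```python
import numpy as np

window_length = 32
n_windows = 100

X = np.random.rand(n_windows, window_length)    # (samples, timesteps)
print(X.shape)                                   # (100, 32)

X = X.reshape(n_windows, window_length, 1)       # add the channel axis
print(X.shape)                                   # (100, 32, 1) -> fits Conv1D input_shape=(32, 1)

y = np.random.randint(0, 2, size=n_windows)      # {0,1} integer labels, not floats
```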
In the end I managed to nudge the CNN to convergence and got the following result:
I'd recommend extending this approach to multi-class anomalies with more balanced data, i.e. equally many examples for each class. Maybe Microsoft's approach of computing the saliency map first and then applying a trained CNN makes sense; see also here for more information.
Here are a couple of external web pages on convolutional neural networks for time series data:
- While the company went downhill, their guides on TensorFlow and Keras are still quite valuable: [3 example CNNs on TensorFlow](https://missinglink.ai/guides/tensorflow/building-convolutional-neural-networks-tensorflow-three-examples/) (1D CNN example)
- More on 2D CNNs, on the conceptual level (this one is in German): 2D CNNs concepts
- An example of a multi-class classification CNN for detecting heartbeat anomalies: Multi-class CNN for heartbeat anomaly detection (arXiv article on heartbeat anomaly detection)
- The ubiquitous Jason Brownlee on CNNs, with code examples to get your input data properly reshaped: Jason Brownlee on CNN for binary classification
- ... and an example on binary classification: Jason Brownlee on CNN for binary classification
- On tuning: [What should I do when my neural network doesn't learn?](https://stats.stackexchange.com/questions/352036/what-should-i-do-when-my-neural-network-doesnt-learn)
- And this one takes a very systematic approach to Keras hyperparameter tuning. It describes how to permute through hyperparameter choices and how to visualize the results. I haven't tried this approach yet but will definitely do so before I move any of our CNNs to production: Hyperparameter optimization with Keras
Some more pages on the background
- My personal favorite: this tutorial starts with computational graphs for tensors and then dives into affine 2D transformations. It then introduces the perceptron as a binary classifier and shows how to turn finding a decision boundary into a perceptron with proper matrix and bias vector weights (and a sigmoid as decision boundary function). The next chapter introduces loss functions: the problem of finding "good weights" for the perceptron is posed as choosing the weights that maximize the likelihood of the data, or equivalently minimize the cross-entropy loss. The next two chapters deal with the minimization process: they start with gradient descent with regard to the matrix and bias vector and extend this approach to multi-layer perceptrons.
- Almost as good as the first reference, again according to my very personal taste.