Anomaly Detection - utkaln/machine-learning GitHub Wiki
- This algorithm models the data with a Normal (Gaussian) distribution
- It flags data points that fall outside the high-probability region of that distribution as anomalies
- Useful for detecting new types of defects without having seen any faulty samples beforehand
- Because the method is unsupervised, feature choice matters: pick features whose values are approximately Gaussian
- A simple rule of thumb is to set the threshold at 5%: treat `p(x) > 5%` as normal and `p(x) < 5%` as an anomaly
Gaussian Distribution (Normal Distribution)
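$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
Where $\mu$ is the mean and $\sigma^2$ is the variance of the feature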
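As a minimal sketch of how the pieces fit together, the following assumes each feature is modelled by its own univariate Gaussian, that the features are treated as independent (so the densities multiply), and that the data is synthetic. The helper names `estimate_gaussian` and `density` and the value of `epsilon` are illustrative choices, not part of any library.

```python
import numpy as np

def estimate_gaussian(X):
    """Per-feature mean and variance of X, shape (m, n)."""
    mu = X.mean(axis=0)
    var = X.var(axis=0)  # divides by m, not m - 1
    return mu, var

def density(X, mu, var):
    """p(x) for each row: product of per-feature Gaussian densities."""
    coef = 1.0 / np.sqrt(2.0 * np.pi * var)
    expo = np.exp(-((X - mu) ** 2) / (2.0 * var))
    return np.prod(coef * expo, axis=1)

# Synthetic "normal" training data (assumption: two Gaussian features).
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[5.0, 10.0], scale=[1.0, 2.0], size=(1000, 2))
mu, var = estimate_gaussian(X_train)

X_new = np.array([
    [5.1, 9.5],    # close to the training mean
    [9.0, 20.0],   # far from the training mean on both features
])
p = density(X_new, mu, var)

epsilon = 0.05  # threshold from the rule of thumb; an assumption to tune
is_anomaly = p < epsilon
print(is_anomaly)
```

Because `p(x)` is a density, its magnitude depends on the scale and number of features, so a fixed threshold such as 0.05 is only a starting point and would normally be adjusted for the data at hand.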
Feature Selection Criteria
- Draw a histogram to check whether the feature follows a Gaussian distribution
- If the feature does not look Gaussian, try a transformation that brings it closer to Gaussian, for example `log(x)`, `log(x + c)`, or `sqrt(x)`
- A quick way to draw the histogram in Python, assuming `numpy` is imported as `np` and `matplotlib.pyplot` as `plt`:
`plt.hist(np.log(x + 0.001), bins=50, color='red')`
- If a known anomaly still receives a high probability (its value sits well inside the normal distribution), look for a different feature on which that example falls outside the normal range
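The transformation step above can also be checked numerically alongside the histogram. The sketch below compares the sample skewness of a few candidate transforms, on the assumption that a value near zero suggests a more symmetric, more Gaussian-like shape. The `skewness` helper, the synthetic right-skewed feature, and the offset `0.001` are illustrative assumptions.

```python
import numpy as np

def skewness(v):
    """Sample skewness: near 0 for a symmetric distribution such as a Gaussian."""
    v = np.asarray(v, dtype=float)
    centered = v - v.mean()
    return np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5

# Synthetic right-skewed feature (assumption: log-normally distributed).
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=5000)

# Candidate transforms from the list above.
candidates = {
    "x": x,
    "log(x + c)": np.log(x + 0.001),
    "sqrt(x)": np.sqrt(x),
}

skews = {name: skewness(v) for name, v in candidates.items()}
best = min(skews, key=lambda name: abs(skews[name]))

for name, s in skews.items():
    print(f"{name:>10}: skewness = {s:+.2f}")
print("least skewed:", best)
```

Skewness captures only asymmetry, so it complements the histogram rather than replacing it; plotting the least-skewed candidate remains the visual check.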