Anomaly Detection - utkaln/machine-learning GitHub Wiki

  • This algorithm models the data with a Normal (Gaussian) distribution
  • It flags data points that fall outside the high-probability region as anomalies
  • It is useful for detecting new types of defects without having seen faulty samples beforehand
  • Because the algorithm is unsupervised, feature choice matters: prefer features whose values are approximately Gaussian
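  • A simple rule of thumb: pick a threshold ε (for example 0.05) and treat p(x) ≥ ε as normal and p(x) < ε as an anomaly. Since p(x) is a density whose scale depends on the data, ε is usually tuned rather than fixed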
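
The points above can be sketched in a few lines of Python. This is a minimal illustration for a single feature, not a definitive implementation: the helper names (`fit_gaussian`, `gaussian_pdf`, `is_anomaly`), the sample readings, and the default `epsilon=0.05` are all assumptions made for the example.

```python
import math
from statistics import fmean, pvariance

def fit_gaussian(values):
    """Estimate the mean and variance of one feature from (mostly normal) data."""
    mu = fmean(values)
    var = pvariance(values, mu)  # population variance: divides by the sample count
    return mu, var

def gaussian_pdf(x, mu, var):
    """Density of a univariate Gaussian with mean mu and variance var."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def is_anomaly(x, mu, var, epsilon=0.05):
    """Flag x when its density falls below the threshold epsilon."""
    return gaussian_pdf(x, mu, var) < epsilon

# Hypothetical training data: readings taken from healthy units only.
train = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]
mu, var = fit_gaussian(train)

# A reading near the mean gets a high density; one far away gets a tiny density.
for reading in (10.05, 12.0):
    print(reading, gaussian_pdf(reading, mu, var), is_anomaly(reading, mu, var))
```

With several features, a common extension is to fit each feature separately and multiply the per-feature densities before comparing against the threshold.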

Gaussian Distribution (Normal Distribution)

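$$p(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

Where

  • μ is the mean of the feature
  • σ² is the variance of the feature (σ is the standard deviation)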

Feature Selection Criteria

  • Draw a histogram to check whether the feature follows a Gaussian distribution
  • If the feature does not look Gaussian, try a transformation that makes it closer to Gaussian, for example log(x), log(x + c), or sqrt(x)
  • A quick way to draw the histogram in Python (with numpy as np and matplotlib.pyplot as plt): plt.hist(np.log(x + 0.001), bins=50, color='red')
  • If a known anomaly still gets a high probability under the current features (that is, it looks normal), look for an additional feature that places that example outside the normal range
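
The histogram check above is visual. As a rough numeric complement, one option is to compare sample skewness before and after a transformation. Skewness is an assumption introduced here and is not part of the original notes; the `skewness` helper and the sample values are hypothetical. A value near 0 suggests a roughly symmetric, more Gaussian-like shape.

```python
import math
from statistics import fmean, pstdev

def skewness(values):
    """Sample skewness: close to 0 for symmetric (Gaussian-like) data."""
    mu = fmean(values)
    sigma = pstdev(values, mu)  # population standard deviation
    return fmean([((v - mu) / sigma) ** 3 for v in values])

# Hypothetical right-skewed feature (a long tail of large values).
raw = [1, 2, 2, 3, 3, 4, 5, 8, 15, 40, 120]

# Same transformation as in the histogram example: log(x + c) with a small c.
logged = [math.log(v + 0.001) for v in raw]

raw_skew = skewness(raw)
log_skew = skewness(logged)
print("skewness before transform:", raw_skew)
print("skewness after log transform:", log_skew)
```

If the transformed skewness is still far from 0, trying another transformation such as sqrt(x) and re-checking the histogram is a reasonable next step.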