Clustream: Online Micro-cluster Maintenance

The micro-clustering phase is the online part of the algorithm, in which summary statistics are collected from the data stream as it arrives. In powerlog, this online phase will run on multiple sites.
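To make the "statistical data" concrete, here is a minimal sketch of a micro-cluster summary as described in the Clustream paper: per-dimension sums and squared sums of the values, sums and squared sums of the timestamps, and a point count. The class and method names (`MicroCluster`, `insert`, `merge`, `centroid`) are hypothetical and are not taken from the powerlog code base.

```java
import java.util.Arrays;

/** Hypothetical micro-cluster summary; all names here are illustrative only. */
final class MicroCluster {
    private final double[] linearSum;   // CF1x: per-dimension sum of values
    private final double[] squaredSum;  // CF2x: per-dimension sum of squared values
    private double timeSum;             // CF1t: sum of timestamps
    private double timeSquaredSum;      // CF2t: sum of squared timestamps
    private long count;                 // n: number of points absorbed

    MicroCluster(int dimensions) {
        linearSum = new double[dimensions];
        squaredSum = new double[dimensions];
    }

    /** Absorbs one data point observed at the given timestamp. */
    void insert(double[] point, long timestamp) {
        if (point.length != linearSum.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        for (int d = 0; d < point.length; d++) {
            linearSum[d] += point[d];
            squaredSum[d] += point[d] * point[d];
        }
        timeSum += timestamp;
        timeSquaredSum += (double) timestamp * timestamp;
        count++;
    }

    /** Adds another summary into this one; every statistic is additive. */
    void merge(MicroCluster other) {
        if (other.linearSum.length != linearSum.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        for (int d = 0; d < linearSum.length; d++) {
            linearSum[d] += other.linearSum[d];
            squaredSum[d] += other.squaredSum[d];
        }
        timeSum += other.timeSum;
        timeSquaredSum += other.timeSquaredSum;
        count += other.count;
    }

    /** Returns the mean of all absorbed points. */
    double[] centroid() {
        if (count == 0) {
            throw new IllegalStateException("empty micro-cluster");
        }
        double[] c = Arrays.copyOf(linearSum, linearSum.length);
        for (int d = 0; d < c.length; d++) {
            c[d] /= count;
        }
        return c;
    }

    long count() {
        return count;
    }
}
```

Because every field is a plain sum, two summaries can be combined by adding them field by field. This is what `merge` does, and it means a summary never needs the raw points again once they have been absorbed.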

Initialisation

The Clustream algorithm specifies that the initial micro-clusters are created offline by running a standard k-means clustering algorithm [1] on the first InitNumber data points. Because this step is secondary to the scope of this project, the clustering implementations in the Apache Commons Math3 library will be used. KMeans++ will be the default method, and the choice of method may be made configurable.
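The initialisation step can be sketched as follows, assuming a hypothetical `InitialClusterer` strategy interface that keeps the clustering method configurable. To stay self-contained, the sketch uses a plain Lloyd's k-means (`SimpleKMeans`) as a stand-in and is not KMeans++. In powerlog, an adapter around the Commons Math3 `KMeansPlusPlusClusterer` could presumably implement the same interface. All class and method names below are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical strategy interface that keeps the initial clustering method configurable. */
interface InitialClusterer {
    /** Returns, for each point, the index (0 to k-1) of the cluster it is assigned to. */
    int[] assign(List<double[]> points, int k);
}

/** Stand-in clusterer: plain Lloyd's k-means seeded with the first k points (not KMeans++). */
final class SimpleKMeans implements InitialClusterer {
    private final int maxIterations;

    SimpleKMeans(int maxIterations) {
        this.maxIterations = maxIterations;
    }

    @Override
    public int[] assign(List<double[]> points, int k) {
        int dim = points.get(0).length;
        double[][] centres = new double[k][];
        for (int c = 0; c < k; c++) {
            centres[c] = points.get(c).clone();            // deterministic seeding
        }
        int[] labels = new int[points.size()];
        for (int iter = 0; iter < maxIterations; iter++) {
            boolean changed = false;
            for (int i = 0; i < points.size(); i++) {      // assignment step
                int best = nearest(points.get(i), centres);
                if (best != labels[i]) {
                    labels[i] = best;
                    changed = true;
                }
            }
            if (!changed && iter > 0) {
                break;                                     // assignments are stable
            }
            double[][] sums = new double[k][dim];          // update step
            int[] counts = new int[k];
            for (int i = 0; i < points.size(); i++) {
                counts[labels[i]]++;
                for (int d = 0; d < dim; d++) {
                    sums[labels[i]][d] += points.get(i)[d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue;                              // empty cluster keeps its old centre
                }
                for (int d = 0; d < dim; d++) {
                    centres[c][d] = sums[c][d] / counts[c];
                }
            }
        }
        return labels;
    }

    /** Index of the centre with the smallest squared Euclidean distance to p. */
    private static int nearest(double[] p, double[][] centres) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centres.length; c++) {
            double dist = 0.0;
            for (int d = 0; d < p.length; d++) {
                double diff = p[d] - centres[c][d];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }
}

final class MicroClusterInitialiser {
    /** Groups the first initNumber points of the stream into k initial clusters. */
    static List<List<double[]>> initialGroups(List<double[]> stream, int initNumber, int k,
                                              InitialClusterer clusterer) {
        if (k < 1 || initNumber < k || stream.size() < initNumber) {
            throw new IllegalArgumentException("need k >= 1 and at least initNumber >= k points");
        }
        List<double[]> head = new ArrayList<>(stream.subList(0, initNumber));
        int[] labels = clusterer.assign(head, k);
        List<List<double[]>> groups = new ArrayList<>();
        for (int c = 0; c < k; c++) {
            groups.add(new ArrayList<>());
        }
        for (int i = 0; i < head.size(); i++) {
            groups.get(labels[i]).add(head.get(i));
        }
        return groups;
    }
}
```

Each returned group would then be summarised into one initial micro-cluster. Swapping the clustering method only requires passing a different `InitialClusterer`, which is one way the configurable choice mentioned above could be realised.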

[1] E.W. Forgy (1965). "Cluster analysis of multivariate data: efficiency versus interpretability of classifications". Biometrics 21: 768–769