Mean shift LSH (DNNMS LSH)

Distributed Nearest Neighbours Mean Shift with Locality Sensitive Hashing (DNNMS-LSH)

This algorithm was created during an internship at the Computer Science Laboratory (Laboratoire d'Informatique de Paris Nord, LIPN) of the University of Paris 13, with Lebbah Mustapha, Duong Tarn, Azzag Hanene and Beck Gaël. Its purpose is to provide an efficient distributed implementation for clustering large multivariate, multidimensional data sets (Big Data). Nearest neighbor mean shift (NNMS) defines clusters in terms of locally dense regions of the data density. The main advantages of NNMS over k-means clustering are that it detects the number of clusters in the data set automatically and that it can detect non-ellipsoidal clusters. The exact nearest neighbor calculations in standard NNMS prevent it from being used on Big Data, so we introduce approximate nearest neighbors via Locality Sensitive Hashing (LSH), which is based on random scalar projections of the data. To further improve scalability, we implement NNMS-LSH in the distributed Spark/Scala ecosystem.
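To make the idea concrete, here is a minimal single-machine sketch in plain Python of one NNMS iteration with LSH-based approximate neighbors. It is an illustration only, not the Spark/Scala code from the repository: the function names `lsh_bucket_ids` and `nnms_lsh_step`, the equal-width bucketing of the projected values, and the parameters `k`, `num_buckets` and `seed` are all assumptions made for this example.

```python
import math
import random
from collections import defaultdict


def lsh_bucket_ids(points, num_buckets, seed=0):
    """Assign each point a bucket id from one random scalar projection.

    Hypothetical helper: the equal-width bucketing used here is an
    assumption of this sketch, not necessarily the project's scheme.
    """
    rng = random.Random(seed)
    dim = len(points[0])
    # Random direction; each point is reduced to the scalar <w, x>.
    w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    proj = [sum(wi * xi for wi, xi in zip(w, p)) for p in points]
    lo, hi = min(proj), max(proj)
    # Fall back to width 1.0 when all projections coincide.
    width = ((hi - lo) / num_buckets) or 1.0
    return [min(int((v - lo) / width), num_buckets - 1) for v in proj]


def nnms_lsh_step(points, k, num_buckets, seed=0):
    """One mean shift iteration with approximate nearest neighbors.

    Each point moves to the mean of its (at most) k nearest neighbors,
    searched only among the points that share its LSH bucket.
    """
    ids = lsh_bucket_ids(points, num_buckets, seed)
    buckets = defaultdict(list)
    for p, b in zip(points, ids):
        buckets[b].append(p)

    shifted = []
    for p, b in zip(points, ids):
        # Exact search restricted to one bucket: approximate overall.
        nearest = sorted(buckets[b], key=lambda q: math.dist(p, q))[:k]
        shifted.append([sum(c) / len(nearest) for c in zip(*nearest)])
    return shifted


if __name__ == "__main__":
    data = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
            [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]]
    for _ in range(3):
        data = nnms_lsh_step(data, k=2, num_buckets=2, seed=1)
    print(data)
```

Restricting the neighbor search to one bucket is what replaces the exact all-pairs search. In a distributed setting the buckets would also be a natural unit for partitioning the data, although how the actual implementation partitions its data is not described here.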

For more details, see https://github.com/Spark-clustering-notebook/Mean-Shift-LSH