G-Stream

Abstract

In recent years, the data stream clustering problem, which requires partitioning observations continuously under memory and time constraints, has gained considerable attention in the literature.
Real-time processing means that the ongoing data processing must have a very low response delay. Velocity, which refers to the high speed at which Big Data are generated (the speed of data in and out), is an important concept in the Big Data domain. Currently, Spark Streaming may be considered the most widely used streaming platform implementing the micro-batching approach.
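As an illustration of the micro-batching model, the minimal sketch below creates a Spark StreamingContext that groups incoming data into one-second batches. The socket source, hostname, port, and batch interval are placeholders chosen for the example, not settings prescribed by this project:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MicroBatchExample {
  def main(args: Array[String]): Unit = {
    // Each micro-batch groups the data received during a 1-second interval.
    val conf = new SparkConf().setAppName("MicroBatchExample").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))

    // A socket source is just one possible input; any DStream source works.
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Every transformation declared on the resulting DStream is then re-executed on each one-second batch as it arrives.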

The algorithm presented in [1], a data stream clustering approach based on the Growing Neural Gas algorithm, uses a stochastic approach to update the prototypes and was implemented on a "centralized" platform. In this work, we propose MBG-Stream [2], a novel algorithm for discovering clusters of arbitrary shape in an evolving data stream.
The MBG-Stream algorithm is implemented on a distributed streaming platform based on the micro-batching processing model, the Spark Streaming API. In the proposed algorithm, the topological structure is represented by a graph in which each node represents a cluster, i.e., a set of "close" data points, and neighboring nodes (clusters) are connected by edges. Starting with only two nodes, the graph is not of fixed size: it evolves as new nodes (clusters) are created at each iteration. We use an exponential fading function to reduce the impact of old data, whose relevance diminishes over time. For the same reason, the links between nodes are also weighted by an exponential function.
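For intuition, a common form of exponential fading function in stream clustering looks like the following; the decay rate $\lambda$ and the reference time $t_0$ are notational assumptions for this sketch, not quantities fixed by [2]:

```math
f(t) = 2^{-\lambda (t - t_0)}, \qquad \lambda > 0
```

Under such a function, a node created or last updated at time $t_0$ sees its weight shrink exponentially as $t$ grows, and edge weights can be aged in the same way, so stale clusters and stale links gradually lose influence and can be pruned.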

The data received in each interval are stored reliably across the cluster to form an input dataset for that interval. Once the time interval completes, this dataset is processed via deterministic parallel operations, such as Map and Reduce, to produce new datasets representing either program outputs or intermediate state. The input data is split, and the master assigns the splits to Map workers. Each worker processes its input split, generates key/value pairs, and writes them to intermediate files (on disk or in memory). The Reduce function aggregates the information received from the Map functions.
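A minimal sketch of this per-interval map/reduce pattern is shown below. The `nearestNodeId` helper is a hypothetical stand-in for the real prototype-assignment step, and the comma-separated socket input is an assumption for illustration; this is not the actual MBG-Stream code:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MapReducePerBatch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MapReducePerBatch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Each line is assumed to be a comma-separated numeric data point.
    val points = ssc.socketTextStream("localhost", 9999)
      .map(_.split(',').map(_.toDouble))

    // Map: emit (nodeId, (pointSum, count)) pairs for each point.
    // Reduce: aggregate the pairs per node within the current micro-batch.
    val perNode = points
      .map(p => (nearestNodeId(p), (p, 1L)))
      .reduceByKey { case ((sum1, n1), (sum2, n2)) =>
        (sum1.zip(sum2).map { case (a, b) => a + b }, n1 + n2)
      }
    perNode.print()

    ssc.start()
    ssc.awaitTermination()
  }

  // Hypothetical placeholder: map a point to the id of its closest prototype.
  def nearestNodeId(p: Array[Double]): Int = 0
}
```

Here `map` plays the role of the Map workers (producing key/value pairs keyed by cluster node) and `reduceByKey` plays the role of the Reduce function, aggregating the per-node statistics that would then drive the prototype updates for that interval.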

References:
[1] Mohammed Ghesmoune, Mustapha Lebbah, Hanene Azzag: Clustering Over Data Streams Based on Growing Neural Gas. PAKDD (2) 2015: 134-145.
[2] Mohammed Ghesmoune, Mustapha Lebbah, Hanene Azzag: Micro-Batching Growing Neural Gas for Clustering Data Streams Using Spark Streaming. INNS Conference on Big Data 2015: 158-166.