icp7m2 - gracesyl/big-data-hadoop GitHub Wiki

An overview of the machine learning algorithms used from MLlib in PySpark follows:

K-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. That mean (the cluster centroid) serves as the prototype of the cluster.
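To make the "nearest mean" idea concrete, here is a minimal pure-Python sketch of the usual iterative procedure (assign each point to the nearest center, then recompute each center as the mean of its points). It is an illustration only, not the MLlib implementation; the function names (`assign`, `update`, `kmeans`), the sample points, and the starting centers are all assumptions made for this example.

```python
import math

def assign(points, centers):
    """Return, for each point, the index of the nearest center (Euclidean distance)."""
    return [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points]

def update(points, labels, centers):
    """Recompute each center as the mean of the points assigned to it."""
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:
            new_centers.append(old)  # an empty cluster keeps its previous center
            continue
        new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return new_centers

def kmeans(points, centers, max_iter=100):
    """Alternate assignment and update steps until the centers stop moving."""
    labels = assign(points, centers)
    for _ in range(max_iter):
        new_centers = update(points, labels, centers)
        if new_centers == centers:
            break
        centers = new_centers
        labels = assign(points, centers)
    return centers, labels

# Hypothetical data: two well-separated groups of 2-D points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centers, labels = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(labels)
print(centers)
```

With this data the first three points end up in one cluster and the last three in the other, and each final center is the mean of its group.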

The decision tree algorithm belongs to the family of supervised learning algorithms. Unlike supervised learning algorithms that handle only one kind of task, decision trees can be used for both classification and regression problems.
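The sketch below illustrates how one tree-building routine can serve both tasks: only the leaf prediction and the error measure change. It fits a single-split tree (a "stump") on one numeric feature. The split criteria used here (misclassification count and squared error) are assumptions chosen for simplicity and are not necessarily what MLlib uses; `fit_stump`, `majority`, and the sample data are hypothetical names and values.

```python
from collections import Counter
from statistics import mean

def majority(ys):
    """Most common label in ys (leaf prediction for classification)."""
    return Counter(ys).most_common(1)[0][0]

def fit_stump(xs, ys, leaf_value, loss):
    """Fit a one-split tree: try each distinct x (except the largest) as a
    threshold and keep the one with the lowest total loss over both leaves."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        l_val, r_val = leaf_value(left), leaf_value(right)
        cost = (sum(loss(y, l_val) for y in left)
                + sum(loss(y, r_val) for y in right))
        if best is None or cost < best[0]:
            best = (cost, t, l_val, r_val)
    _, t, l_val, r_val = best
    return lambda x: l_val if x <= t else r_val

xs = [1, 2, 3, 10, 11, 12]

# Classification: leaves predict the majority class, loss counts mistakes.
clf = fit_stump(xs, ["low", "low", "low", "high", "high", "high"],
                leaf_value=majority, loss=lambda y, p: y != p)

# Regression: leaves predict the mean target, loss is squared error.
reg = fit_stump(xs, [1.0, 1.2, 0.8, 5.0, 5.2, 4.8],
                leaf_value=mean, loss=lambda y, p: (y - p) ** 2)

print(clf(2), clf(11))
print(reg(2), reg(11))
```

The same `fit_stump` function produces a classifier or a regressor depending only on the two functions passed in.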

Random forests, or random decision forests, are an ensemble learning method for classification, regression, and other tasks. They operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the individual trees' predictions (classification) or the mean of their predictions (regression).
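The aggregation step described above (mode for classification, mean for regression) can be sketched in a few lines. This covers only how the individual predictions are combined, not how the trees are trained; the "trees" here are hand-written stand-in functions, and all names and thresholds are hypothetical.

```python
from collections import Counter
from statistics import mean

def forest_classify(trees, x):
    """Return the mode (most common value) of the trees' class predictions."""
    votes = [tree(x) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

def forest_regress(trees, x):
    """Return the mean of the trees' numeric predictions."""
    return mean([tree(x) for tree in trees])

# Stand-ins for trained classification trees (thresholds are made up).
class_trees = [
    lambda x: "spam" if x > 3 else "ham",
    lambda x: "spam" if x > 5 else "ham",
    lambda x: "spam" if x > 4 else "ham",
]

# Stand-ins for trained regression trees (constant outputs are made up).
reg_trees = [lambda x: 2.0, lambda x: 3.0, lambda x: 4.0]

print(forest_classify(class_trees, 4.5))  # two of three trees vote "spam"
print(forest_regress(reg_trees, 4.5))     # mean of 2.0, 3.0, 4.0
```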

The Naïve Bayes classifier assumes that, within a given class, the presence of a particular feature is unrelated to the presence of any other feature.
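Because of this independence assumption, the score for a class is simply the class prior multiplied by one per-feature probability for each feature. The following is a minimal pure-Python sketch of that idea for presence/absence features. It is not the MLlib implementation; the add-one (Laplace) smoothing, the function names `train_nb` and `predict_nb`, and the toy samples are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(samples):
    """samples: list of (set of features present, class label)."""
    class_counts = Counter(label for _, label in samples)
    feature_counts = defaultdict(Counter)
    for features, label in samples:
        feature_counts[label].update(features)
    vocab = set().union(*(features for features, _ in samples))
    return class_counts, feature_counts, vocab

def predict_nb(model, features):
    """Pick the class with the highest log-score: log prior plus the sum of
    per-feature log-probabilities (a sum because features are assumed
    independent given the class)."""
    class_counts, feature_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        log_p = math.log(n / total)
        for f in vocab:
            # Assumed add-one smoothing for P(feature present | class).
            p = (feature_counts[label][f] + 1) / (n + 2)
            log_p += math.log(p if f in features else 1 - p)
        scores[label] = log_p
    return max(scores, key=scores.get)

# Hypothetical toy data: words present in a message, and its label.
samples = [
    ({"free", "win"}, "spam"),
    ({"free", "offer"}, "spam"),
    ({"meeting", "report"}, "ham"),
    ({"report", "lunch"}, "ham"),
]
model = train_nb(samples)
print(predict_nb(model, {"free", "win"}))
print(predict_nb(model, {"report"}))
```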

Naïve Bayes: