Big_Data_Programming_ICP_7_Module2

Machine Learning Library (MLlib)

  • MLlib is Spark’s scalable machine learning library, consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as underlying optimization primitives.

  • As organizations create more diverse and more user-focused data products and services, there is a growing need for machine learning, which can be used to develop personalizations, recommendations, and predictive insights. Traditionally, data scientists have been able to solve these problems using familiar and popular tools such as R and Python. But as organizations amass greater volumes and varieties of data, data scientists spend a majority of their time supporting the infrastructure instead of building the models to solve their data problems.

  • To help solve this problem, Spark provides a general machine learning library -- MLlib -- that is designed for simplicity, scalability, and easy integration with other tools. With the scalability, language compatibility, and speed of Spark, data scientists can solve and iterate through their data problems faster.

  • Installation requirements: PySpark 2.1.0 is used.

  • Challenges

1. Classification

DataSet: https://github.com/kusamdinesh/Big-Data-and-Hadoop/blob/master/ICP7_Module2/Source%20Code/ICP7/datasets/adult.data

  • Recommendations for classification algorithms:

Several attempts were made to use the income column (>50K vs. <=50K) as the label, but the performance was very low, which suggests the correlation between the features and those classes is not particularly good. On switching to age as the label for comparison, the results improved, and Random Forest performed better than the other algorithms. Even so, the overall performance is not great. Features thought to be better suited are: level of education and hours spent per week.

Naive Bayes

  • Accuracy: 0.0010357824733834052
  • Precision: 0.00749467071935157
  • Recall: 0.007497467071935157
  • F-measure: 0.007497467071935157

Decision Tree

  • Accuracy: 0.031097722914741607
  • Precision: 0.05311393459695897
  • Recall: 0.05311393459695897
  • F-measure: 0.05311393459695897

Random Forest

  • Accuracy: 0.026348217075756374
  • Precision: 0.04772117962466488
  • Recall: 0.04772117962466488
  • F-measure: 0.04772117962466488

Naïve Bayes:

In machine learning we are often interested in selecting the best hypothesis (h) given data (d). In a classification problem, our hypothesis (h) may be the class to assign to a new data instance (d). One of the easiest ways of selecting the most probable hypothesis is to use the data we have as prior knowledge about the problem. Bayes’ Theorem provides a way to calculate the probability of a hypothesis given that prior knowledge. It is stated as: P(h|d) = (P(d|h) * P(h)) / P(d), where P(h|d) is the probability of hypothesis h given the data d, P(d|h) is the probability of the data given the hypothesis, P(h) is the prior probability of h, and P(d) is the probability of the data.

Code:
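A minimal sketch of the Naive Bayes pipeline in PySpark 2.1.0. The relative file path, the positional column mapping (_c0 = age, _c4 = education-num, _c12 = hours-per-week, _c14 = income), and the choice of income as the label are assumptions, not a reproduction of the exact ICP code.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.appName("ICP7-NaiveBayes").getOrCreate()

# adult.data has no header row; select only the columns used here (assumed layout)
df = spark.read.csv("datasets/adult.data", inferSchema=True) \
    .selectExpr("_c0 as age", "_c4 as education_num",
                "_c12 as hours_per_week", "_c14 as income")

# Index the income strings (<=50K / >50K) into a numeric label
df = StringIndexer(inputCol="income", outputCol="label").fit(df).transform(df)

# Assemble the numeric features into a single vector column
assembler = VectorAssembler(
    inputCols=["age", "education_num", "hours_per_week"],
    outputCol="features")
data = assembler.transform(df).select("label", "features")

train, test = data.randomSplit([0.7, 0.3], seed=42)

# Multinomial NB requires non-negative features, which these all are
model = NaiveBayes(smoothing=1.0, modelType="multinomial").fit(train)
predictions = model.transform(test)

evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
print("Accuracy = %f" % evaluator.evaluate(predictions))
```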

Output:

Decision Tree

A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that only contains conditional control statements.

Decision trees are commonly used in operations research, specifically in decision analysis, to help identify a strategy most likely to reach a goal, but are also a popular tool in machine learning.

Code:
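A minimal sketch of a decision tree classifier on the same features, assuming the `train`/`test` DataFrames prepared in the Naive Bayes sketch above; maxDepth = 5 is an assumed tuning value, and the evaluator's weighted metrics correspond to the style of precision/recall/F-measure figures reported above.

```python
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Assumes `train` and `test` from the Naive Bayes sketch above
dt = DecisionTreeClassifier(labelCol="label", featuresCol="features", maxDepth=5)
model = dt.fit(train)
predictions = model.transform(test)

# Print the four metrics reported in this section
for metric in ["accuracy", "weightedPrecision", "weightedRecall", "f1"]:
    evaluator = MulticlassClassificationEvaluator(metricName=metric)
    print("%s = %f" % (metric, evaluator.evaluate(predictions)))
```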

Output:

Random Forest

Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Random decision forests correct for decision trees' habit of overfitting to their training set.

Code:
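A minimal sketch of a random forest classifier, again assuming the `train`/`test` DataFrames from the Naive Bayes sketch above; numTrees = 20 and the seed are assumed values.

```python
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Assumes `train` and `test` from the Naive Bayes sketch above
rf = RandomForestClassifier(labelCol="label", featuresCol="features",
                            numTrees=20, seed=42)
model = rf.fit(train)
predictions = model.transform(test)

evaluator = MulticlassClassificationEvaluator(metricName="f1")
print("F-measure = %f" % evaluator.evaluate(predictions))
```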

Output:

2. Clustering

DataSet: https://github.com/kusamdinesh/Big-Data-and-Hadoop/blob/master/ICP7_Module2/Source%20Code/ICP7/datasets/dataset_diabetes/diabetic_data.csv

Recommendations:

Features taken into consideration: to identify diabetes from the records available, the following seem to be the right set of features to use to train the model.

  • admission_type_id
  • discharge_disposition_id
  • admission_source_id
  • time_in_hospital
  • num_lab_procedures
  • Attempted with K = 3. Cluster centers:
    [1.88976688, 3.66805625, 5.01553426, 4.01750456, 40.90851585]
    [1.85876347, 3.79571305, 6.22673672, 5.57739499, 64.11690599]
    [2.62279958, 3.69630479, 6.74213335, 3.33256869, 13.24001396]
  • Attempted with K = 2. Cluster centers:
    [2.28821023, 3.58880208, 5.55565814, 3.50350379, 24.50454545]
    [1.83652522, 3.80564795, 5.89549105, 5.02929812, 56.2879918]

When K = 2 the cluster centers are clearly distinguishable; that is not the case when K = 3. So K = 2 seems to be ideal for the data provided.

k-means clustering

  • k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining.
  • k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.
  • k-Means minimizes within-cluster variances (squared Euclidean distances), but not regular Euclidean distances, which would be the more difficult Weber problem: the mean optimizes squared errors, whereas only the geometric median minimizes Euclidean distances.

Code:
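A minimal sketch of the k-means run in PySpark 2.1.0, using the five features recommended above; the relative file path and the seed are assumptions, and K = 2 is used per the comparison above.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("ICP7-KMeans").getOrCreate()

# diabetic_data.csv has a header row; the path is an assumption
df = spark.read.csv("datasets/dataset_diabetes/diabetic_data.csv",
                    header=True, inferSchema=True)

# The five recommended features, assembled into a single vector column
cols = ["admission_type_id", "discharge_disposition_id",
        "admission_source_id", "time_in_hospital", "num_lab_procedures"]
assembler = VectorAssembler(inputCols=cols, outputCol="features")
data = assembler.transform(df).select("features")

# K = 2 gave the most distinguishable centers in the runs above
model = KMeans(k=2, seed=1).fit(data)

print("Cluster Centers:")
for center in model.clusterCenters():
    print(center)
```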

Output:

3. Regression

Dataset: https://github.com/kusamdinesh/Big-Data-and-Hadoop/blob/master/ICP7_Module2/Source%20Code/ICP7/datasets/imports-85.data

  • Recommendations: the features selected for observing the model's prediction performance are:
  • length
  • width
  • height

Linear Regression

  • Coefficients: [-0.0001822746428449108, 0.0, -0.1686703625066946]
  • Intercept: 9.92766576579
  • RMSE: 1.076342
  • r2: 0.249292

Logistic Regression

  • Coefficients: [0.0, 0.0, 0.000100509510875788]
  • Intercept: 0.225315324107

Linear Regression:

In statistics, linear regression is a linear approach to modeling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables). The case of one explanatory variable is called simple linear regression.

Code:
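A minimal sketch of the linear regression in PySpark 2.1.0. The imports-85 file has no header row and uses "?" for missing values; the positional column mapping (length/width/height at _c10/_c11/_c12), the choice of price (_c25) as the label, and the regularization settings are assumptions, since the wiki records only the selected features.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("ICP7-LinearRegression").getOrCreate()

# No header row; casting price to double turns "?" into null, dropped below
df = spark.read.csv("datasets/imports-85.data", inferSchema=True) \
    .selectExpr("_c10 as length", "_c11 as width", "_c12 as height",
                "cast(_c25 as double) as label") \
    .na.drop()

assembler = VectorAssembler(inputCols=["length", "width", "height"],
                            outputCol="features")
data = assembler.transform(df).select("label", "features")

# Elastic-net regularization can zero out weak coefficients, as seen above
lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
model = lr.fit(data)

print("Coefficients:", model.coefficients)
print("Intercept:", model.intercept)
print("RMSE: %f" % model.summary.rootMeanSquaredError)
print("r2: %f" % model.summary.r2)
```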

Output:

Logistic Regression

Logistic regression is a statistical model that in its basic form uses a logistic function to model a binary dependent variable, although many more complex extensions exist. In regression analysis, logistic regression (or logit regression) estimates the parameters of a logistic model (a form of binary regression).

Code:
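A minimal sketch of the logistic regression on the same three features. Logistic regression needs a binary label; the target used here (whether price exceeds 10000) is hypothetical, chosen only to make the sketch runnable, since the wiki does not record which label was used.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("ICP7-LogisticRegression").getOrCreate()

df = spark.read.csv("datasets/imports-85.data", inferSchema=True) \
    .selectExpr("_c10 as length", "_c11 as width", "_c12 as height",
                "cast(_c25 as double) as price") \
    .na.drop()

# Hypothetical binary target derived from price (0.0 / 1.0)
df = df.selectExpr("length", "width", "height",
                   "cast(price > 10000 as double) as label")

assembler = VectorAssembler(inputCols=["length", "width", "height"],
                            outputCol="features")
data = assembler.transform(df).select("label", "features")

lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
model = lr.fit(data)

print("Coefficients:", model.coefficients)
print("Intercept:", model.intercept)
```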

Output: