Spark MLlib 14 - praveenpoluri/Big-Data-Programing GitHub Wiki
Aim:
To implement the classification algorithms Naive Bayes, Decision Tree, and Random Forest, along with K-Means clustering, Linear Regression, and Logistic Regression, on the given datasets.
Introduction:
Apache Spark and Python for Big Data and Machine Learning: Apache Spark is known as a fast, easy-to-use, general-purpose engine for big data processing, with built-in modules for streaming, SQL, machine learning (ML), and graph processing. Together with Python, it covers the full workflow from loading data to evaluating the machine learning models you build.
Models:
- Machine Learning Library (MLlib):
  - Summary statistics, correlations, stratified sampling
  - Linear models (SVMs, logistic regression, linear regression), decision trees, naive Bayes
  - Alternating least squares (ALS)
  - k-means
  - Singular value decomposition (SVD), principal component analysis (PCA)
  - Stochastic gradient descent, limited-memory BFGS (L-BFGS)
Functions: Spark's goal is to make practical machine learning scalable and easy. At a high level, MLlib provides tools such as ML algorithms (common learning algorithms for classification, regression, clustering, and collaborative filtering) and featurization (feature extraction, transformation, dimensionality reduction, and selection).
Components used:
- Spark
- Python
- Jupyter notebooks.
- MLlib Libraries.
Implementation:
Task-1: Classification Algorithms:
1. Naive Bayes:
- Imported the machine learning libraries for VectorAssembler and Naive Bayes, created a Spark session, and created a DataFrame using the Spark DataFrame read API.
- Created headers (a schema) for the DataFrame.
- Created labels for the DataFrame from column 10.
- Created a "features" vector from columns [1, 3, 5, 11, 12] using VectorAssembler.
- Split the data rows into training and test sets in a 60-40 ratio.
- Created a Naive Bayes model nb1, trained it on the training data, and used the test data to calculate accuracy; each row is classified as one of two classes (0 or 1), shown in the prediction column.
2. Decision tree:
- Created a decision tree model nb3, fit it on the training data, made predictions on the test data, and calculated accuracy from the predictions.
3. Random Forest:
- Created a random forest model with 100 trees, fit it on the training data, made predictions on the test data, and calculated accuracy from the predictions.
Task-2: Clustering
1. K-Means:
- Imported K-Means and the required MLlib libraries and created a Spark session. Created a DataFrame from the diabetes data CSV file with a selected set of columns, used VectorAssembler to build the feature vector, trained the K-Means model on the data, and computed the distance of each data point from its cluster center.
Task-3: Regression
1. Linear Regression:
- Imported SparkSession, LinearRegression, VectorAssembler, and the required machine learning libraries, and created a Spark session.
- Loaded the data into a DataFrame with selected columns using the DataFrame reader API.
- Created a "features" column with VectorAssembler and created a Linear Regression model with the number of iterations set to 10 and regParam set to 0.3.
- Fitted the model on the training data; printed the coefficients and intercept of the linear regression (i.e., m and c in y = mx + c); and summarized the number of iterations, objective history, residuals, and mean squared error.
2. Logistic Regression:
- Imported SparkSession, LogisticRegression, VectorAssembler, and the required machine learning libraries, and created a Spark session.
- Loaded the data into a DataFrame with selected columns using the DataFrame reader API.
- Created labels for the columns of the DataFrame.
- Created a "features" column with VectorAssembler and created a Logistic Regression model with the number of iterations set to 10 and regParam set to 0.3.
- Fitted the model on the training data, then calculated and printed the coefficients and intercept of the logistic regression model.
- Trained a multinomial model; calculated and printed the multinomial coefficients and intercepts; and calculated the area under the ROC curve.
Limitations:
- No File Management System.
- No Real-Time Data Processing.
- Expensive.
- Small Files Issue.
- Latency.
- Fewer built-in algorithms than some dedicated ML libraries.
Conclusion:
- Implemented classification, clustering, and regression algorithms on the given datasets.