Module 2: ICP #6 - SnehaMishra28/BigData_Programming_Summer2018 GitHub Wiki
Team: 12
Professor: Yugyung Lee
Name: Sneha Mishra
Class ID: 11
Email: [email protected]
MyGitHub
Technical Partner:
Name: Aditya Soman
Class ID: 19
Email: [email protected]
GitHub
Objective
Gain an understanding of Apache Spark MLlib, Spark's scalable machine learning library with APIs in Java, Scala, Python, and R, along with a basic understanding of clustering, classification, regression, and recommendation.
Features
Use of Algorithms such as:
- Naïve Bayes.
- Decision Tree.
- Random Tree.
- KMeans.
- Linear Regression.
- Logistic Regression.
- Collaborative Filtering: Alternating Least Squares.
Steps:
Part 1: Clustering
Part 2: Classification
This task covers three algorithms:
1. Naïve Bayes:
Naïve Bayes is a classification technique based on Bayes’ theorem. The classifier assumes that, within a class, the presence of a particular feature is unrelated to the presence of any other feature.
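To make the independence assumption concrete, here is a minimal plain-Python sketch of a categorical Naïve Bayes classifier. It is not the MLlib code used for this ICP: the function names `train_naive_bayes` and `predict`, the Laplace smoothing parameter `alpha`, and the toy weather data are all assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels, alpha=1.0):
    """Count class and per-feature value frequencies (hypothetical helper)."""
    class_counts = Counter(labels)
    # feature_counts[label][feature_index][value] -> number of occurrences
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    # values_seen[feature_index] -> distinct values observed for that feature
    values_seen = defaultdict(set)
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[label][i][value] += 1
            values_seen[i].add(value)
    return class_counts, feature_counts, values_seen, alpha


def predict(model, row):
    """Return the label with the highest log posterior for one row."""
    class_counts, feature_counts, values_seen, alpha = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        # Start from the log prior, log P(class).
        score = math.log(count / total)
        for i, value in enumerate(row):
            # Naive assumption: each feature contributes independently,
            # so log P(feature_i = value | class) terms are simply summed.
            numerator = feature_counts[label][i][value] + alpha
            denominator = count + alpha * len(values_seen[i])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data, invented for this example: (outlook, temperature) -> play?
rows = [
    ("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
    ("rainy", "cool"), ("overcast", "hot"), ("overcast", "cool"),
]
labels = ["no", "no", "yes", "yes", "yes", "yes"]

model = train_naive_bayes(rows, labels)
print(predict(model, ("rainy", "cool")))
print(predict(model, ("sunny", "hot")))
```

Because the per-feature likelihoods are multiplied (summed in log space) without modelling any interaction between features, training reduces to counting, which is part of why the method scales well to large datasets.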
Code for the Algorithm:
Output after running the Algorithm:
2. Decision Tree:
Code for the Algorithm:
Output after running the Algorithm:
3. Random Tree:
Code for the Algorithm:
Output after running the Algorithm:
Part 3: Regression
Part 4 (Bonus): Recommendation
References:
- http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
- https://stackoverflow.com/questions/17412439/how-to-split-data-into-trainset-and-testset-randomly
- https://spark.apache.org/docs/1.1.0/api/python/pyspark.mllib.linalg.SparseVector-class.html
Datasets provided are:
- For Clustering: https://archive.ics.uci.edu/ml/datasets/Acute+Inflammations and https://archive.ics.uci.edu/ml/datasets/Adult
- For Classification: https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008
- For Regression: https://archive.ics.uci.edu/ml/datasets/Automobile
- For Recommendation: https://www.kaggle.com/shivendra91/recommendation-als/data