Module 2: ICP #6 - SnehaMishra28/BigData_Programming_Summer2018 GitHub Wiki

Team: 12
Professor: Yugyung Lee

Name: Sneha Mishra
Class ID: 11
Email: [email protected]
MyGitHub

Technical Partner:
Name: Aditya Soman
Class ID: 19
Email: [email protected]
GitHub

Objective

Gain a working understanding of Apache Spark MLlib, Spark's scalable machine learning library with APIs in Java, Scala, Python, and R, along with a basic understanding of clustering, classification, regression, and recommendation.

Features

Use of Algorithms such as:

  • Naïve Bayes.
  • Decision Tree.
  • Random Tree.
  • KMeans.
  • Linear Regression.
  • Logistic Regression.
  • Collaborative Filtering: Alternating Least Squares.

Steps:

Part 1: Clustering

Part 2: Classification

This part works with three algorithms:

1. Naïve Bayes:

Naïve Bayes is a classification technique based on Bayes' theorem. The classifier assumes that, given the class, the presence of a particular feature is independent of the presence of any other feature.
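As a rough illustration of the independence assumption, the following plain-Python sketch scores each class as the log prior plus the sum of per-feature log likelihoods. It is not the Spark MLlib code used for this ICP: the function names `train_naive_bayes` and `predict`, the toy weather data, and the Laplace smoothing parameter `alpha` are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels, alpha=1.0):
    """Count class and per-feature value frequencies for categorical data."""
    class_counts = Counter(labels)
    # feature_counts[label][feature_index][value] -> number of occurrences
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    # Distinct values seen per feature, used in the smoothing denominator
    values = defaultdict(set)
    for row, label in zip(rows, labels):
        for i, v in enumerate(row):
            feature_counts[label][i][v] += 1
            values[i].add(v)
    return class_counts, feature_counts, values, alpha


def predict(model, row):
    """Return the class with the highest log posterior score."""
    class_counts, feature_counts, values, alpha = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        # Log prior of the class
        score = math.log(n / total)
        # Independence assumption: per-feature log likelihoods simply add up
        for i, v in enumerate(row):
            count = feature_counts[label][i][v]
            score += math.log((count + alpha) / (n + alpha * len(values[i])))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data: (outlook, temperature) -> play?
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
labels = ["no", "no", "yes", "yes"]

model = train_naive_bayes(rows, labels)
print(predict(model, ("rainy", "cool")))  # prints "yes"
```

Working in log space avoids numeric underflow when many feature probabilities are multiplied, and the smoothing term keeps a feature value that never appeared with a class from forcing that class's probability to zero.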

Code for the Algorithm:

Output after running the Algorithm:

2. Decision Tree:

Code for the Algorithm:

Output after running the Algorithm:

3. Random Tree:

Code for the Algorithm:

Output after running the Algorithm:

Part 3: Regression

Part 4 (Bonus): Recommendation

References:

  1. http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
  2. https://stackoverflow.com/questions/17412439/how-to-split-data-into-trainset-and-testset-randomly
  3. https://spark.apache.org/docs/1.1.0/api/python/pyspark.mllib.linalg.SparseVector-class.html

Datasets provided are:

  1. For Clustering:
    https://archive.ics.uci.edu/ml/datasets/Acute+Inflammations
    https://archive.ics.uci.edu/ml/datasets/Adult

  2. For Classification:
    https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008

  3. For Regression:
    https://archive.ics.uci.edu/ml/datasets/Automobile

  4. For Recommendation:
    https://www.kaggle.com/shivendra91/recommendation-als/data