Lab Assignment 1 - GeoSnipes/Big-Data GitHub Wiki

ICP Team Id: 5-2

Member 1 : Pranoop Mutha Class id: 15

Member 2 Name: Geovanni, West Class id: 23

We completed this assignment in both Python (by Geo) and Scala (by Pranoop), with equal contribution from both of us.

Objective:

  • Using Spark transformations and actions, find the users who have rated more than 25 items in a MovieLens data set consisting of 100,000 movie ratings by 943 users on 1,682 items.
  • Create a GitHub account.
  • Create a ZenHub account with 3 milestones and at least 5 issues, and show the analytics graph.

Spark Transformations and Actions:

We use the following transformations and actions in this lab assignment.

  • map(func) : Return a new distributed dataset formed by passing each element of the source through a function func.

  • filter(func) : Return a new dataset formed by selecting those elements of the source on which func returns true.

  • reduceByKey(func,[numTasks]) : When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.

  • sortByKey([ascending], [numTasks]) : When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.

  • saveAsTextFile(path) : Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.
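To make the chain of operations above concrete, here is a minimal plain-Python sketch that mimics map, reduceByKey, filter and sortByKey on a small in-memory sample. It does not use Spark and is not the code from this lab. The sample lines, the tab-separated layout (user, item, rating, timestamp), and the helper names `reduce_by_key` and `users_rating_more_than` are assumptions made for illustration.

```python
from operator import add

# Hypothetical sample in an assumed layout "user<TAB>item<TAB>rating<TAB>timestamp".
# The values are made up for illustration and are not taken from the lab's data.
SAMPLE_LINES = [
    "2\t10\t4\t881250949",
    "1\t11\t3\t891717742",
    "1\t12\t5\t878887116",
    "2\t13\t2\t880606923",
    "1\t14\t1\t886397596",
    "3\t15\t4\t884182806",
]


def reduce_by_key(pairs, func):
    """Plain-Python stand-in for reduceByKey: merge the values of each key with func."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return list(merged.items())


def users_rating_more_than(lines, threshold):
    # map: emit one (user_id, 1) pair per rating line
    pairs = [(int(line.split("\t")[0]), 1) for line in lines]
    # reduceByKey: add up the 1s to get the number of ratings per user
    counts = reduce_by_key(pairs, add)
    # filter: keep users with strictly more than `threshold` ratings
    active = [(user, n) for user, n in counts if n > threshold]
    # sortByKey: order the surviving pairs by user id, ascending
    return sorted(active, key=lambda kv: kv[0])


# Threshold lowered to 1 so the tiny sample produces output; the lab uses 25.
result = users_rating_more_than(SAMPLE_LINES, threshold=1)
print(result)  # [(1, 3), (2, 2)]
```

In PySpark, the same steps would roughly correspond to chaining `map`, `reduceByKey(add)`, `filter` and `sortByKey()` on an RDD and finishing with `saveAsTextFile`, with the input and output paths chosen by the user.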

Input Data

Scala Code performing Spark Transformations and Actions

Python Code to Transform and perform Actions

Output from Python

Output from Scala

GitHub Account

GitHub Remote Repository:

Cloning the repository to desktop:

Cloned folder in local:

LabAssignments and ICP folder in the main folder:

Lab 1 Folder:

Documentation and Source folders in Lab 1:

Screenshots in documentation folder:

Source code for Python and Scala in Source Code Folder:

ZenHub:

Issues:

ZenHub Board:

Milestones

Burndown Charts or Graphs:

Source Code Links:

  • Scala: https://github.com/GeoSnipes/Big-Data/tree/master/lab_assignments/Lab%201/src/Scala/CS5542-Lab1-SourceCode/Spark%20WordCount
  • Python: https://github.com/GeoSnipes/Big-Data/tree/master/lab_assignments/Lab%201/src/Python