Lab Assignment 1 - sirisha1206/Spark GitHub Wiki

Name: Naga Sirisha Sunkara

Class ID: 21

Team ID: 5

Technical partner details:

Name: Vinay Santhosham

Class ID: 17

GitHub link: sourcecode

YouTube link: video

Objective:

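Task 1: Implement a MapReduce algorithm for the Facebook common friends problem and run the MapReduce job on Apache Hadoop.

Task 2: Implement a solution using NoSQL databases (Cassandra and HBase) and compare the two solutions.

Task 1: MapReduce algorithm for the Facebook common friends problem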

Algorithm:

Each person and that person's list of friends are stored in a file, one person per line, in the format Person -> List of Friends.

Input:

A -> B C D

B -> A C D E

C -> A B D E

D -> A B C E

E -> B C D

The file with this input is passed to the mapper, which receives one line at a time. For every friend on a line the mapper emits one key-value pair. The key is the person paired with that friend, with the two names in sorted order so that both people produce the same key. The value is the person's full list of friends.

A -> B C D:

(A B) (B C D)

(A C) (B C D)

(A D) (B C D)

B -> A C D E:

(A B) (A C D E)

(B C) (A C D E)

(B D) (A C D E)

(B E) (A C D E)

C -> A B D E:

(A C) (A B D E)

(B C) (A B D E)

(C D) (A B D E)

(C E) (A B D E)

D -> A B C E:

(A D) (A B C E)

(B D) (A B C E)

(C D) (A B C E)

(D E) (A B C E)

E -> B C D:

(B E) (B C D)

(C E) (B C D)

(D E) (B C D)
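The mapping step above can be sketched as a small Python function. This is a minimal illustration rather than the code from the screenshots: the function name `map_line` and the tuple-based record layout are assumptions made for the example.

```python
def map_line(line):
    """Turn one 'Person -> friend friend ...' line into (pair, friends) records."""
    person, _, rest = line.partition("->")
    person = person.strip()
    friends = rest.split()
    records = []
    for friend in friends:
        # Sort the two names so that (A, B) and (B, A) end up under the same key.
        key = tuple(sorted((person, friend)))
        records.append((key, friends))
    return records


for key, value in map_line("B -> A C D E"):
    print(key, value)
```

For the line `B -> A C D E` this yields the keys (A B), (B C), (B D) and (B E), each carrying B's full friend list.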

All these key-value pairs are the input to the reducer. For each key, the reducer outputs the intersection of the friend lists that share that key. The output of the reducer will be:

A B --> (B C D) & (A C D E) --> (C D)

A C --> (B C D) & (A B D E) --> (B D)

A D --> (B C D) & (A B C E) --> (B C)

B C --> (A C D E) & (A B D E) --> (A D E)

B D --> (A C D E) & (A B C E) --> (A C E)

B E --> (A C D E) & (B C D) --> (C D)

C D --> (A B D E) & (A B C E) --> (A B E)

C E --> (A B D E) & (B C D) --> (B D)

D E --> (A B C E) & (B C D) --> (B C)

Map Reduce Diagram:

We wrote the MapReduce code in Python. The following are screenshots of the code and its output.

InputFile:

Mapper Code:

Reducer Code:

Output:

Commands to be used:

To push the input text file to HDFS:

hdfs dfs -copyFromLocal lab1input.txt /lab1input.txt

Checking whether the input file has been pushed to HDFS:

hdfs dfs -ls /

Command to run the Python MapReduce job with Hadoop Streaming:

hadoop jar /usr/local/hadoop-2.8.1/share/hadoop/tools/lib/hadoop-streaming-2.8.1.jar -input /lab1input.txt -output /lab1output -mapper /home/hdsirisha/Desktop/lab1/mapper.py -reducer /home/hdsirisha/Desktop/lab1/reducer.py

Checking the output directory:

hdfs dfs -ls /lab1output

View the output of the map reduce job:

hdfs dfs -cat /lab1output/part-00000
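Before submitting the job, the whole pipeline can be dry-run locally. Hadoop Streaming feeds each script text lines on standard input, expects tab-separated key/value lines on standard output, and sorts the mapper output by key before the reducer sees it. The sketch below imitates that contract in a single Python 3 file. The function names `mapper` and `reducer` and the exact output layout are assumptions, not the contents of the actual mapper.py and reducer.py.

```python
from itertools import groupby

INPUT = """A -> B C D
B -> A C D E
C -> A B D E
D -> A B C E
E -> B C D"""


def mapper(lines):
    """Emit 'person1 person2<TAB>friend list' for every friendship on a line."""
    for line in lines:
        person, _, rest = line.partition("->")
        friends = rest.split()
        for friend in friends:
            pair = " ".join(sorted((person.strip(), friend)))
            yield f"{pair}\t{' '.join(friends)}"


def reducer(sorted_lines):
    """Intersect the friend lists of consecutive lines that share a key."""
    split = (line.split("\t", 1) for line in sorted_lines)
    for pair, group in groupby(split, key=lambda kv: kv[0]):
        common = set.intersection(*(set(value.split()) for _, value in group))
        yield f"{pair}\t{' '.join(sorted(common))}"


# sorted() stands in for the shuffle-and-sort step Hadoop runs between the stages.
results = list(reducer(sorted(mapper(INPUT.splitlines()))))
print("\n".join(results))
```

The printed lines should agree with the nine reducer rows worked out by hand in the algorithm section, which makes this a quick check before comparing against part-00000.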

Task 2: Implement a solution using NoSQL databases, i.e. Cassandra and HBase, and compare the two solutions.

Use case selected: Coursera

i) Cassandra

Step 1: Creation of the table with columns course_id ( primary key ), course_desc, course_name, duration, enrollments, languages, level, rating in Cassandra.

Step 2: Data inserted into the created table in Cassandra

Step 3: Queries on the inserted data

Query 1: Fetching of data from the table whose course_id is 1, 4, 7 or 3

Query 2: Filtering of data based on rating column

Query 3: Fetching of specific columns based on level selected as beginner.
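The three Cassandra steps could be driven from Python roughly as follows. This is a sketch under several assumptions: the table name `courses`, the column types, the rating threshold, and the columns selected in Query 3 are all placeholders, since the original statements are only available as screenshots. The `session` argument is expected to behave like a session from the DataStax `cassandra-driver` package.

```python
# Step 1: table definition; column names come from the text, types are assumed.
CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS courses (
    course_id int PRIMARY KEY,
    course_desc text,
    course_name text,
    duration text,
    enrollments int,
    languages text,
    level text,
    rating float
)
"""

# Step 2: parameterised insert (%s placeholders are the driver's style).
INSERT_COURSE = (
    "INSERT INTO courses (course_id, course_desc, course_name, duration, "
    "enrollments, languages, level, rating) "
    "VALUES (%s, %s, %s, %s, %s, %s, %s, %s)"
)

# Step 3: the three queries. Filtering on a non-key column needs
# ALLOW FILTERING (or a secondary index); 4.5 is an arbitrary threshold.
QUERIES = [
    "SELECT * FROM courses WHERE course_id IN (1, 4, 7, 3)",
    "SELECT * FROM courses WHERE rating >= 4.5 ALLOW FILTERING",
    "SELECT course_name, duration FROM courses "
    "WHERE level = 'beginner' ALLOW FILTERING",
]


def run_cassandra_steps(session, sample_row):
    """Create the table, insert one row, and return the rows of each query."""
    session.execute(CREATE_TABLE)
    session.execute(INSERT_COURSE, sample_row)
    return [list(session.execute(query)) for query in QUERIES]


# Obtaining a real session (requires a running cluster and an existing keyspace):
#   from cassandra.cluster import Cluster
#   session = Cluster(["127.0.0.1"]).connect("your_keyspace")
```

Query 1 restricts on the primary key, so it runs without extra options. Queries 2 and 3 restrict on regular columns, which is why they carry ALLOW FILTERING in this sketch.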

ii) HBase

Step i: Creation of the table in HBase

Step ii: Insertion of data into the created table

Step iii: Viewing of the data from the table

Step iv: Queries
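The steps above were carried out in the HBase shell. For comparison with the Cassandra side, the same sequence might look like this from Python using the `happybase` client, which is a swapped-in technique and not what the screenshots show. The table name `courses`, the column family `info`, and the row contents are hypothetical placeholders.

```python
def run_hbase_steps(connection, table_name="courses"):
    """Create a table, insert a row, view the data, and run one simple query."""
    # Step i: create the table with a single column family (assumed name 'info').
    connection.create_table(table_name, {"info": dict()})
    table = connection.table(table_name)

    # Step ii: insert one row; row keys, qualifiers and values are bytes.
    table.put(
        b"1",
        {b"info:course_name": b"placeholder name", b"info:level": b"beginner"},
    )

    # Step iii: view everything in the table as (row_key, columns) pairs.
    all_rows = list(table.scan())

    # Step iv: example query, fetching one column of one row.
    name_only = table.row(b"1", columns=[b"info:course_name"])
    return all_rows, name_only


# Obtaining a real connection (requires the HBase Thrift server to be running):
#   import happybase
#   connection = happybase.Connection("localhost")
```

Unlike the Cassandra table, nothing here declares the individual columns up front: only the column family is fixed at creation time, and qualifiers such as `info:level` appear when a row is written.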

General HBase shell commands

Table Management Commands

Data Manipulation Commands