Lab Assignment 1 - sirisha1206/Spark GitHub Wiki
Name: Naga Sirisha Sunkara
Class ID: 21
Team ID: 5
Technical partner's details:
Name: Vinay Santhosham
Class ID: 17
GitHub Link: sourcecode
YouTube link: video
Objective:
Task 1: Implement the MapReduce algorithm for the Facebook common friends problem and run the MapReduce job on Apache Hadoop.
Task 2: Implement a solution using NoSQL databases (Cassandra and HBase) and compare the two solutions.
Task 1: MapReduce algorithm for the Facebook common friends problem
Algorithm:
Each person and their list of friends are stored in a file in the Person -> List of Friends format.
Input:
A -> B C D
B -> A C D E
C -> A B D E
D -> A B C E
E -> B C D
The file with this input is passed to the mapper, and each line of the file is one input record. The mapper's output is a key-value pair: the key is the alphabetically sorted pair of the person and one of their friends, and the value is that person's full friend list. Sorting the pair ensures that both persons in a friendship produce the same key.
A -> B C D:
(A B) (B C D)
(A C) (B C D)
(A D) (B C D)
B -> A C D E:
(A B) (A C D E)
(B C) (A C D E)
(B D) (A C D E)
(B E) (A C D E)
C -> A B D E:
(A C) (A B D E)
(B C) (A B D E)
(C D) (A B D E)
(C E) (A B D E)
D -> A B C E:
(A D) (A B C E)
(B D) (A B C E)
(C D) (A B C E)
(D E) (A B C E)
E -> B C D:
(B E) (B C D)
(C E) (B C D)
(D E) (B C D)
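The mapper step above can be sketched in Python as follows. This is a minimal illustration of the pair-emitting logic, not the exact code from the screenshots; the `Person -> Friends` line format is assumed as shown in the input.

```python
import sys


def mapper(lines):
    """For each 'person -> friends' line, emit one (sorted pair, friend list)
    record per friend, so both ends of a friendship produce the same key."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        person, _, friends_str = line.partition("->")
        person = person.strip()
        friends = friends_str.split()
        for friend in friends:
            pair = " ".join(sorted((person, friend)))
            yield pair, friends


if __name__ == "__main__":  # Hadoop Streaming feeds input lines on stdin
    for pair, friends in mapper(sys.stdin):
        print("%s\t%s" % (pair, " ".join(friends)))
```

For the line `A -> B C D`, this emits the keys `A B`, `A C`, and `A D`, each with the value `B C D`, matching the listing above.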
All these key-value pairs are the input to the reducer. For each key, the reducer outputs the intersection of the value lists sharing that key, i.e. the common friends of the pair. The output of the reducer will be:
A B --> (B C D) & (A C D E) --> (C D)
A C --> (B C D) & (A B D E) --> (B D)
A D --> (B C D) & (A B C E) --> (B C)
B C --> (A C D E) & (A B D E) --> (A D E)
B D --> (A C D E) & (A B C E) --> (A C E)
B E --> (A C D E) & (B C D) --> (C D)
C D --> (A B D E) & (A B C E) --> (A B E)
C E --> (A B D E) & (B C D) --> (B D)
D E --> (A B C E) & (B C D) --> (B C)
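The reducer step above can be sketched in Python like this. It is a simplified illustration, assuming tab-separated `pair<TAB>friend list` records as the mapper's output format; a real streaming reducer could process keys incrementally since Hadoop sorts them, but a dictionary keeps the sketch short.

```python
import sys


def reducer(lines):
    """Collect the friend lists seen for each pair key, then intersect them.
    A pair of actual friends appears exactly twice, once per person's line."""
    groups = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        pair, _, friends_str = line.partition("\t")
        groups.setdefault(pair, []).append(set(friends_str.split()))
    for pair, friend_sets in sorted(groups.items()):
        if len(friend_sets) == 2:  # keep only genuine two-sided friendships
            common = sorted(set.intersection(*friend_sets))
            yield pair, common


if __name__ == "__main__":  # Hadoop Streaming pipes the sorted mapper output in
    for pair, common in reducer(sys.stdin):
        print("%s\t%s" % (pair, " ".join(common)))
```

For the key `A B`, the two lists `B C D` and `A C D E` intersect to `C D`, matching the first line of the reducer output above.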
Map Reduce Diagram:
We wrote the MapReduce code in Python. The following are screenshots of the code and output.
Input File:
Mapper Code:
Reducer Code:
Output:
Commands to be used:
To push the input text file to HDFS:
hdfs dfs -copyFromLocal lab1input.txt /lab1input.txt
Checking whether the input file was pushed to HDFS:
hdfs dfs -ls /
Command to run the Python MapReduce job via Hadoop Streaming:
hadoop jar /usr/local/hadoop-2.8.1/share/hadoop/tools/lib/hadoop-streaming-2.8.1.jar -input /lab1input.txt -output /lab1output -mapper /home/hdsirisha/Desktop/lab1/mapper.py -reducer /home/hdsirisha/Desktop/lab1/reducer.py
Checking the output directory:
hdfs dfs -ls /lab1output
View the output of the map reduce job:
hdfs dfs -cat /lab1output/part-00000
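Before submitting to the cluster, the streaming job can also be sanity-checked locally with a Unix pipe, where `sort` stands in for Hadoop's shuffle phase (this assumes mapper.py and reducer.py are in the current directory; it is a quick check, not part of the graded commands):
cat lab1input.txt | python mapper.py | sort | python reducer.py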
Task 2: Implement a solution using NoSQL databases, i.e., Cassandra and HBase, and compare the two solutions.
Use Case Selected : Coursera
i) Cassandra
Step 1: Create the table in Cassandra with columns course_id (primary key), course_desc, course_name, duration, enrollments, languages, level, and rating.
Step 2: Insert data into the created table.
Step 3: Run queries on the inserted data.
Query 1: Fetch rows from the table whose course_id is 1, 4, 7, or 3.
Query 2: Filter rows based on the rating column.
Query 3: Fetch specific columns where the level is beginner.
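The Cassandra steps above could look roughly like the following CQL. This is a sketch under stated assumptions: the table name `coursera` and the literal values are illustrative, not the exact statements from the screenshots. Note that filtering on non-key columns such as `rating` or `level` requires `ALLOW FILTERING` in CQL.

```sql
-- Step 1: create the table (course_id is the primary key)
CREATE TABLE coursera (
  course_id   int PRIMARY KEY,
  course_desc text,
  course_name text,
  duration    text,
  enrollments int,
  languages   text,
  level       text,
  rating      float
);

-- Step 2: insert a sample row (values are illustrative)
INSERT INTO coursera (course_id, course_name, level, rating)
VALUES (1, 'Machine Learning', 'beginner', 4.8);

-- Query 1: fetch rows by a set of primary-key values
SELECT * FROM coursera WHERE course_id IN (1, 4, 7, 3);

-- Query 2: filter on the non-key rating column
SELECT * FROM coursera WHERE rating > 4.5 ALLOW FILTERING;

-- Query 3: specific columns for beginner-level courses
SELECT course_name, duration FROM coursera
WHERE level = 'beginner' ALLOW FILTERING;
```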
ii) HBase
Step i: Create the table in HBase.
Step ii: Insert data into the created table.
Step iii: View the data from the table.
Step iv: Queries
General HBase shell commands
Table Management Commands
Data Manipulation Commands
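As an illustration of the command categories above, a corresponding HBase shell session might look like the following. The table name `coursera`, the column family `details`, and the sample values are assumptions; HBase stores cells under column families rather than fixed columns, which is one of the practical differences from Cassandra in this comparison.

```
# General shell commands
status
version

# Table management commands
create 'coursera', 'details'
list
describe 'coursera'

# Data manipulation commands
put 'coursera', '1', 'details:course_name', 'Machine Learning'
put 'coursera', '1', 'details:level', 'beginner'
get 'coursera', '1'
scan 'coursera'
```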