Module 1: LAB #1
OBJECTIVE:
- To understand and effectively implement MapReduce functionality for finding common Facebook friends.
- Implement MapReduce functionality and extract the required results from a YouTube dataset.
- Hive - Store the input data from Kaggle into tables using Hive commands and execute various queries on the dataset.
- Solr - Upload the document into Solr and explore the various ways of querying the data.
PART-1:
MapReduce functionality applied to find common Facebook friends.
PROBLEM STATEMENT: Every Facebook user has a list of friends. When a person visits someone's profile, Facebook shows the list of friends the visitor has in common with the profile being visited. Our goal is to find this list of common friends between two people. MapReduce can be used so that everyone's common friends are calculated once a day and the results stored for quick lookup.
Explanation:
Consider the following users: A, B, C, D, E
They have friend lists like this:
A -> B C D
B -> A C D E
C -> A B D E
D -> A B C E
E -> B C D
Using MapReduce, we have to figure out the mutual friends so that the output will be:
[A, B] --> [C, D]
[A, C] --> [B, D]
[A, D] --> [B, C]
[A, E] --> [B, C, D]
[B, C] --> [A, D, E]
[B, D] --> [A, C, E]
[B, E] --> [C, D]
[C, D] --> [A, B, E]
[C, E] --> [B, D]
[D, E] --> [B, C]
ALGORITHM:
The MapReduce flow for the problem statement is as follows:
1) The input data lists the friends of each user (e.g., A -> B, C, D). From this input, node-to-node connections such as (AB, AC, AD) can be derived.
2) The mapper reads the text data file row by row, and the input is split in the splitter phase.
3) The friend lists are sent from the mapper to the reducer. There, each node and its list of connected nodes are added into a HashMap<String, List<String>>, e.g., A -> B,C,D; B -> A,C,D,E; C -> A,B,D,E; and so on.
4) In the reduce phase, each node is checked against every other node to find the common nodes between them, using the java.util retainAll function on a HashSet, e.g., A -> B,C,D and B -> A,C,D,E give (A,B) -> (C,D).
5) The list of common friends for each pair of nodes is shuffled, sorted, and pushed to the context, from where it is written to the output file.
The MapReduce Java program CommonFriends.java is as follows.
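A minimal sketch of how such a job might look, following the single-HashMap-and-retainAll approach described above (this is not the original CommonFriends.java; the single-reducer routing and all class and variable names are illustrative):

```java
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CommonFriends {

    // Mapper: forwards each "person friend1 friend2 ..." line under a single
    // constant key so one reducer sees every adjacency list.
    public static class FriendsMapper extends Mapper<Object, Text, Text, Text> {
        private static final Text ALL = new Text("all");

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(ALL, value);
        }
    }

    // Reducer: builds the person -> friend-set HashMap, then intersects every
    // pair of friend sets with retainAll() and emits the common friends.
    public static class FriendsReducer extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            Map<String, Set<String>> friends = new TreeMap<>();
            for (Text line : values) {
                String[] tokens = line.toString().trim().split("\\s+");
                friends.put(tokens[0], new HashSet<>(
                        Arrays.asList(tokens).subList(1, tokens.length)));
            }
            List<String> people = new ArrayList<>(friends.keySet());
            for (int i = 0; i < people.size(); i++) {
                for (int j = i + 1; j < people.size(); j++) {
                    // Intersection of the two friend sets.
                    Set<String> common = new TreeSet<>(friends.get(people.get(i)));
                    common.retainAll(friends.get(people.get(j)));
                    context.write(
                            new Text("[" + people.get(i) + ", " + people.get(j) + "]"),
                            new Text(common.toString()));
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "common friends");
        job.setJarByClass(CommonFriends.class);
        job.setMapperClass(FriendsMapper.class);
        job.setReducerClass(FriendsReducer.class);
        job.setNumReduceTasks(1); // all adjacency lists must meet in one reducer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Routing every adjacency list to one reducer keeps the whole friend map in a single HashMap, as in step 3 above; for large graphs, the classic formulation that keys on a sorted friend pair would distribute the work better.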
The input file Friends.txt is pushed into HDFS using the following commands.
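A typical command sequence for this step (the HDFS paths are illustrative, not the originals):

```sh
hadoop fs -mkdir -p /user/cloudera/lab1/input
hadoop fs -put Friends.txt /user/cloudera/lab1/input
```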
A Maven clean install is performed in the project directory. After a successful install, the jar file is created in the target directory.
This jar file is run on the input file to get the output. The format of the command is: hadoop jar <jar file path> <class name to run> <input file path> <output file path>
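For example (the jar name and paths here are placeholders, not the originals):

```sh
hadoop jar target/bigdata-lab1.jar CommonFriends /user/cloudera/lab1/input /user/cloudera/lab1/output
```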
The output directory can be seen in the HDFS file system. For each pair of friends, the common friends are listed in Output.txt.
PART-2:
PROBLEM STATEMENT: MapReduce functionality applied to analyze a YouTube dataset.
The Java code **YoutubeVideos.java** and the image below show the code for finding the top 5 categories with the maximum number of videos uploaded.
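A minimal sketch of such a category-count job (not the original YoutubeVideos.java; the tab delimiter and the category's column position are assumptions based on the column layout described below):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class YoutubeVideos {

    // Mapper: emits (category, 1) for every video row; assumes the cleaned
    // file is tab-separated with the category in the second column.
    public static class CategoryMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] cols = value.toString().split("\t");
            if (cols.length > 1) {
                context.write(new Text(cols[1].trim()), ONE);
            }
        }
    }

    // Reducer (also used as combiner): sums the counts per category.
    public static class CountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "youtube categories");
        job.setJarByClass(YoutubeVideos.class);
        job.setMapperClass(CategoryMapper.class);
        job.setCombinerClass(CountReducer.class);
        job.setReducerClass(CountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```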
A Maven clean install is performed in the project directory. After a successful install, the jar file is created in the target directory.
This jar file is run on the input file to get the output. The format of the command is: **hadoop jar <jar file path> <class name to run> <input file path> <output file path>**
The input data file Sample_Youtube_DataSet.txt is as follows. The data file has been cleaned: a few columns that were too large and of no use, such as related video IDs, were removed, and the remaining columns were rearranged as video ID, category, rating, and so on.
The HDFS output file is generated after the jar has been run on the input file. It shows every category along with its video count.
The data from the output file is sorted with the sort command and the first 5 records are printed; this is the required output for the question.
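A typical pipeline for this step, assuming Hadoop's default part-r-00000 output file with the count in the second column (the output directory name is illustrative):

```sh
hadoop fs -cat lab1_youtube_out/part-r-00000 | sort -k2 -n -r | head -5
```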
The Java file for finding the top 10 rated videos on YouTube is YoutubeVideosRatings.java. The jar is run after the Maven install.
The HDFS output file shows the aggregate ratings for all the videos in the dataset.
The output is sorted and the top 10 records are printed as follows: YouTubeTop10Ratings
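The same kind of pipeline works here, assuming the aggregate rating is the second column of the output file (directory name illustrative):

```sh
hadoop fs -cat lab1_ratings_out/part-r-00000 | sort -k2 -n -r | head -10
```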
PART-3:
HIVE
We log in to Hive and create a database, under which tables can be created. The USE <database> command tells Hive which database the subsequent statements should run against.
Create the table Zomato with all the columns from the input data file. Complex data types are used: a struct for Cuisines, array<string> for locality, and map<string,string> for currency. The usage and querying of these complex fields are shown below. The delimiter '$' in the last rows of the table creation says that collections such as arrays and structs have their items separated by '$', and MAP KEYS TERMINATED BY ':' says that map keys and values are separated by ':'.
The data file is then loaded into the table.
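A sketch of the DDL and load statements matching this description (only the Cuisines/locality/currency types and the '$' and ':' delimiters come from the text above; the database name, other column names, and the local path are illustrative):

```sql
CREATE DATABASE IF NOT EXISTS zomato_db;
USE zomato_db;

-- COLLECTION ITEMS TERMINATED BY '$' splits array/struct items on '$';
-- MAP KEYS TERMINATED BY ':' splits map key-value pairs on ':'.
CREATE TABLE zomato (
    restaurant_id    INT,
    name             STRING,
    cuisines         STRUCT<main_cuisine: STRING, sub_cuisine: STRING>,
    locality         ARRAY<STRING>,
    currency         MAP<STRING, STRING>,
    aggregate_rating FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '$'
MAP KEYS TERMINATED BY ':';

LOAD DATA LOCAL INPATH '/home/cloudera/zomato.csv' INTO TABLE zomato;
```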
The input zomato.csv file is as follows.
After the data has been loaded into the table, the following queries can be applied.
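For instance, the complex fields can be addressed like this (column names and the map key are illustrative, taken from the sketch above):

```sql
-- Dot access for struct fields, [index] for arrays, ['key'] for maps.
SELECT name, cuisines.main_cuisine, locality[0], currency['symbol']
FROM zomato
LIMIT 10;
```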
Query1:
Query2:
Query3:
Query4:
Query5:
Query6:
Query7:
Query8:
Query9:
Query10:
Query11:
Query12:
Solr
The document is indexed using Solr and then queried.
A new Solr instance directory is created, and a collection named Zomato is created. Into that collection we upload a document, which is indexed by Solr's Lucene engine.
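The collection can be created with Solr's standard command-line tool, e.g.:

```sh
bin/solr create -c Zomato
```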
The schema.xml shows the various field ids and the unique key that the document contains.
The data is uploaded using the update request handler.
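One common way to send data to the update request handler (the file name is illustrative and the port is Solr's default):

```sh
curl 'http://localhost:8983/solr/Zomato/update?commit=true' \
     -H 'Content-Type: application/csv' \
     --data-binary @zomato.csv
```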
Once the document's field ids and the schema.xml file are compatible, the data gets uploaded and indexed; the success message confirms this.
The following queries are executed using the select request handler in the query window.
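A select query has the following general shape (parameters illustrative); the query window in the Admin UI fills in q, fq, sort, and similar parameters:

```
http://localhost:8983/solr/Zomato/select?q=*:*&rows=10
```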
Query1:
Query2:
Query3:
Query4:
Query5:
Query6:
Query7:
Query8:
Query9:
Query10:
Query11:
Query12: