Lab 1
CS5590 - Big Data Programming
Lab assignment 1
Gulnoza Khakimova - 13
Yujing Wu - 28
Han Zhou - 30
Video Link:
Video
Source Code:
Task 1:
Code
Output
Task 2:
Code 1
Code 2
Task 3:
Code
Introduction
MapReduce is a framework for performing operations on large datasets. The data is divided into several parts and processed in parallel, which speeds up the computation. The framework has two phases: a Map function runs in parallel on chunks of the data and emits key/value pairs, and its output is sorted and fed to a Reduce function that performs the requested computation (e.g., Word Count). Hive is software for running SQL-like queries over large datasets located on distributed storage, which helps to select and join data. Solr is software for searching and analyzing large datasets.
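As a minimal, self-contained illustration of these two phases, here is the canonical Hadoop Word Count in Java (this is the textbook example, not code from our lab tasks):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: runs in parallel on chunks of the input and emits (word, 1).
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: receives all counts emitted for one word and sums them.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```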
Objectives
There are three tasks in Lab assignment #1, each to be completed using different techniques and algorithms: find common friends using the MapReduce algorithm; analyze a YouTube dataset using MapReduce and run several queries; and perform 10 queries on dataset B of the Hive use case.
Approaches/Methods
Different approaches were taken in order to complete given tasks.
Task 1
Facebook has a system that shows common friends when you visit an account. For this task we had to implement a common-friend finder algorithm using the MapReduce framework in Hadoop.
Implementation
We created a directory in HDFS using Hadoop commands and added the input file (the dataset) to the newly created folder.
Using the Eclipse IDE we created a common-friends project and implemented the Map function inside the CommonFriend class. The Map function reads the input file and creates a tokenizer that loops through our dataset. It maps the data into three arrays (lineArray, friendArray, and tempArray), storing the main person together with the list of his friends; the mapping continues for all profiles.
We also created a Reduce function inside the CommonFriend class. Inside the reducer we loop through the mapped data and find common friends by comparing the values inside each list; if the values are equal, we create a new list and put the common friend's name inside it.
The main function accepts arguments with the paths to the input and output files, and creates a job that calls the map and reduce functions.
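Below is a minimal sketch of this structure. It follows the pair-key formulation from the Krenzel reference rather than reproducing our exact three-array code, and the names FriendMapper and FriendReducer are illustrative:

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CommonFriend {

  public static class FriendMapper extends Mapper<Object, Text, Text, Text> {
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // Each input line: <person>\t<friend1,friend2,...>
      String[] line = value.toString().split("\t");
      if (line.length < 2) return;
      String person = line[0];
      String[] friends = line[1].split(",");
      // Emit the sorted (person, friend) pair as the key with the full
      // friend list as the value, so both members of a pair send their
      // lists to the same reducer key.
      for (String friend : friends) {
        String pair = (person.compareTo(friend) < 0)
            ? person + "," + friend : friend + "," + person;
        context.write(new Text(pair), new Text(line[1]));
      }
    }
  }

  public static class FriendReducer extends Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      // Intersect the friend lists that arrive for each pair.
      Set<String> common = null;
      for (Text value : values) {
        Set<String> friends = new HashSet<>(Arrays.asList(value.toString().split(",")));
        if (common == null) {
          common = friends;
        } else {
          common.retainAll(friends);
        }
      }
      if (common != null && !common.isEmpty()) {
        context.write(key, new Text(String.join(",", common)));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "common friends");
    job.setJarByClass(CommonFriend.class);
    job.setMapperClass(FriendMapper.class);
    job.setReducerClass(FriendReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Emitting the sorted pair from both profiles' lines guarantees that exactly two friend lists arrive at the reducer for each pair, and their intersection is the set of common friends.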
After successfully building the class, we exported the project as a .jar file so that it could be run with Hadoop commands.
The command below shows how to run the .jar project with Hadoop. As we can see, the job starts to run: it first executes the Map function, followed by the Reduce function.
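The jar name and HDFS paths below are illustrative, not the exact ones from our run; a typical sequence on a Cloudera-style setup looks like this:

```sh
# create an HDFS directory and upload the dataset
hdfs dfs -mkdir -p /user/cloudera/lab1/input
hdfs dfs -put soc-LiveJournal.txt /user/cloudera/lab1/input
# run the exported jar; Hadoop executes the Map phase, then the Reduce phase
hadoop jar CommonFriend.jar CommonFriend /user/cloudera/lab1/input /user/cloudera/lab1/output
# display the result
hdfs dfs -cat /user/cloudera/lab1/output/part-r-00000
```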
After the code runs successfully, we can see that it produces an output file containing the list of common friends.
Now we can display what is inside the output file.
Below is the output file: on each line, the first two numbers represent two Facebook profiles and the rest are their common friends.
Task 2
In this task we use a YouTube dataset. Using this dataset, we perform some analysis, applying MapReduce methods to handle the data and sort the output. There are two problems we solve here.
Dataset description:
Column1: Video id of 11 characters.
Column2: uploader of the video.
Column3: Interval between the day of establishment of YouTube and the date of uploading of the video.
Column4: Category of the video.
Column5: Length of the video.
Column6: Number of views for the video.
Column7: Rating on the video.
Column8: Number of ratings given for the video.
Column9: Number of comments done on the videos.
Column10: Related video ids with the uploaded video.
First problem statement:
Find the top 5 categories with the maximum number of videos uploaded.
Solution steps:
1: Download the .txt file of YouTube data.
2: Write the MapReduce project in Eclipse. From the problem statement, we can see that we have to group the data by the category of the video (column 4) and output each category together with its number of occurrences.
3: We write the map function to read the dataset. We declare a String[] array to store the fields of each line; to read each line, we first split it on '\t' and then take the fourth field (the category) as the map output key (see the code sketch after these steps).
4: After the map output is written, the shuffle groups the values by category and we aggregate them in the reduce function. We override the reduce method, which runs once for every key. We declare a variable to store the sum of all the values for each key and define a loop to calculate the sum.
5: Then we finish the job configuration code.
6: As the last step in Eclipse, we package the project and export it as a .jar file.
7: Then we execute the .jar file with Hadoop. First we create an input directory and put the data into HDFS; next we run the project using the hadoop jar command.
8: If the execution succeeds, an output file is produced containing the results of the project. We list the directory and show the contents of the file; the result is shown as follows.
9: Last, we can sort the values and view the top 5 categories with a command in HDFS (a sample command follows the code sketch below).
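Here is a minimal sketch of the map, reduce, and configuration code described in steps 3-5 (the class names are illustrative; the lab's actual code may differ):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class VideoCategoryCount {

  // Step 3: split each tab-separated line and emit (category, 1).
  public static class CategoryMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] str = value.toString().split("\t");
      if (str.length > 3) {
        context.write(new Text(str[3]), ONE); // column 4 = category
      }
    }
  }

  // Step 4: sum the occurrences for each category.
  public static class CategoryReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  // Step 5: job configuration.
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "video categories");
    job.setJarByClass(VideoCategoryCount.class);
    job.setMapperClass(CategoryMapper.class);
    job.setReducerClass(CategoryReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

For step 9, a command such as `hdfs dfs -cat /user/cloudera/output/part-r-00000 | sort -k2 -n -r | head -n5` (paths illustrative) sorts the counts and shows the top 5 categories.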
Second problem statement:
Find the top 10 rated videos on YouTube.
Solution steps:
1: Download the .txt file of YouTube data.
2: Write the MapReduce project in Eclipse. From the problem statement, we can see that we have to group the data by the id of the video (column 1) and output each video id together with its rating.
3: We write the map function to read the dataset, taking the id of the video as the map output key and the rating as the value (the reducer in step 4 then averages the ratings per video id). We declare a String[] array to store the fields of each line; to read each line, we first split it on '\t', take the first field (the video id) as the key, and take the seventh field (the rating) as the value (see the code sketch after these steps).
4: After the map output is written, we do the shuffle and reduce in the reduce function. We override the reduce method, which runs once for every key. We declare a variable 'l' to count how many values there are for that key and a variable 'sum' to accumulate them; we then calculate the average and set the average as the output value.
5: Then we finish the job configuration code.
6: As the last step in Eclipse, we package the project and export it as a .jar file.
7: Then we execute the .jar file with Hadoop. First we create an input directory and put the data into HDFS; next we run the project using the hadoop jar command.
8: If the execution succeeds, an output file is produced containing the results of the project. We list the directory and show the contents of the file. Last, we can sort the values and view the top 10 rated videos with a command in HDFS. The result is shown as follows.
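Here is a matching sketch for this second problem (again with illustrative names); the reducer uses the count 'l' and running 'sum' described in step 4:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class VideoRating {

  // Step 3: emit (video id, rating) for each line.
  public static class RatingMapper extends Mapper<Object, Text, Text, FloatWritable> {
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] str = value.toString().split("\t");
      if (str.length > 6) {
        try {
          // column 1 = video id, column 7 = rating
          context.write(new Text(str[0]), new FloatWritable(Float.parseFloat(str[6])));
        } catch (NumberFormatException e) {
          // skip lines with a malformed rating field
        }
      }
    }
  }

  // Step 4: average all ratings seen for each video id.
  public static class RatingReducer extends Reducer<Text, FloatWritable, Text, FloatWritable> {
    public void reduce(Text key, Iterable<FloatWritable> values, Context context)
        throws IOException, InterruptedException {
      int l = 0;        // number of values for this key
      float sum = 0f;   // running sum of the ratings
      for (FloatWritable val : values) {
        l++;
        sum += val.get();
      }
      context.write(key, new FloatWritable(sum / l));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "video ratings");
    job.setJarByClass(VideoRating.class);
    job.setMapperClass(RatingMapper.class);
    job.setReducerClass(RatingReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(FloatWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The top 10 can then be read off by sorting the output on the second field, e.g. `hdfs dfs -cat /user/cloudera/output/part-r-00000 | sort -k2 -n -r | head -n10` (paths illustrative).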
Task 3
HIVE USE CASE:
a. Create a Hive table including complex data types.
b. Use built-in functions in your queries.
c. Perform 10 intuitive queries on the dataset (e.g., pattern recognition, topic discussion, most important terms, etc.). Use your innovation to think outside the box.
Request 1:
For the heroPower table, list the name, agility, and matter absorption of the top 10 heroes in ascending order of name.
Request 2:
For the heroInfo table, list all columns of the top 10 heroes in ascending order of id.
Request 3:
List heroes' name, gender, publisher, and agility by joining the two tables, taking the top 10 in ascending order of id.
Request 4:
For the heroInfo table, list the id, name, gender, race, and publisher of all heroes whose publisher includes 'Marvel', taking the top 10 in ascending order of id.
Request 5:
For the heroInfo table, list the id, name, gender, race, and publisher of all heroes whose publisher includes 'Marvel', taking the top 10 in descending order of id.
Request 6:
For the heroPower table, list all of hero Captain America's power statuses.
Request 7:
List hero Joker’s hair color and super strength status.
Request 8:
List 10 results of heroes' names and agility status for heroes of the Human race.
Request 9:
List 10 results of heroes' names and accelerated healing status for heroes with good alignment.
Request 10:
List 10 results of heroes' id, name, gender, and race where the cold resistance status is True and the alignment is good.
Implementation:
Download dataset B and load it into Hive.
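Since the queries themselves are linked below as screenshots, here is a hedged HiveQL sketch of part (a) and two of the requests; the schemas, delimiters, and file path are our assumptions about how dataset B could be modeled, not the exact DDL used in the lab:

```sql
-- Assumed schema for the hero information table (columns follow the requests above).
CREATE TABLE heroInfo (
  id INT, name STRING, gender STRING, race STRING,
  publisher STRING, alignment STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- (a) A table using a complex data type: each hero's powers as a map<string,boolean>.
CREATE TABLE heroPower (
  name STRING,
  powers MAP<STRING, BOOLEAN>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ','
  MAP KEYS TERMINATED BY ':';

-- Illustrative local path; the actual file location depends on the environment.
LOAD DATA LOCAL INPATH '/home/cloudera/heroes_information.csv' INTO TABLE heroInfo;

-- Request 1: name, agility, and matter absorption of the top 10 heroes by name.
SELECT name, powers['Agility'], powers['Matter Absorption']
FROM heroPower
ORDER BY name ASC
LIMIT 10;

-- Request 4 (uses the built-in instr() function, part b): Marvel heroes by id.
SELECT id, name, gender, race, publisher
FROM heroInfo
WHERE instr(publisher, 'Marvel') > 0
ORDER BY id ASC
LIMIT 10;
```

In the same style, a join of heroInfo and heroPower on name covers Request 3, and filtering on powers['Cold Resistance'] together with alignment covers Request 10.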
Query 1
Query 2
Query 3
Query 4
Query 5
Query 6
Query 7
Query 8
Query 9
Query 10
Workflow
Task 1
Pic 1. Ref: https://evantamle.wordpress.com
In Pic 1 we can see the workflow of finding mutual friends with the MapReduce algorithm.
Task 2
This is the workflow for problem 1 of Task 2.
This is the workflow for problem 2 of Task 2.
Datasets
Task 1
For Task 1 we used the soc-LiveJournal dataset (linked here). The dataset consists of numbers, each representing a Facebook profile.
Task 2
For Task 2 the open-source YouTube dataset was used; its column layout is described in the Task 2 section above.
Task 3
B. Super Heroes Dataset
https://www.kaggle.com/claudiodavi/superhero-set/data
Parameters
Task 1:
- HDFS/Hadoop
- Eclipse
- MapReduce

Task 2:
- Cloudera
- Eclipse
- HDFS/Hadoop

Task 3:
- Cloudera
- Hive
Evaluation & discussion
Before implementing Lab assignment #1, we discussed the best approaches and methods for completing the given tasks. We decided to split the tasks among the three team members.
Conclusion
In the given tasks we saw how MapReduce, Hive, and Solr can be used on real-world problems and how efficiently they can be implemented using different techniques. Hadoop and HDFS make processing of big datasets faster by splitting jobs among workers and running each job in parallel.
Contribution
Gulnoza Khakimova - Completed task # 1
Yujing Wu - Completed task # 3
Han Zhou - Completed task # 2
References
http://stevekrenzel.com/finding-friends-with-mapreduce
https://evantamle.wordpress.com/2016/03/14/implement-finding-common-friend-with-map-reduce/
https://acadgild.com/blog/mapreduce-use-case-youtube-data-analysis