Exam 1 Part 2: YouTube - gabriellawillis/BigData GitHub Wiki
2. Use Case: Implement a MapReduce algorithm to analyze a YouTube dataset.
Technologies Used:
- VirtualBox
- Cloudera
- Hadoop
MapReduce:
- Single master node, many worker nodes
- Client submits a job to the master node
- Master splits each job into tasks (MapReduce) and assigns the tasks to worker nodes
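The master/worker flow described above can be sketched in plain Java. This is only an illustration and does not use the Hadoop API: the class name MasterWorkerSketch and the methods mapTask and runJob are hypothetical, and a thread pool stands in for the worker nodes. The "master" splits the input into fixed-size tasks, hands each task to a worker, and then merges the partial counts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for the master/worker flow; threads play the role of worker nodes.
class MasterWorkerSketch {

    // Map task: count how often each key appears in one split of the input.
    static Map<String, Integer> mapTask(List<String> split) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String key : split) {
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    // Master: split the job into tasks, assign them to workers, then reduce the partial results.
    static Map<String, Integer> runJob(List<String> input, int splitSize, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Map<String, Integer>>> pending = new ArrayList<>();
            for (int start = 0; start < input.size(); start += splitSize) {
                List<String> split = input.subList(start, Math.min(start + splitSize, input.size()));
                pending.add(pool.submit(() -> mapTask(split)));
            }
            // Reduce step: merge the per-split counts into one total per key.
            Map<String, Integer> totals = new TreeMap<>();
            for (Future<Map<String, Integer>> partial : pending) {
                try {
                    partial.get().forEach((key, count) -> totals.merge(key, count, Integer::sum));
                } catch (InterruptedException | ExecutionException e) {
                    throw new IllegalStateException("map task failed", e);
                }
            }
            return totals;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Made-up input: five keys, split into tasks of two keys each, run on two workers.
        List<String> input = List.of("Music", "Comedy", "Music", "Sports", "Music");
        System.out.println(runJob(input, 2, 2)); // prints {Comedy=1, Music=3, Sports=1}
    }
}
```

In a real Hadoop job the framework does the splitting, scheduling, and shuffling itself; only the map and reduce functions are written by hand.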
Problem Statement 1:
Find the top 5 categories with the maximum number of videos uploaded.
Create the Mapper and Reducer code to find the top 5 categories in the dataset. This is done through iterations.
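The counting logic behind this problem can be sketched in plain Java, without the Hadoop API, so that it runs on its own. It is not the wiki author's Top5_categories class. The class name TopCategoriesSketch and its methods are hypothetical, and the record layout is an assumption: the sketch assumes tab-separated lines with the category in the fourth field (zero-based index 3), which should be checked against the actual youtubedata.txt file.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the top-N categories logic; names and field layout are assumptions.
class TopCategoriesSketch {

    // ASSUMPTION: records are tab-separated and the category sits at this zero-based index.
    static final int CATEGORY_FIELD = 3;

    // Map step: extract the category of one record, or null for a malformed line.
    static String mapCategory(String line) {
        String[] fields = line.split("\t");
        return fields.length > CATEGORY_FIELD ? fields[CATEGORY_FIELD] : null;
    }

    // Reduce step: add 1 per record to the running count of its category.
    static Map<String, Integer> countByCategory(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            String category = mapCategory(line);
            if (category != null) {
                counts.merge(category, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Final step: sort by count (descending, ties by name) and keep the first n entries.
    static List<Map.Entry<String, Integer>> topN(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
    }

    public static void main(String[] args) {
        // Made-up sample records, not taken from the real dataset.
        List<String> lines = List.of(
                "v1\tuser1\t100\tMusic",
                "v2\tuser2\t200\tComedy",
                "v3\tuser3\t300\tMusic",
                "v4\tuser4\t400\tSports",
                "v5\tuser5\t500\tComedy",
                "v6\tuser6\t600\tMusic");
        System.out.println(topN(countByCategory(lines), 2)); // prints [Music=3, Comedy=2]
    }
}
```

In a Hadoop job, mapCategory would correspond to the body of the Mapper (emitting the category with a count of 1) and the summing loop to the Reducer; the sorting step still has to happen after all counts are known.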
Execution after packaging the Java file into a JAR file:
Command: hadoop jar /home/cloudera/Desktop/top5.jar Top5_categories input/youtubedata.txt output
Output for Top 5
Command: hadoop fs -cat output/part-r-00000
Problem Statement 2:
Find the top 10 rated videos on YouTube.
Create the Mapper and Reducer code.
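The rating logic can be sketched the same way in plain Java, without the Hadoop API. It is not the wiki author's Video_rating class. The class name TopRatedSketch and its methods are hypothetical, and the record layout is an assumption: the sketch assumes tab-separated lines with the video ID in the first field and a numeric rating in the seventh field (zero-based indices 0 and 6), which should be checked against the actual youtubedata.txt file. Ratings are averaged per video ID in case an ID appears more than once.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the top-rated videos logic; names and field layout are assumptions.
class TopRatedSketch {

    // ASSUMPTION: tab-separated records with the video ID and rating at these zero-based indices.
    static final int VIDEO_ID_FIELD = 0;
    static final int RATING_FIELD = 6;

    // Map + reduce: group ratings by video ID and average them; malformed lines are skipped.
    static Map<String, Double> averageRatings(List<String> lines) {
        Map<String, double[]> sumAndCount = new HashMap<>();
        for (String line : lines) {
            String[] fields = line.split("\t");
            if (fields.length <= RATING_FIELD) {
                continue;
            }
            double rating;
            try {
                rating = Double.parseDouble(fields[RATING_FIELD]);
            } catch (NumberFormatException e) {
                continue;
            }
            double[] acc = sumAndCount.computeIfAbsent(fields[VIDEO_ID_FIELD], id -> new double[2]);
            acc[0] += rating; // running sum of ratings
            acc[1] += 1;      // number of ratings seen
        }
        Map<String, Double> averages = new HashMap<>();
        sumAndCount.forEach((id, acc) -> averages.put(id, acc[0] / acc[1]));
        return averages;
    }

    // Final step: sort by average rating (descending, ties by ID) and keep the first n entries.
    static List<Map.Entry<String, Double>> topN(Map<String, Double> averages, int n) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(averages.entrySet());
        entries.sort((a, b) -> {
            int byRating = Double.compare(b.getValue(), a.getValue());
            return byRating != 0 ? byRating : a.getKey().compareTo(b.getKey());
        });
        return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
    }

    public static void main(String[] args) {
        // Made-up sample records, not taken from the real dataset.
        List<String> lines = List.of(
                "a\tuser1\t100\tMusic\t210\t5000\t4.5",
                "b\tuser2\t200\tComedy\t95\t1200\t3.0",
                "a\tuser1\t100\tMusic\t210\t5000\t3.5");
        System.out.println(topN(averageRatings(lines), 10)); // prints [a=4.0, b=3.0]
    }
}
```

In a Hadoop job, the parsing would sit in the Mapper (emitting the video ID with its rating) and the averaging in the Reducer.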
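Execution after packaging the Java file into a JAR file. Hadoop will not write to an output directory that already exists, so remove or rename the output directory left over from Problem Statement 1 before running this job: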
Command: hadoop jar /home/cloudera/Desktop/Video_rating.jar Video_rating input/youtubedata.txt output
Output for Video Rating
Command: hadoop fs -cat output/part-r-00000
YouTube Data: https://umkc.app.box.com/s/oiagnrbmolxzc0tdkjw5jxbcfecis0un