Exam 1 Part 2: YouTube - gabriellawillis/BigData GitHub Wiki

2. Use Case: Implement a MapReduce algorithm to analyze the YouTube dataset.

Technologies Used:

  • VirtualBox
  • Cloudera
  • Hadoop

MapReduce:

  • Single master node, many worker nodes

  • Client submits a job to the master node

  • Master splits each job into tasks (map and reduce) and assigns the tasks to worker nodes
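The flow in the bullets above can be imitated in a single process. The following is only a toy sketch, not the Hadoop API: the class MiniMapReduce and its methods are hypothetical names, a thread pool stands in for the worker nodes, and word counting is used as a stand-in job.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Toy, single-process imitation of the master/worker flow; not the Hadoop API.
final class MiniMapReduce {

    // "Map task": one worker counts the words in its own split of the input.
    static Map<String, Integer> mapTask(List<String> split) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : split) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // "Reduce task": merge the partial counts produced by all map tasks.
    static Map<String, Integer> reduceTask(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, n) -> total.merge(word, n, Integer::sum));
        }
        return total;
    }

    // "Master": split the job into tasks, hand them to workers, combine the results.
    static Map<String, Integer> runJob(List<String> input, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            int splitSize = Math.max(1, (input.size() + workers - 1) / workers);
            List<Future<Map<String, Integer>>> pending = new ArrayList<>();
            for (int start = 0; start < input.size(); start += splitSize) {
                int end = Math.min(input.size(), start + splitSize);
                List<String> split = input.subList(start, end);
                pending.add(pool.submit(() -> mapTask(split)));
            }
            List<Map<String, Integer>> partials = new ArrayList<>();
            for (Future<Map<String, Integer>> task : pending) {
                partials.add(task.get()); // wait for each map task to finish
            }
            return reduceTask(partials);
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("job failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Calling MiniMapReduce.runJob(lines, 2) splits the lines into two map tasks and merges their partial counts. A real Hadoop job additionally distributes the data across machines and groups keys between the map and reduce phases, which this sketch leaves out.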

Problem Statement 1:

Find the top 5 categories with the maximum number of videos uploaded.

Create the Mapper and Reducer code to find the top 5 categories in the dataset. This will be done through iterations. (VM screenshot)
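The page does not reproduce the Mapper and Reducer source, so the following is only a plain-Java sketch of the logic such a job might use, not the actual Top5_categories class. It assumes that each record in youtubedata.txt is tab-separated with the category in the fourth field; that layout, the class name Top5CategoriesSketch, and its method names are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the map and reduce logic; not the actual Top5_categories class.
final class Top5CategoriesSketch {

    // ASSUMPTION: tab-separated records with the category in the 4th field (index 3).
    static final int CATEGORY_INDEX = 3;

    // Map step: extract the category of one record, or null for a malformed record.
    static String mapCategory(String line) {
        String[] fields = line.split("\t");
        return fields.length > CATEGORY_INDEX ? fields[CATEGORY_INDEX].trim() : null;
    }

    // Reduce step: add 1 to the running count of each category seen.
    static Map<String, Integer> countByCategory(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            String category = mapCategory(line);
            if (category != null && !category.isEmpty()) {
                counts.merge(category, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Post-processing: sort by count (descending), break ties by name, keep the first n.
    static List<Map.Entry<String, Integer>> top(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
    }
}
```

With the real dataset, top(countByCategory(lines), 5) would list the five categories with the most records. In a Hadoop job the counting would be spread across a Mapper (emitting the category with a count of 1) and a Reducer (summing the counts), with the top-5 selection done in the reducer or as a later step.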

Execution after packaging the Java file into a JAR file:

Command: hadoop jar /home/cloudera/Desktop/top5.jar Top5_categories input/youtubedata.txt output

Output for Top 5

Command: hadoop fs -cat output/part-r-00000

Problem Statement 2:

Find the top 10 rated videos on YouTube.

Create the Mapper and Reducer code.
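As with the first problem, the source for this job is not shown here, so the following is a plain-Java sketch of one possible approach rather than the actual Video_rating class. It assumes tab-separated records with the video id in the first field and the rating in the seventh field, and it averages the ratings if a video id appears more than once; the field positions, the class name TopRatedVideosSketch, and its members are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the map and reduce logic; not the actual Video_rating class.
final class TopRatedVideosSketch {

    static final int VIDEO_ID_INDEX = 0; // ASSUMPTION: video id is the 1st field
    static final int RATING_INDEX = 6;   // ASSUMPTION: rating is the 7th field

    // Holder for a video id and its average rating.
    static final class Rated {
        final String videoId;
        final double rating;

        Rated(String videoId, double rating) {
            this.videoId = videoId;
            this.rating = rating;
        }
    }

    // Map + reduce: collect (video id, rating) pairs and average them per video id.
    static Map<String, Double> averageRatings(List<String> lines) {
        Map<String, double[]> sumAndCount = new HashMap<>();
        for (String line : lines) {
            String[] fields = line.split("\t");
            if (fields.length <= RATING_INDEX) {
                continue; // skip records that are too short
            }
            double rating;
            try {
                rating = Double.parseDouble(fields[RATING_INDEX].trim());
            } catch (NumberFormatException e) {
                continue; // skip records whose rating is not a number
            }
            double[] acc = sumAndCount.computeIfAbsent(
                    fields[VIDEO_ID_INDEX].trim(), id -> new double[2]);
            acc[0] += rating; // running sum
            acc[1] += 1;      // running count
        }
        Map<String, Double> averages = new HashMap<>();
        sumAndCount.forEach((id, acc) -> averages.put(id, acc[0] / acc[1]));
        return averages;
    }

    // Post-processing: sort by rating (descending), break ties by id, keep the first n.
    static List<Rated> top(Map<String, Double> averages, int n) {
        List<Rated> all = new ArrayList<>();
        averages.forEach((id, rating) -> all.add(new Rated(id, rating)));
        all.sort((a, b) -> {
            int byRating = Double.compare(b.rating, a.rating);
            return byRating != 0 ? byRating : a.videoId.compareTo(b.videoId);
        });
        return new ArrayList<>(all.subList(0, Math.min(n, all.size())));
    }
}
```

With the real dataset, top(averageRatings(lines), 10) would give the ten highest-rated videos under these assumptions. Many videos may share the maximum rating, so the tie-break rule (here, alphabetical by video id) decides which of them appear in the list.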

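Execution after packaging the Java file into a JAR file. Hadoop will not overwrite an existing output directory, so the output directory from Problem Statement 1 must be removed first, or a different output path used: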

Command: hadoop jar /home/cloudera/Desktop/Video_rating.jar Video_rating input/youtubedata.txt output

Output for Video Rating

Command: hadoop fs -cat output/part-r-00000

YouTube Data: https://umkc.app.box.com/s/oiagnrbmolxzc0tdkjw5jxbcfecis0un