Lab 2

CS5590 - Big Data Programming

Lab assignment 2

Gulnoza Khakimova - 13
Yujing Wu - 28
Han Zhou - 30
Video Link:
Add video link
Source Code:
Task 1:
-- Code
-- Output
Task 2:
-- Code
Task 3:
-- Code

Introduction

Apache Spark is an open-source framework used to perform operations on datasets. It distributes data across a cluster, which lets jobs run in parallel while maintaining fault tolerance. Spark is built on the RDD (Resilient Distributed Dataset) abstraction and supports iterative algorithms that make repeated passes over the data.

Objectives

We implemented three tasks for Lab assignment #2 using Apache Spark:

  1. Use the MapReduce algorithm to solve the Facebook common-friends problem and run the MapReduce job on Apache Spark.
  2. Create a DataFrame and perform different operations on it.
  3. Perform a word count on Twitter streaming data using Spark.

Approaches/Methods

Each task was completed using a different technique on a different dataset.

Task 1

For the first task we implemented a common-friend-finder algorithm, like the one Facebook uses to display a list of mutual friends. We used MapReduce and ran the job on Apache Spark. The dataset was provided to us, and we stored it in HDFS so our job could read it.

Implementation

First we started HDFS by running the start-dfs.sh command from the command line, which starts all of the needed nodes. The next step was to create a directory in HDFS and place the dataset file into the newly created directory using the -copyFromLocal command. In order to copy the file we had to make sure that our DataNode was running; running nodes can be checked with the jps command. These steps are sketched below.
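A sketch of the setup commands described above; the target directory and dataset file name are assumptions, not taken from the lab:

```
start-dfs.sh                                          # start the HDFS daemons
jps                                                   # verify NameNode/DataNode are running
hdfs dfs -mkdir -p /user/lab2/input                   # create the input directory (path assumed)
hdfs dfs -copyFromLocal friends.txt /user/lab2/input  # upload the dataset (file name assumed)
```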

For the code implementation part we had to set the HADOOP_HOME and SPARK_HOME variables and make sure that both run on the same version of Python. There are two functions in our code:

  1. Map function
    We map the data by selecting each user and their list of friends. We store each friend list under a unique key built from the user pair.
  2. Reduce function
    We find the common friends by combining the friend lists that arrive under the same key.

    We read the input dataset from HDFS and write the result back into HDFS as well; a minimal PySpark sketch of this job is shown below.
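A minimal sketch of the job described above, assuming each input line has the form user<TAB>friend1,friend2,... (the paths, file names, and install locations are assumptions, not taken from the lab):

```python
import os
# Environment variables mentioned above; the install paths are hypothetical.
os.environ.setdefault("HADOOP_HOME", "/usr/local/hadoop")
os.environ.setdefault("SPARK_HOME", "/usr/local/spark")

from pyspark import SparkContext

sc = SparkContext(appName="CommonFriends")

def emit_pairs(line):
    """Map: for user A with friends [B, C, ...], emit one record per friend,
    keyed by the sorted (A, friend) pair, carrying A's full friend list."""
    user, friends_str = line.split("\t")
    friends = friends_str.split(",")
    for friend in friends:
        key = tuple(sorted((user, friend)))
        yield key, set(friends)

lines = sc.textFile("hdfs:///user/lab2/input/friends.txt")   # input path assumed

# Reduce: the two friend lists sharing the same (A, B) key come from A's
# record and B's record; their intersection is the common friends of A and B.
common = lines.flatMap(emit_pairs).reduceByKey(lambda a, b: a & b)

common.map(lambda kv: "{}: {}".format(kv[0], sorted(kv[1]))) \
      .saveAsTextFile("hdfs:///user/lab2/output")            # output path assumed
```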

    As we can see in HDFS, the output folder contains a _SUCCESS marker file along with the part files that list the common friends.

Task 2
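As a rough illustration of this task's objective (create a DataFrame and perform different operations on it), here is a minimal PySpark sketch; the file path and column names are hypothetical, not from the lab dataset:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Lab2Task2").getOrCreate()

# Load a CSV from HDFS into a DataFrame (path and columns are assumptions).
df = spark.read.csv("hdfs:///user/lab2/input/data.csv",
                    header=True, inferSchema=True)

df.printSchema()                       # inspect the inferred schema
df.select("name").show(5)              # projection
df.filter(df["age"] > 21).show(5)      # filtering
df.groupBy("age").count().show(5)      # aggregation
```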

Task 3

HIVE USE CASE:

Implementation:
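As a rough illustration of the third objective (word count on Twitter streaming data), here is a minimal PySpark Streaming sketch. It assumes tweet text is forwarded to a local TCP socket, one tweet per line; the host, port, and batch interval are assumptions, and the actual lab may read from the Twitter API instead:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="TwitterWordCount")
ssc = StreamingContext(sc, 10)         # 10-second micro-batches (assumed)

# Tweet text is assumed to arrive on a local socket, one tweet per line.
tweets = ssc.socketTextStream("localhost", 9009)

counts = (tweets.flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))
counts.pprint()                        # print each batch's word counts

ssc.start()
ssc.awaitTermination()
```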

Workflow

Task 1

Pic 1. Ref: https://evantamle.wordpress.com
In Pic 1 we can see the workflow of finding mutual friends using the MapReduce algorithm.

Task 2

Datasets

Task 1
For task 1 we were given a dataset containing users and their lists of friends. The dataset can be found here.

Output

Task 2

Task 3

Parameters

Task 1:

  • HDFS/Hadoop

  • Apache Spark

  • PyCharm

  • MapReduce function

Task 2:

  • HDFS/Hadoop

Task 3:

Evaluation & discussion

We discussed our work and the best ways to implement each task, and divided the work evenly.

Conclusion

We completed Lab 2 successfully by trying different methods and choosing the best approaches. For task 1 we had to figure out why the DataNode was not starting, which required resetting HDFS. The given tasks helped us work on real-world problems and showed us which methods are best suited to different scenarios.

Contribution

Gulnoza Khakimova - Completed task #1
Yujing Wu - Completed task #3
Han Zhou - Completed task #2

References

https://snap.stanford.edu/data/egonets-Facebook.html
https://evantamle.wordpress.com/2016/03/14/implement-finding-common-friend-with-map-reduce/
https://databricks.com/spark/about