Exam 2 - manaswinivedula/Big-Data-Programming GitHub Wiki

Project-based exam 2

Team Members

Team ID: 5

  1. Manaswini Vedula, Class ID: 6
  2. Srujan reddy Tirupally, Class ID: 7
  3. Alekhya Thangallapelly, Class ID: 16
  4. Mohit Sriram Tirumala, Class ID: 15

Contributions and Links

Manaswini Vedula - Task 1

Srujan reddy - Task 2 (link for profile)

Alekhya Thangallapelly - Task 3 (link for profile)

Mohit Sriram Tirumala - Task 4 (link for profile)

Task1:

Introduction: Facebook is one of the most popular social media platforms today. It generates petabytes of data every day, with numerous new accounts created and millions of new connections established daily. One way to judge whether you know the person who sent you a friend request is to look at the mutual connections you share on Facebook.

Objective: To write a MapReduce algorithm that finds the mutual friends for the given input pairs using Apache Spark.

Approach:

The MapReduce algorithm consists of four phases:

  1. Splitting phase
  2. Mapper phase
  3. Shuffling phase
  4. Reducer phase

The following diagram shows how the input file goes through the different phases to generate the output.

After the project is created in IntelliJ, the library dependencies are added to the build.sbt file.
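As a rough illustration, a build.sbt for a Spark project of this kind might look like the sketch below. The project name and all version numbers are assumptions chosen for the example; they are not taken from the original project.

```scala
// build.sbt (sketch only: the name and every version number are assumptions)
name := "mutual-friends"
version := "0.1"
scalaVersion := "2.12.10"

// spark-core provides SparkContext and the RDD API (flatMap, reduceByKey)
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.5"
```

The `%%` operator appends the Scala binary version to the artifact name, so the Spark build has to match the declared `scalaVersion`.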

Splitting and Mapping Method:

  1. A SparkContext is created first; it serves as the entry point to Spark.
  2. The user-defined map function takes each line of the input file and splits it on the spaces between the values.
  3. After the split, the first word is taken as the key and the remaining words are treated as values.
  4. Each value is compared with the key: if the key is less than the value, the pair (key, value) is returned; otherwise the pair is (value, key).
  5. Each pair is then mapped to the values, which are stored as a set.
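The mapping steps above can be sketched in plain Scala as follows. This is a minimal sketch, not the project's actual code: the object name `MutualFriendsMapper`, the `Pair` alias, and the choice to compare keys as strings (rather than as numbers) are assumptions made for illustration.

```scala
object MutualFriendsMapper {
  // Hypothetical alias: an ordered pair of user IDs, smaller ID first.
  type Pair = (String, String)

  // Turns one input line ("user friend1 friend2 ...") into
  // ((user, friend), friends-of-user) records, one per listed friend.
  def mapper(line: String): Seq[(Pair, Set[String])] = {
    val tokens = line.trim.split("\\s+").toSeq
    if (tokens.isEmpty || tokens.head.isEmpty) Seq.empty // skip blank lines
    else {
      val user = tokens.head        // first word is the key
      val friends = tokens.tail.toSet // remaining words, stored as a set
      tokens.tail.map { friend =>
        // Order the pair so (A, B) and (B, A) end up under the same key.
        // Assumption: IDs are compared as strings.
        val pair = if (user < friend) (user, friend) else (friend, user)
        pair -> friends
      }
    }
  }
}
```

For the line `A B C`, this yields one record keyed by `(A, B)` and one keyed by `(A, C)`, each carrying the set of A's friends. Ordering the pair is what lets the records produced from A's line and from B's line meet under one key later.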

Shuffling and reducing method:

  1. Shuffling is done internally by Spark. In the reducer phase, the keys are compared and the values of matching keys are combined.
  2. The input file is then read, a flatMap operation is applied with the user-defined map function, and reduceByKey is applied with the user-defined reduce function.
  3. The results are printed to the console and saved as a text file.
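The reduce step and the overall flow can be sketched without a Spark cluster by using plain Scala collections, where `groupBy` followed by `reduce` stands in for Spark's shuffle and `reduceByKey`. This is a sketch under assumptions: the names `MutualFriendsPipeline`, `reducer`, and `mutualFriends` are hypothetical, and "combining" the values is assumed to mean set intersection, since that is what leaves only the friends both users share.

```scala
object MutualFriendsPipeline {
  type Pair = (String, String)

  // Same mapping idea as described above: ((user, friend), friends-of-user).
  def mapper(line: String): Seq[(Pair, Set[String])] = {
    val tokens = line.trim.split("\\s+").toSeq
    if (tokens.isEmpty || tokens.head.isEmpty) Seq.empty
    else {
      val user = tokens.head
      val friends = tokens.tail.toSet
      tokens.tail.map { friend =>
        val pair = if (user < friend) (user, friend) else (friend, user)
        pair -> friends
      }
    }
  }

  // Assumed reduce function: keep only the friends present in both sets.
  def reducer(a: Set[String], b: Set[String]): Set[String] = a intersect b

  // Local stand-in for the Spark job. With a SparkContext `sc`, the
  // equivalent chain would be roughly:
  //   sc.textFile(inputPath).flatMap(mapper).reduceByKey(reducer)
  //     .saveAsTextFile(outputPath)
  // where inputPath and outputPath are placeholders.
  def mutualFriends(lines: Seq[String]): Map[Pair, Set[String]] =
    lines
      .flatMap(mapper)   // map phase
      .groupBy(_._1)     // shuffle: bring records with the same key together
      .map { case (pair, records) =>
        pair -> records.map(_._2).reduce(reducer) // reduce phase
      }
}
```

With the input lines `A B C D`, `B A C D E`, and `C A B`, the pair `(A, B)` receives A's set {B, C, D} and B's set {A, C, D, E}, and their intersection {C, D} is the list of mutual friends. A pair that appears in only one user's line is passed through unchanged, which mirrors how `reduceByKey` treats keys with a single value.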

Workflow:

  1. First, the sample input file is given as input to the code. The file is shown below.

  2. The mutual friends for this file, as printed on the console, are as follows.

  3. The output generated and saved after running MapReduce on this file is as follows.

  4. Next, the provided facebook_combined.txt is given as the input file to the code. The file is shown below.

  5. The mutual friends for this file, as printed on the console, are as follows.

  6. The output generated and saved after running MapReduce on this file is as follows.

Task5

1. Explain the idea of your work done for this Exam briefly.

I created mapper and reducer functions to find mutual friends on Facebook. The code reads the input file, applies flatMap with the user-defined map function, applies reduceByKey with the user-defined reduce function, and finally saves the generated output to a folder.

2. Explain the usage of all the above questions in today's world.

Task 1 - Finding mutual friends is used on social media platforms such as Facebook, Twitter, and Instagram to make connecting easier and to help people expand their networks.

Task 2 - It is used to find patterns in the underlying data and to bring out meaningful insights from it. This task plays a key role in data analytics.

Task 3 - Spark Streaming is one of the important features of Spark. Now that the internet is part of our lives, a huge amount of data is generated every second, so streaming that data is essential for processing it or drawing conclusions from it.

Task 4 - Page ranking is useful for surfacing the best-quality websites ahead of less relevant ones.

3. Mention clearly the portion of the project on which you have worked.

I worked on the logic, video, and documentation for Task 1.

4. What challenges did you face during the development process?

After writing the user-defined mapper and reducer functions, I got stuck on how to display the values for each pair. I eventually solved this by storing them as sets.

5. Explain the milestones of your project and briefly discuss how you integrated your part (e.g. based on queries etc.) with other team members' work and what issues you faced, e.g. compatibility.

The milestone of this project was to stream data directly from Facebook and then run the mutual friends MapReduce algorithm on the real-time data. Spark suits this goal because it performs big data operations with low latency. Everyone worked on separate tasks individually, but as a team we discussed the approach for each task.

References:

  1. https://umkc.box.com/s/qv7iw89pzdafhg1308ztymx4qqe70awf
  2. https://umkc.box.com/s/ujksg7hnkoz6yg5oxdqy0z5dp83fjbwp