SPARK MERGE SORT - praveenpoluri/Big-Data-Programing GitHub Wiki
Aim:
To apply the Merge Sort and Depth-First Search algorithms using Apache Spark.
Introduction:
Apache Spark is an open-source, general-purpose, lightning-fast cluster computing system that provides high-level APIs in Java, Scala, Python, and R. It can access data from HDFS, Cassandra, HBase, Hive, Tachyon, and any other Hadoop data source, and it can run under the Standalone, YARN, or Mesos cluster managers. Spark is written in Scala, and for in-memory workloads it can be up to 100 times faster than Hadoop MapReduce, and up to 10 times faster when processing data from disk.

Earlier engines each handle only one processing style: Hadoop MapReduce performs only batch processing, Apache Storm / S4 perform only stream processing, Apache Impala / Apache Tez perform only interactive processing, and Neo4j / Apache Giraph perform only graph processing. Hence there is a big demand in industry for a single powerful engine that can process data in real time (streaming) as well as in batch mode, respond in sub-second latencies, and perform in-memory processing. Apache Spark is such an engine: it provides real-time stream processing, interactive processing, graph processing, in-memory processing, and batch processing with very fast speed, ease of use, and a standard interface. This is what sets Spark apart in the Hadoop vs. Spark and Spark vs. Storm comparisons.
Software components:
- Apache Spark
- Intellij
- JDK
- Scala
Procedure:
Task 1:
Merge Sort Algorithm:
How Merge Sort works: Merge Sort is a Divide and Conquer algorithm. It divides the input array into two halves, calls itself recursively for each half, and then merges the two sorted halves. The merge() function is used for merging the two halves: merge(arr, l, m, r) is the key step, which assumes that arr[l..m] and arr[m+1..r] are already sorted and merges the two sorted sub-arrays into one.
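The algorithm described above can be sketched in plain Python (illustrative only; the repo's actual implementation uses Scala and Spark). The `merge(arr, l, m, r)` function below mirrors the merge step exactly as described:

```python
def merge(arr, l, m, r):
    """Merge the sorted sub-arrays arr[l..m] and arr[m+1..r] in place."""
    left = arr[l:m + 1]
    right = arr[m + 1:r + 1]
    i = j = 0
    k = l
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            arr[k] = left[i]
            i += 1
        else:
            arr[k] = right[j]
            j += 1
        k += 1
    # Copy over whatever remains in either half.
    while i < len(left):
        arr[k] = left[i]
        i += 1
        k += 1
    while j < len(right):
        arr[k] = right[j]
        j += 1
        k += 1

def merge_sort(arr, l, r):
    """Recursively split arr[l..r] into halves, sort each, then merge."""
    if l < r:
        m = (l + r) // 2
        merge_sort(arr, l, m)
        merge_sort(arr, m + 1, r)
        merge(arr, l, m, r)

data = [38, 27, 43, 3, 9, 82, 10]
merge_sort(data, 0, len(data) - 1)
print(data)  # [3, 9, 10, 27, 38, 43, 82]
```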
- In the main function we read the input into the input variable; the Spark context, which is the entry point of a Spark application, is created here. We print the input to the console.
- Below is the functionality to split the list into individual elements.
- Below is the functionality to sort the list and merge it back into a single list using a mapper.

Output screenshot: https://github.com/praveenpoluri/Big-Data-Programing/blob/master/spark-icp2/Documents/merge-out.PNG
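The split-and-merge flow in the steps above (divide the list into single-element lists, then merge them back into one sorted list) can be sketched without Spark as a fold over singleton lists. This is only an illustration of the idea; `merge_two` is a hypothetical helper name, and the repo's code does this with Scala RDDs and a mapper instead:

```python
from functools import reduce

def merge_two(a, b):
    """Merge two already-sorted lists into one sorted list."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    return out + a[i:] + b[j:]

values = [5, 1, 4, 2, 3]
singletons = [[v] for v in values]      # "split into individual elements"
result = reduce(merge_two, singletons)  # "merge back into one list"
print(result)  # [1, 2, 3, 4, 5]
```

In Spark the same shape appears as a transformation over an RDD of elements followed by a merge in the reduce phase, which is what makes merge sort a natural fit for the divide-and-conquer model.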
Task 2:
DepthFirst Search in Graph in Apache Spark:
How depth-first search works: Depth-first search (DFS) is an algorithm for traversing or searching tree or graph data structures. The algorithm starts at the root node (selecting some arbitrary node as the root in the case of a graph) and explores as far as possible along each branch before backtracking.
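The traversal described above can be sketched in plain Python (illustrative only; the repo's implementation uses Scala and Spark, and the example graph below is hypothetical):

```python
def dfs(graph, start, visited=None):
    """Depth-first traversal: explore as far as possible along each
    branch before backtracking. Returns nodes in visit order."""
    if visited is None:
        visited = []
    visited.append(start)
    for neighbour in graph.get(start, []):
        if neighbour not in visited:
            dfs(graph, neighbour, visited)
    return visited

# Hypothetical example graph as an adjacency list.
graph = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["E"],
    "D": [],
    "E": [],
}
print(dfs(graph, "A"))  # ['A', 'B', 'D', 'C', 'E']
```

Note how the traversal descends A → B → D before backtracking to visit C and E, which is exactly the "as far as possible along each branch" behaviour described above.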
- In the main function we read the input into the input variable; the Spark context, which is the entry point of a Spark application, is created here. We print the input to the console.
- Below is the functionality to search the graph for a particular element by organizing the data as a tree and searching from the root.
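The search described above (look for a particular element by descending from the root) can be sketched as an iterative DFS in plain Python. This is only an illustration under assumed names (`dfs_search`, the example graph); the repo's version runs on Spark in Scala:

```python
def dfs_search(graph, root, target):
    """Search for target starting from the root, descending
    depth-first along each branch before backtracking."""
    stack = [root]
    seen = set()
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node in seen:
            continue
        seen.add(node)
        # Push neighbours so the deepest branch is explored first.
        stack.extend(graph.get(node, []))
    return False

# Hypothetical example graph as an adjacency list.
graph = {"A": ["B", "C"], "B": ["D"], "C": [], "D": []}
print(dfs_search(graph, "A", "D"))  # True
print(dfs_search(graph, "A", "Z"))  # False
```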
Output:
Limitations:
- No built-in file management system; Spark relies on HDFS or other external storage.
- No true record-level real-time processing; Spark Streaming processes data in micro-batches.
- Expensive, since in-memory processing requires large amounts of RAM.
- Small-files issue when reading many small files from storage.
- Higher latency than specialized streaming engines.
- Fewer built-in algorithms (e.g., in MLlib) than more mature libraries.
References:
- https://www.google.com/search?q=depth+first+search&rlz=1C1CHBF_enUS824US825&oq=depth+first+search&aqs=chrome..69i57j0l7.6794j0j7&sourceid=chrome&ie=UTF-8
- https://umkc.app.box.com/s/jvxy3898wufl6z2u8dqx10ov7tapmiif