ICP 9 - manaswinivedula/Big-Data-Programming GitHub Wiki

Sbt file dependencies

The dependencies added to the build.sbt file are shown below.

Task 1: K-means clustering

  • It is an unsupervised machine learning algorithm used to separate data into a required number of clusters.
  • In k-means clustering, the data is initially separated into clusters with a random centroid for each cluster.
  • Here, a random sample of 1000 points and 10 clusters are used.
  • The distance from each point to each centroid is then calculated, and the point is assigned to the cluster whose centroid is at the smallest Euclidean distance.
  • The same process continues for each point, and new centroids are calculated.
  • This entire process repeats until the centroids become constant (converge).
  • Finally, the initial and final centroids are displayed in the output.
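The loop described above can be sketched in plain Scala. This is a minimal local sketch of the k-means steps, not the Spark `LocalKMeans` example used in the task; all names are illustrative:

```scala
import scala.util.Random

object KMeansSketch {
  type Point = Array[Double]

  // Squared Euclidean distance between two points.
  def distSq(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the centroid nearest to point p.
  def closest(p: Point, centers: Array[Point]): Int =
    centers.indices.minBy(i => distSq(p, centers(i)))

  def kmeans(points: Array[Point], k: Int, eps: Double = 1e-9): Array[Point] = {
    // Pick k random points as the initial centroids.
    var centers = Random.shuffle(points.toList).take(k).toArray
    var moved = Double.MaxValue
    while (moved > eps) {
      // Assign every point to its nearest centroid.
      val groups = points.groupBy(p => closest(p, centers))
      // Recompute each centroid as the mean of its assigned points.
      val next = centers.indices.toArray.map { i =>
        groups.get(i) match {
          case Some(ps) =>
            val dim = ps.head.length
            Array.tabulate(dim)(d => ps.map(_(d)).sum / ps.length)
          case None => centers(i) // an empty cluster keeps its old centroid
        }
      }
      // Stop once the centroids have (almost) stopped moving.
      moved = centers.zip(next).map { case (a, b) => distSq(a, b) }.sum
      centers = next
    }
    centers
  }
}
```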

The output of the initial and final centers is shown below.

Task 2: Merge sort

  • Merge sort follows the divide-and-conquer approach: the given input is divided completely into individual elements, which are then merged and sorted simultaneously.

Below is the workflow of the algorithm for the inputs used in the task.

  • Initially, the input is given, and it is mapped with a user-defined function named mergesort.
  • In that function, the input is divided into two parts, left and right, and divided further completely, as shown in the above figure.
  • These divided elements are then passed through a merge function, where merging and sorting take place to produce the final sorted output.
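The divide-and-merge steps above can be sketched in plain Scala. This is a minimal local sketch, not the Spark RDD version used in the task; the function names are illustrative:

```scala
object MergeSortSketch {
  // Merge two already-sorted lists into one sorted list.
  def merge(left: List[Int], right: List[Int]): List[Int] = (left, right) match {
    case (Nil, _) => right
    case (_, Nil) => left
    case (l :: ls, r :: rs) =>
      if (l <= r) l :: merge(ls, right) else r :: merge(left, rs)
  }

  // Divide the input in half, sort each half recursively, then merge.
  def mergeSort(xs: List[Int]): List[Int] = {
    val mid = xs.length / 2
    if (mid == 0) xs // lists of length 0 or 1 are already sorted
    else {
      val (left, right) = xs.splitAt(mid)
      merge(mergeSort(left), mergeSort(right))
    }
  }
}
```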

The sorted output for the given input is shown below.

Task 3: Depth-first search

  • Depth-first search mainly tracks whether each node has been visited, and it uses a stack data structure.
  • Initially, any node can be taken as visited; all the nodes adjacent to it are checked and pushed onto the stack until all the adjacent nodes of the visited nodes are traversed.
  • Then the topmost element of the stack is taken and checked: if it has not been visited, it is added to the visited-node list, and this process continues.

The graph that is taken as input for this task is as shown below.

  • Initially, a graph is constructed with the adjacency list of each node.
  • Then the DFS method is run, checking whether each element has been visited.
  • If not, the element is added to the visited list.
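The steps above can be sketched as an iterative DFS in plain Scala, using an explicit stack and a visited list. This is a minimal sketch with an assumed adjacency-list representation; the graph used in the task may differ:

```scala
object DfsSketch {
  // Iterative depth-first search over an adjacency-list graph.
  def dfs(graph: Map[Int, List[Int]], start: Int): List[Int] = {
    var visited = List.empty[Int]
    var stack = List(start) // the head of the list acts as the top of the stack
    while (stack.nonEmpty) {
      val node = stack.head // take the topmost element of the stack
      stack = stack.tail
      if (!visited.contains(node)) {
        visited = visited :+ node // mark the node as visited
        // Push the unvisited neighbours; they will be explored depth-first.
        stack = graph.getOrElse(node, Nil).filterNot(visited.contains) ++ stack
      }
    }
    visited
  }
}
```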

The output of the depth-first search over all the nodes is displayed below.

References:

  1. https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/LocalKMeans.scala

  2. https://www.tutorialspoint.com/apache_spark/advanced_spark_programming.htm

  3. https://data-flair.training/blogs/spark-rdd-operations-transformations-actions/