ICP 9 - Murarishetti-Shiva-Kumar/Big-Data-Programming GitHub Wiki

Lesson Plan 9: Apache Spark II

Configurations

Add required library dependencies to the build.sbt file.

image
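Since the dependencies are only visible in the screenshot, here is a minimal sketch of what such a build.sbt might look like. The project name, the Scala version, and the Spark version are assumptions chosen for illustration, not values taken from the original; use the versions that match your installation.

```scala
// build.sbt (sketch; name and versions are assumptions, not from the original)
name := "icp9-spark"

scalaVersion := "2.12.18"

libraryDependencies ++= Seq(
  // Core Spark API (SparkContext, RDDs)
  "org.apache.spark" %% "spark-core"  % "3.5.0",
  // MLlib, which provides clustering algorithms such as K-Means
  "org.apache.spark" %% "spark-mllib" % "3.5.0"
)
```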

K-Means Clustering Algorithm

K-means is an unsupervised, centroid-based (distance-based) algorithm: each point is assigned to a cluster according to its distance from the cluster centroids. Every cluster is associated with one centroid, and the objective of the algorithm is to minimize the sum of squared distances between the points and the centroid of their cluster. It starts with a group of randomly selected centroids, which serve as the starting point of each cluster, and then repeatedly reassigns points and recomputes the centroids to optimize their positions.

  1. The number of clusters is chosen arbitrarily and used as the value of k

  2. The input range is also chosen arbitrarily

image

  1. Remove all the headers from the input data

image

  1. Cluster the data into three classes using K-means

image

  1. Classify the observations into clusters, then calculate the mean squared error and the centroid of each cluster

image

The output is as follows:

image
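The steps above can be sketched in plain Scala, without Spark, to show what the algorithm does in each iteration. This is not the code from the screenshots: the names `KMeansSketch`, `kMeans`, `closest`, and `mean` are illustrative, the first k points are used as initial centroids instead of random ones so that the run is reproducible, and the reported error is the sum of squared distances to the assigned centroid. All of these are assumptions.

```scala
object KMeansSketch {
  type Point = Vector[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the centroid nearest to p.
  def closest(p: Point, centroids: Vector[Point]): Int =
    centroids.indices.minBy(i => sqDist(p, centroids(i)))

  // Component-wise mean of a non-empty group of points.
  def mean(ps: Seq[Point]): Point =
    ps.reduce((a, b) => a.zip(b).map { case (x, y) => x + y }).map(_ / ps.size)

  // Returns the final centroids and the sum of squared errors.
  def kMeans(points: Vector[Point], k: Int, maxIterations: Int): (Vector[Point], Double) = {
    require(k > 0 && k <= points.size, "k must be between 1 and the number of points")
    // Assumption: deterministic start from the first k points (not random).
    var centroids = points.take(k)
    var changed = true
    var iter = 0
    while (changed && iter < maxIterations) {
      // Assignment step: group points by their nearest centroid.
      val groups = points.groupBy(p => closest(p, centroids))
      // Update step: move each centroid to the mean of its group;
      // a centroid with no points keeps its previous position.
      val updated = centroids.indices
        .map(i => groups.get(i).map(ps => mean(ps)).getOrElse(centroids(i)))
        .toVector
      changed = updated != centroids
      centroids = updated
      iter += 1
    }
    val error = points.map(p => sqDist(p, centroids(closest(p, centroids)))).sum
    (centroids, error)
  }
}
```

A call such as `KMeansSketch.kMeans(points, 3, 20)` mirrors the "three classes" step above; the iteration cap of 20 is an arbitrary choice for this sketch.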

Merge Sort Algorithm

Merge sort works on the principle of divide and conquer. It repeatedly breaks a list down into sublists until each sublist consists of a single element, and then merges those sublists in a way that results in a sorted list.

  1. The merge sort method takes a list as input and keeps splitting it at the center until the center index becomes 0

  2. The value of n is the length of the list divided by 2

  3. After the split, the input becomes two lists, the left list and the right list. The merge sort method is called on both, and the outputs of the two calls are finally combined into a single list

image
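As a rough sketch of the three steps above (not necessarily the code shown in the screenshot), the method could look like this in Scala. The names `MergeSortSketch`, `mergeSort`, and `merge` are illustrative, and the element type is assumed to be `Int`.

```scala
import scala.annotation.tailrec

object MergeSortSketch {
  // Combine two already-sorted lists into one sorted list.
  def merge(left: List[Int], right: List[Int]): List[Int] = {
    @tailrec
    def loop(l: List[Int], r: List[Int], acc: List[Int]): List[Int] = (l, r) match {
      case (Nil, _) => acc.reverse ::: r
      case (_, Nil) => acc.reverse ::: l
      case (lh :: lt, rh :: rt) =>
        // Take the smaller head; acc is built in reverse order.
        if (lh <= rh) loop(lt, r, lh :: acc) else loop(l, rt, rh :: acc)
    }
    loop(left, right, Nil)
  }

  def mergeSort(xs: List[Int]): List[Int] = {
    // n is the length of the list divided by 2 (the center index).
    val n = xs.length / 2
    if (n == 0) xs // zero or one element: already sorted
    else {
      // Split into a left and a right list, sort each, then merge.
      val (left, right) = xs.splitAt(n)
      merge(mergeSort(left), mergeSort(right))
    }
  }
}
```

For example, `MergeSortSketch.mergeSort(List(5, 2, 9, 1, 5, 6))` returns `List(1, 2, 5, 5, 6, 9)`.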

Input & Output:

image

Depth-First Search

Depth-First Search (DFS) is an algorithm for traversing or searching tree or graph data structures that uses the idea of backtracking. It explores the nodes by going forward whenever possible and backtracking otherwise, which makes it possible to determine whether there is a path between two nodes. Starting from one node, it moves to a child of that node and keeps going until it reaches another node.

  1. The input is taken as follows

image

Internally, 1 is connected to 7 and 9; 7 is connected to 1 and 8; 8 is connected to 7 and 9; and 9 is connected to 1 and 8.

image

  1. The search starts at node 1, which is passed to the DFS method. Inside DFS there is a second method, DFS0, which is a recursive function.

  2. If a node has already been visited, the function does not visit it again and moves on to the next node

image

  1. The output is the list of visited nodes, taken in reverse order

image
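The approach described above can be sketched as follows. This is an illustration under stated assumptions rather than the code from the screenshots: the object name `DfsSketch`, the `Graph` type alias, and the lowercase method names `dfs` and `dfs0` (standing in for DFS and DFS0) are chosen here, and the graph is represented as an adjacency map.

```scala
object DfsSketch {
  // Assumption: the graph is an adjacency map from a node to its neighbours.
  type Graph = Map[Int, List[Int]]

  def dfs(start: Int, g: Graph): List[Int] = {
    // Recursive helper; `visited` holds the most recently visited node first.
    def dfs0(v: Int, visited: List[Int]): List[Int] =
      if (visited.contains(v)) visited // already visited: skip this node
      else {
        val neighbours = g.getOrElse(v, Nil).filterNot(visited.contains)
        // Mark v as visited, then go deeper through each unvisited neighbour.
        neighbours.foldLeft(v :: visited)((acc, n) => dfs0(n, acc))
      }
    // Reverse so that the result lists the nodes in the order they were visited.
    dfs0(start, Nil).reverse
  }
}
```

With the connections listed earlier, `Map(1 -> List(7, 9), 7 -> List(1, 8), 8 -> List(7, 9), 9 -> List(1, 8))`, calling `DfsSketch.dfs(1, graph)` returns `List(1, 7, 8, 9)`.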