ICP_4 - Girees737/KDM_Projects GitHub Wiki

Summary of what I have learned:

In this lesson, I learned the concepts of Spark and the significance of its usage in real-world big data problems.

I have also understood the difference between DataFrames in Pandas and RDDs in Spark, and the logical way each stores data. I have learned actions like first, show, take, count, describe, etc. and transformations like map, flatMap, filter, sortByKey, etc., which are applied on DataFrames and RDDs. Importantly, I learned how we can make use of Spark on Hadoop for unstructured data like text.

IDEs Used:

Jupyter Notebook, Google Colab

Programming Languages:

Python, Java (in the background)

Libraries:

PySpark

Aim:

  1. To apply at least 3 Spark transformations on a given dataset
  2. To apply at least 3 Spark actions on a given dataset

Implementation:

  1. Imported the required libraries, set JAVA_HOME and SPARK_HOME so PySpark can reference them in the background, and created a Spark session.

  2. Loaded the CSV data as a Spark DataFrame (a sketch of both steps follows this list).
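
A minimal sketch of this setup, assuming a Colab-style environment; the JAVA_HOME/SPARK_HOME paths, the app name, and the file name `telecom_churn.csv` are placeholder assumptions, not the notebook's exact values:

```python
import os

# Assumed installation paths, not the original notebook's exact values
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.1-bin-hadoop2.7"

import findspark
findspark.init()  # make the SPARK_HOME installation importable

from pyspark.sql import SparkSession

# Create (or reuse) a Spark session
spark = SparkSession.builder.appName("ICP_4").getOrCreate()

# Load the CSV as a Spark DataFrame; file name and options are assumptions
df = spark.read.csv("telecom_churn.csv", header=True, inferSchema=True)
df.show(5)
```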

Transformations:

1. Map

Applied it on the tenure column, multiplying each value by 2; it returned the series of doubled values as a new RDD. Printed only the top 5 values for readability, as sketched below.
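
A sketch of this step, assuming the DataFrame loaded above has a numeric tenure column:

```python
# Map: double every tenure value; returns a new RDD of the doubled values
doubled = df.select("tenure").rdd.map(lambda row: row.tenure * 2)
print(doubled.take(5))  # print only the top 5 values
```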

2. FlatMap

Applied it on the tenure column with the same multiplication by 2; since each element maps to a single doubled value, the flattened output matches the map result here. Printed only the top 5 values for readability.
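
A sketch under the same assumptions; note that flatMap expects the function to return an iterable, so the doubled value is wrapped in a list, which flatMap then flattens back into a single series:

```python
# flatMap: each element maps to a one-item list, which is then flattened,
# so the output here matches the map result above
doubled_flat = df.select("tenure").rdd.flatMap(lambda row: [row.tenure * 2])
print(doubled_flat.take(5))
```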

3. Filter

Applied it on the tenure column to filter the DataFrame records whose tenure is greater than 30, and visualized the first 3 records.
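
A sketch of the filter, assuming the same column name:

```python
# Filter: keep only records whose tenure is greater than 30
filtered = df.filter(df.tenure > 30)
filtered.show(3)  # visualize the first 3 records
```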

4. OrderBy

Printed the DataFrame ordered by tenure.
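
A sketch of the ordering (ascending by default; the number of rows shown is an arbitrary choice):

```python
# OrderBy: sort the DataFrame by the tenure column
df.orderBy("tenure").show(5)
```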

Actions:

  1. Count: to get the number of records in the DataFrame
  2. Agg: to aggregate on a numerical column
  3. First: to pick the first record of the Spark DataFrame
  4. Take: to take the first n records of the Spark DataFrame (sketched below)
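
A sketch of these four actions; aggregating the average of tenure is an assumption about which numerical column and function were used:

```python
print(df.count())                 # Count: number of records in the DataFrame
df.agg({"tenure": "avg"}).show()  # Agg: aggregate on a numerical column (assumed avg of tenure)
print(df.first())                 # First: the first record of the DataFrame
print(df.take(3))                 # Take: the first n records (n = 3 here)
```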

Data Wrangling:

Implementation Video Link:
