ICP 10 - Murarishetti-Shiva-Kumar/Big-Data-Programming GitHub Wiki

Lesson Plan 10: DataFrames and SQL

Spark is written in the Scala programming language and requires the Java Virtual Machine (JVM) to run. Therefore, our first task is to install Java.

!apt-get install openjdk-8-jdk-headless -qq > /dev/null


Next, we will install Apache Spark 3.1.1 with Hadoop 2.7.

!wget -q https://www-us.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop2.7.tgz


Now, we just need to extract that archive.

!tar xf spark-3.1.1-bin-hadoop2.7.tgz


There is one last thing that we need to install and that is the findspark library. It will locate Spark on the system and import it as a regular library.

!pip install -q findspark


Now that we have installed all the necessary dependencies in Colab, it is time to set the environment path. This will enable us to run Pyspark in the Colab environment.

import os

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop2.7"


We need to locate Spark in the system. For that, we import findspark and use the findspark.init() method.

import findspark

findspark.init()


If you want to know the location where Spark is installed, use findspark.find().


Now, we can import SparkSession from pyspark.sql and create a SparkSession, which is the entry point to Spark.

You can give a name to the session using appName() and add some configurations with config() if you wish.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local") \
    .appName("BDP ICP10") \
    .config('spark.ui.port', '4050') \
    .getOrCreate()


Print the SparkSession variable to confirm the session was created.

spark


If you want to view the Spark UI, you will need a few more lines of code to create a public URL for the UI page; here we use ngrok to tunnel the port.

!wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip

!unzip ngrok-stable-linux-amd64.zip

get_ipython().system_raw('./ngrok http 4050 &')

!curl -s http://localhost:4040/api/tunnels


Part – 1

1. Import the dataset and create data frames directly on import.


2. Save data to file.


3. Check for Duplicate records in the dataset.


Checking whether any duplicates are present.


4. Apply Union operation on the dataset and order the output by Country Name alphabetically.

Creating two tables based on gender (Male, Female) and merging them with a union operation, then ordering the result by Country Name.


5. Use Groupby Query based on treatment.


Part – 2

1. Apply the basic queries related to Joins and aggregate functions (at least 2)

Join Function

  1. Selecting the people from male and female tables joined by country


  2. Selecting the people from male and female tables joined by State


Aggregate Function

Calculating sum of age of all the people


2. Write a query to fetch 13th Row in the dataset.

Identifying the 13th row of the DataFrame using limit(13) and then displaying it.


The 13th row in the CSV file.


Mounting Google Drive to push the generated CSV files to Drive.
