SPARK DATA FRAMES
DATA FRAMES and SQL over DATA-FRAMES
Aim:
To import a CSV file, create a DataFrame from it, and implement various aggregate and groupBy operations on the DataFrame.
Introduction to Data-Frames:
In Spark, DataFrames are distributed collections of data organized into rows and columns. Each column in a DataFrame has a name and an associated type. DataFrames are similar to traditional database tables in that they are structured and concise; we can think of them as relational tables backed by better optimization techniques.
Spark DataFrames can be created from various sources, such as Hive tables, log tables, external databases, or existing RDDs, and they allow the processing of very large amounts of data.
Apache Spark 1.3 introduced the DataFrame API to address the performance and scaling limitations that arise when working with RDDs.
When memory or disk space runs low, RDDs degrade because they exhaust the available storage. RDDs also have no concept of a schema (the structure that describes the data being stored), so structured and unstructured data end up stored together, which is not very efficient.
Because Spark knows nothing about the structure of the data inside an RDD, it cannot optimize RDD operations automatically, and errors in RDD-based code are difficult to debug at runtime. RDDs also store data as a collection of Java objects, which adds garbage-collection overhead.
Some of the unique features of DataFrames are:
- Input optimization engine: DataFrames use Spark's Catalyst Optimizer to process data efficiently, and the same engine serves the Python, Java, Scala, and R DataFrame APIs.
- Handling of structured data: DataFrames provide a schematic view of the data, so the data carries meaning as it is stored.
- Custom memory management: Where RDDs keep data on the JVM heap, DataFrames can store data off-heap (outside the main Java heap space, but still in RAM), which reduces garbage-collection overhead.
- Flexibility: Like RDDs, DataFrames support many data sources and formats, such as CSV and Cassandra.
- Scalability: DataFrames integrate with various other Big Data tools and can process anything from megabytes to petabytes of data.
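As a minimal, self-contained sketch of this schema-aware view (the column names and the local SparkSession settings here are illustrative assumptions, not part of the original write-up):

```scala
import org.apache.spark.sql.SparkSession

object DataFrameIntro extends App {
  // Local SparkSession just for this sketch.
  val spark = SparkSession.builder()
    .appName("DataFrameIntro")
    .master("local[*]")
    .getOrCreate()

  import spark.implicits._

  // A DataFrame built from an in-memory sequence; every column has a name and a type.
  val people = Seq(("Alice", 34), ("Bob", 45)).toDF("name", "age")

  // Unlike an RDD, the schema is visible to Spark's Catalyst optimizer.
  people.printSchema()
  people.show()

  spark.stop()
}
```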
Tools Used:
- Apache Spark
- Intellij
- Scala
Implementation:
Task1:
Downloaded a health-care dataset in CSV format and opened it in IntelliJ.
Created a DataFrame from the CSV file and wrote a copy of the DataFrame out to a new file, as sketched below.
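A minimal sketch of this step, assuming the dataset lives locally at data/health_care.csv (the path and the header/schema-inference options are assumptions; the original code may differ). The later sketches reuse this `spark` session and `df` DataFrame.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("HealthCareDataFrames")
  .master("local[*]")
  .getOrCreate()

// Read the CSV file into a DataFrame, using the first row as the header
// and inferring column types from the data.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("data/health_care.csv")

// Write a copy of the DataFrame back out as CSV.
df.write
  .option("header", "true")
  .mode("overwrite")
  .csv("data/health_care_copy")

df.show(5)
```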
Output:
Creating a copy of the DataFrame:
Getting duplicate data:
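One way to surface duplicate rows (a sketch; the original may have used dropDuplicates or a different approach) is to group on every column and keep the groups that occur more than once:

```scala
import org.apache.spark.sql.functions.{col, count}

// Group on all columns and keep rows that appear more than once.
val duplicates = df
  .groupBy(df.columns.map(col): _*)
  .agg(count("*").as("cnt"))
  .filter(col("cnt") > 1)

duplicates.show()
```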
Output (duplicates):
To clean up the Gender column and separate the gender values, created a UDF, applied it to the DataFrame, grouped by the cleaned Gender and Country columns, and built a new DataFrame with these results.
Ordered the resulting DataFrame by country, as shown in the sketch below.
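A sketch of the cleanup, grouping, and ordering described above. The mapping of raw gender strings and the exact column names ("Gender", "Country") are assumptions about the dataset; the real UDF may normalize values differently.

```scala
import org.apache.spark.sql.functions.{col, count, udf}

// Hypothetical UDF that normalizes free-text gender values.
val cleanGender = udf { raw: String =>
  Option(raw).map(_.trim.toLowerCase) match {
    case Some("m") | Some("male")   => "Male"
    case Some("f") | Some("female") => "Female"
    case _                          => "Other"
  }
}

// Apply the UDF, group by the cleaned Gender and Country, and order by Country.
val genderByCountry = df
  .withColumn("Gender", cleanGender(col("Gender")))
  .groupBy(col("Gender"), col("Country"))
  .agg(count("*").as("total"))
  .orderBy(col("Country"))

genderByCountry.show()
```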
Output:
Gender using UDF:
Created a new DataFrame with a groupBy operation on the "Treatment" column of the original DataFrame df.
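A sketch, assuming the dataset exposes a "Treatment" column as described:

```scala
import org.apache.spark.sql.functions.count

// Count how many records fall under each Treatment value.
val byTreatment = df
  .groupBy("Treatment")
  .agg(count("*").as("total"))

byTreatment.show()
```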
Task2:
Created a DataFrame with the total number of males from each country and their cities, based on the original DataFrame df.
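A sketch of this aggregation; the "Gender", "Country", and "City" column names are assumptions about the dataset:

```scala
import org.apache.spark.sql.functions.{col, count}

// Total number of male respondents per country and city.
val malesByCountryAndCity = df
  .filter(col("Gender") === "Male")
  .groupBy("Country", "City")
  .agg(count("*").as("total_males"))

malesByCountryAndCity.show()
```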
Created a DataFrame containing males older than 35 from the DataFrame df.
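A sketch, again assuming "Gender" and "Age" columns:

```scala
import org.apache.spark.sql.functions.col

// Male respondents older than 35.
val malesOver35 = df.filter(col("Gender") === "Male" && col("Age") > 35)

malesOver35.show()
```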
Created a DataFrame by fetching the 13th row of the DataFrame df.
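DataFrames have no positional index, so one simple way to fetch the 13th row (in the DataFrame's current ordering) is to take the first 13 rows and keep the last one; the original may instead have used rdd.zipWithIndex or a window function.

```scala
// Take the first 13 rows and keep the last; without an explicit orderBy,
// the row returned depends on the DataFrame's current ordering.
val thirteenthRow = df.take(13).last
println(thirteenthRow)
```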
Limitations:
- With DataFrames you lose type safety; depending on the language you are using, this can be a significant drawback.
- Operations are not checked against your domain types at compile time; instead they are built and optimized using the information available in the DataFrame's schema.
Conclusion:
DataFrames are among the most powerful APIs in Apache Spark and the wider Big Data ecosystem, making structured data processing concise while letting Spark optimize execution automatically.
References:
- https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package
- https://www.scala-lang.org/api/current/
- https://spark.apache.org/docs/latest/sql-programming-guide.html