Interview Questions 3

**What is the SMALL FILE PROBLEM in Big Data? How does Delta Lake provide a more optimal solution?**

Data is often written in very small files and directories. This data may be spread across a data center or even across the world (that is, not co-located). The result is that a query on this data may be very slow due to:

  • Network Latency
  • Volume of file metadata

The solution is to compact many small files into one larger file. Delta Lake has a mechanism for compacting small files:

  • Delta Lake supports the Optimize operation, which performs file compaction.
  • Small files are compacted together into new larger files up to 1GB.
  • The 1GB size was determined by the Databricks optimization team as a trade-off between query speed and run-time performance when running Optimize.
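
As an illustration, here is a minimal sketch of triggering file compaction with the open-source Delta Lake Python API (assuming delta-spark 2.x, a SparkSession already configured for Delta, and a hypothetical table path):

```python
from delta.tables import DeltaTable

# Assumes `spark` is a SparkSession configured with Delta Lake support,
# and that a Delta table already exists at this hypothetical path.
delta_table = DeltaTable.forPath(spark, "/tmp/delta/events")

# Bin-packing compaction: rewrites many small files into fewer larger files.
delta_table.optimize().executeCompaction()
```

Equivalently, the SQL statement `OPTIMIZE <table>` performs the same compaction.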

**OUTER Join in Spark**

An OUTER join combines rows from both DataFrames, whether or not the 'on' columns match:

  • If there is a match, a single combined row is created.
  • If there is no match, the missing columns for that row are filled with null values.

  • on = [key columns]
  • how = [join type]

df1.join(df2, on=['id'], how='outer')
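
As a quick illustration, a minimal self-contained sketch with made-up data (row order in the output may vary):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "left_val"])
df2 = spark.createDataFrame([(2, "x"), (3, "y")], ["id", "right_val"])

# Outer join keeps every id from both sides; unmatched columns become null.
df1.join(df2, on=["id"], how="outer").show()
# id=1 -> left_val="a", right_val=null
# id=2 -> left_val="b", right_val="x"
# id=3 -> left_val=null, right_val="y"
```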

**UPPER, LOWER, LENGTH & SPLIT functions in PySpark**

  1. UPPER - To convert the column values to uppercase
  2. LOWER - To convert the column values to lowercase
  3. LENGTH - To get the length of the column values
  4. SPLIT - To split the column values on a separator (in the example below, a blank space is the separator)
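
A minimal sketch of the four functions, using a hypothetical single-column DataFrame of names:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import upper, lower, length, split

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("John Doe",), ("Jane Roe",)], ["name"])

df.select(
    upper("name").alias("upper_name"),      # e.g. "JOHN DOE"
    lower("name").alias("lower_name"),      # e.g. "john doe"
    length("name").alias("name_length"),    # e.g. 8
    split("name", " ").alias("name_parts"), # e.g. ["John", "Doe"]
).show(truncate=False)
```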

**How can we RENAME a COLUMN?**

  1. Use the withColumnRenamed method, as shown below.
  2. First parameter - the existing column name
  3. Second parameter - the new column name
df.withColumnRenamed("id", "newId")
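
A short self-contained sketch with hypothetical data; note that withColumnRenamed returns a new DataFrame and leaves the original unchanged, and it is a no-op if the existing column name is not found:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

renamed = df.withColumnRenamed("id", "newId")
renamed.printSchema()
# root
#  |-- newId: long (nullable = true)
#  |-- value: string (nullable = true)
```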