Spark Ques Ans - ayushmathur94/Spark GitHub Wiki
Data is often written as many very small files and directories. This data may also be spread across a datacenter or even across the world (i.e., not co-located). As a result, a query on this data may be very slow due to:
- Network Latency
- Volume of file metadata
The solution is to compact many small files into one larger file. Delta Lake has a mechanism for compacting small files:
- Delta Lake supports the Optimize operation, which performs file compaction.
- Small files are compacted together into new larger files of up to 1 GB.
- The 1 GB target size was determined by the Databricks optimization team as a trade-off between query speed and the run-time cost of running Optimize.
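The core idea behind compaction is bin-packing: group many small files into batches that approach the target size. Below is a minimal plain-Python sketch of that idea (the file names and sizes are hypothetical; real Optimize rewrites Parquet files tracked in the Delta transaction log):

```python
# Greedy bin-packing sketch of file compaction.
# Hypothetical file names/sizes; not Delta Lake's actual algorithm.

TARGET_BYTES = 1 * 1024**3  # 1 GB compaction target

def plan_compaction(file_sizes):
    """Group small files into bins whose total size stays under TARGET_BYTES."""
    bins, current, current_size = [], [], 0
    for name, size in sorted(file_sizes.items(), key=lambda kv: kv[1]):
        if current and current_size + size > TARGET_BYTES:
            bins.append(current)          # close the full bin
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        bins.append(current)
    return bins

small_files = {"part-0": 200 * 1024**2, "part-1": 300 * 1024**2,
               "part-2": 700 * 1024**2, "part-3": 100 * 1024**2}
print(plan_compaction(small_files))
# → [['part-3', 'part-0', 'part-1'], ['part-2']]
```

Each inner list represents one set of small files that would be rewritten as a single larger file.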
An OUTER join combines data from both DataFrames, whether or not the 'on' columns match.

Syntax: df1.join(df2, on=['id'], how='outer'), where on is the list of key columns and how is the join type.
- If there is a match, a single combined row is created.
- If there is no match, missing columns for that row are filled with null values.
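Since running PySpark requires a Spark session, the outer-join semantics described above can be sketched in plain Python (the rows and column names are hypothetical):

```python
# Plain-Python sketch of a full outer join on 'id'.
# Assumes unique keys per side; hypothetical rows.

df1 = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
df2 = [{"id": 2, "age": 30}, {"id": 3, "age": 40}]

def outer_join(left, right, key):
    left_by_key = {row[key]: row for row in left}
    right_by_key = {row[key]: row for row in right}
    all_cols = {c for row in left + right for c in row}
    result = []
    for k in sorted(left_by_key.keys() | right_by_key.keys()):
        row = {c: None for c in all_cols}   # unmatched columns -> null
        row.update(left_by_key.get(k, {}))
        row.update(right_by_key.get(k, {}))
        result.append(row)
    return result

for row in outer_join(df1, df2, "id"):
    print(row)
```

The row with id=1 gets age=None and the row with id=3 gets name=None, mirroring how an outer join fills missing columns with nulls.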
- upper - Converts the column value to uppercase.
- lower - Converts the column value to lowercase.
- length - Returns the length of the column value.
- split - Splits the column value into an array based on a separator.
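In PySpark these come from pyspark.sql.functions; the plain-Python equivalents below show what each one does to a column of values (the sample data is hypothetical):

```python
# Plain-Python equivalents of the PySpark column functions
# upper, lower, length, and split, applied to a sample column.
values = ["Hello World"]

upper_vals = [v.upper() for v in values]       # uppercase each value
lower_vals = [v.lower() for v in values]       # lowercase each value
length_vals = [len(v) for v in values]         # length of each value
split_vals = [v.split(" ") for v in values]    # split on a separator

print(upper_vals, lower_vals, length_vals, split_vals)
```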
- Using the withColumnRenamed method.
- First parameter - Existing Column Name
- Second parameter - New Column Name
df.withColumnRenamed("id", "newId")
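The semantics of withColumnRenamed can be sketched in plain Python: produce a new dataset where one column key is renamed and everything else is unchanged (the row data is hypothetical):

```python
# Plain-Python sketch of withColumnRenamed:
# return new rows with `existing` renamed to `new`, other columns untouched.

def with_column_renamed(rows, existing, new):
    return [{(new if col == existing else col): val for col, val in row.items()}
            for row in rows]

df = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
print(with_column_renamed(df, "id", "newId"))
# → [{'newId': 1, 'name': 'a'}, {'newId': 2, 'name': 'b'}]
```

Like the PySpark method, this returns a new dataset rather than mutating the original.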