Interview Questions 3
**What is the SMALL FILE PROBLEM in Big Data? How does Delta Lake provide a more optimal solution?**
Data is often written in very small files and directories. This data may be spread across a data center or even across the world (that is, not co-located). The result is that a query on this data may be very slow due to:
- Network Latency
- Volume of file metadata
The solution is to compact many small files into fewer, larger files. Delta Lake has a mechanism for compacting small files:
- Delta Lake supports the Optimize operation, which performs file compaction.
- Small files are compacted together into new larger files up to 1GB.
- The 1GB size was determined by the Databricks optimization team as a trade-off between query speed and run-time performance when running Optimize.
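As a minimal sketch of how compaction is triggered (assuming Delta Lake 2.x, an existing `spark` session, and a hypothetical table path):

```python
from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "/tmp/delta/events")

# Compact small files into larger ones (bounded by the target file size)
delta_table.optimize().executeCompaction()

# Equivalent SQL form
spark.sql("OPTIMIZE delta.`/tmp/delta/events`")
```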
**OUTER Join in Spark**
An OUTER join combines rows from both dataframes, whether or not the 'on' columns match:
- If there is a match, the rows from both dataframes are combined into one row.
- If there is no match, the missing columns for that row are filled with null values.
- on = [key columns]
- how = [join type]

```python
df1.join(df2, on=['id'], how='outer')
```
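For illustration, a minimal self-contained sketch (the DataFrame contents are made up) showing how unmatched rows end up with nulls:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("outer-join-example").getOrCreate()

df1 = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
df2 = spark.createDataFrame([(2, "HR"), (3, "IT")], ["id", "dept"])

# id=2 exists in both, so its rows are combined into one;
# ids 1 and 3 only exist on one side, so the other side is null
df1.join(df2, on=["id"], how="outer").show()
# +---+-----+----+
# | id| name|dept|
# +---+-----+----+
# |  1|Alice|null|
# |  2|  Bob|  HR|
# |  3| null|  IT|
# +---+-----+----+
# (row order may differ)
```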
**UPPER, LOWER, LENGTH & SPLIT functions in PySpark**
- UPPER - Converts the column's values to uppercase
- LOWER - Converts the column's values to lowercase
- LENGTH - Returns the length of the column's values
- SPLIT - Splits the column's values on a separator (in the example below, a blank space)
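A minimal sketch of these four functions together (with made-up data):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import upper, lower, length, split

spark = SparkSession.builder.appName("string-functions").getOrCreate()

df = spark.createDataFrame([("John Smith",), ("Jane Doe",)], ["name"])

df.select(
    upper(df.name).alias("upper_name"),      # e.g. JOHN SMITH
    lower(df.name).alias("lower_name"),      # e.g. john smith
    length(df.name).alias("name_length"),    # e.g. 10
    split(df.name, " ").alias("name_parts")  # e.g. [John, Smith]
).show(truncate=False)
```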
**How can we RENAME a COLUMN?**
Using the `withColumnRenamed` method:
- First parameter - Existing column name
- Second parameter - New column name
```python
df.withColumnRenamed("id", "newId")
```
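As a quick usage sketch (DataFrame contents are made up), the call returns a new DataFrame with the updated schema rather than modifying the original:

```python
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

renamed = df.withColumnRenamed("id", "newId")
renamed.printSchema()  # columns: newId, value
```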