Interview Questions 3

**What is the SMALL FILE PROBLEM in Big Data? How does Delta Lake provide a more optimal solution?**

Data is often written in very small files and directories. This data may be spread across a data center or even across the world (that is, not co-located). The result is that a query on this data may be very slow due to:

  • Network Latency
  • Volume of file metadata

The solution is to compact many small files into one larger file. Delta Lake has a mechanism for compacting small files:

  • Delta Lake supports the Optimize operation, which performs file compaction.
  • Small files are compacted together into new larger files up to 1GB.
  • The 1GB size was determined by the Databricks optimization team as a trade-off between query speed and run-time performance when running Optimize.
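
As an illustration, here is a minimal sketch of triggering file compaction with the open-source Delta Lake Python API (assuming delta-spark 2.x, a SparkSession already configured for Delta, and a hypothetical table path):

```python
from delta.tables import DeltaTable

# Assumes `spark` is a SparkSession configured with Delta Lake support,
# and that a Delta table already exists at this hypothetical path.
delta_table = DeltaTable.forPath(spark, "/tmp/delta/events")

# Bin-packing compaction: rewrites many small files into fewer larger files.
delta_table.optimize().executeCompaction()
```

Equivalently, the SQL statement `OPTIMIZE <table>` performs the same compaction.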

**OUTER Join in Spark**

An OUTER join combines rows from both DataFrames, whether or not the 'on' columns match:

  • If there is a match, a single combined row is created.
  • If there is no match, the missing columns for that row are filled with null values.

  • on = [key columns]
  • how = [join type]

df1.join(df2, on=['id'], how='outer')
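
As a quick illustration, a minimal self-contained sketch with made-up data (row order in the output may vary):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "left_val"])
df2 = spark.createDataFrame([(2, "x"), (3, "y")], ["id", "right_val"])

# Outer join keeps every id from both sides; unmatched columns become null.
df1.join(df2, on=["id"], how="outer").show()
# id=1 -> left_val="a", right_val=null
# id=2 -> left_val="b", right_val="x"
# id=3 -> left_val=null, right_val="y"
```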

**UPPER, LOWER, LENGTH & SPLIT functions in PySpark**

  1. UPPER - To convert the column values to uppercase
  2. LOWER - To convert the column values to lowercase
  3. LENGTH - To get the length of the column values
  4. SPLIT - To split the column values on a separator (in the example below, a blank space is the separator)
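
A minimal sketch of the four functions, using a hypothetical single-column DataFrame of names:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import upper, lower, length, split

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("John Doe",), ("Jane Roe",)], ["name"])

df.select(
    upper("name").alias("upper_name"),      # e.g. "JOHN DOE"
    lower("name").alias("lower_name"),      # e.g. "john doe"
    length("name").alias("name_length"),    # e.g. 8
    split("name", " ").alias("name_parts"), # e.g. ["John", "Doe"]
).show(truncate=False)
```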

**How can we RENAME a COLUMN?**

  1. Use the withColumnRenamed method, as shown below.
  2. First parameter - the existing column name
  3. Second parameter - the new column name
df.withColumnRenamed("id", "newId")
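
A short self-contained sketch with hypothetical data; note that withColumnRenamed returns a new DataFrame and leaves the original unchanged, and it is a no-op if the existing column name is not found:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

renamed = df.withColumnRenamed("id", "newId")
renamed.printSchema()
# root
#  |-- newId: long (nullable = true)
#  |-- value: string (nullable = true)
```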