Spark Ques Ans - ayushmathur94/Spark GitHub Wiki

What is the SMALL FILE PROBLEM in Big Data? How does Delta Lake provide a more optimal solution?

Data is often written as many very small files and directories. This data may be spread across a datacenter or even across the world (i.e. not co-located). As a result, a query on this data may be very slow due to:

  1. Network Latency
  2. Volume of file metadata

The solution is to compact many small files into one larger file. Delta Lake has a mechanism for compacting small files:

  • Delta Lake supports the OPTIMIZE operation, which performs file compaction.
  • Small files are compacted together into new, larger files up to 1 GB.
  • The 1 GB size was determined by the Databricks optimization team as a trade-off between query speed and the run-time cost of running OPTIMIZE.
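The compaction described above is triggered with the OPTIMIZE command. A minimal sketch, assuming a Delta table named `events` is registered in the metastore (the table and column names are illustrative):

```sql
-- Rewrite the table's small files into larger ones (up to ~1 GB each).
OPTIMIZE events;

-- Compaction can also be scoped to a subset of partitions with a predicate.
OPTIMIZE events WHERE event_date >= '2024-01-01';
```

The same command can be run from PySpark via `spark.sql("OPTIMIZE events")` on a Delta-enabled Spark session.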

What is an Outer Join?

An OUTER join combines data from both DataFrames, irrespective of whether the 'on' column values match.

on = [key columns]
how = [join type]

df1.join(df2, on=['id'], how='outer')
  • If there is a match, a single combined row is created.
  • If there is no match, the missing columns for that row are filled with null values.

What are the upper, lower, length, and split functions in PySpark?

  1. upper - Converts the column's values to uppercase.
  2. lower - Converts the column's values to lowercase.
  3. length - Returns the length of the column's values.
  4. split - Splits the column's values on a separator.

How can we rename a column?

  • Use the withColumnRenamed method.
  • First parameter: the existing column name.
  • Second parameter: the new column name.
 

df.withColumnRenamed("id", "newId")
