Spark Ques Ans - ayushmathur94/Spark GitHub Wiki
Data is often written as many very small files and directories. This data may also be spread across a datacenter or even across the world (i.e., not co-located). As a result, a query on this data may be very slow due to:
- Network Latency
- Volume of file metadata
The solution is to compact many small files into one larger file. Delta Lake has a mechanism for compacting small files:
- Delta Lake supports the Optimize operation, which performs file compaction.
- Small files are compacted together into new larger files of up to 1 GB.
- The 1 GB target size was determined by the Databricks optimization team as a trade-off between query speed and the run-time cost of running Optimize.
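The core idea behind compaction is bin-packing: group many small files into batches that approach the target size. Below is a minimal plain-Python sketch of that idea (the file names and sizes are hypothetical; real Optimize rewrites Parquet files tracked in the Delta transaction log):

```python
# Greedy bin-packing sketch of file compaction.
# Hypothetical file names/sizes; not Delta Lake's actual algorithm.

TARGET_BYTES = 1 * 1024**3  # 1 GB compaction target

def plan_compaction(file_sizes):
    """Group small files into bins whose total size stays under TARGET_BYTES."""
    bins, current, current_size = [], [], 0
    for name, size in sorted(file_sizes.items(), key=lambda kv: kv[1]):
        if current and current_size + size > TARGET_BYTES:
            bins.append(current)          # close the full bin
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        bins.append(current)
    return bins

small_files = {"part-0": 200 * 1024**2, "part-1": 300 * 1024**2,
               "part-2": 700 * 1024**2, "part-3": 100 * 1024**2}
print(plan_compaction(small_files))
# → [['part-3', 'part-0', 'part-1'], ['part-2']]
```

Each inner list represents one set of small files that would be rewritten as a single larger file.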
An OUTER join combines data from both DataFrames, whether or not the 'on' columns match.

Syntax: df1.join(df2, on=['id'], how='outer'), where on is the list of key columns and how is the join type.
- If there is a match, a single combined row is created.
- If there is no match, missing columns for that row are filled with null values.
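Since running PySpark requires a Spark session, the outer-join semantics described above can be sketched in plain Python (the rows and column names are hypothetical):

```python
# Plain-Python sketch of a full outer join on 'id'.
# Assumes unique keys per side; hypothetical rows.

df1 = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
df2 = [{"id": 2, "age": 30}, {"id": 3, "age": 40}]

def outer_join(left, right, key):
    left_by_key = {row[key]: row for row in left}
    right_by_key = {row[key]: row for row in right}
    all_cols = {c for row in left + right for c in row}
    result = []
    for k in sorted(left_by_key.keys() | right_by_key.keys()):
        row = {c: None for c in all_cols}   # unmatched columns -> null
        row.update(left_by_key.get(k, {}))
        row.update(right_by_key.get(k, {}))
        result.append(row)
    return result

for row in outer_join(df1, df2, "id"):
    print(row)
```

The row with id=1 gets age=None and the row with id=3 gets name=None, mirroring how an outer join fills missing columns with nulls.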
- upper - Converts the column value to uppercase.
- lower - Converts the column value to lowercase.
- length - Returns the length of the column value.
- split - Splits the column value into an array based on a separator.
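In PySpark these come from pyspark.sql.functions; the plain-Python equivalents below show what each one does to a column of values (the sample data is hypothetical):

```python
# Plain-Python equivalents of the PySpark column functions
# upper, lower, length, and split, applied to a sample column.
values = ["Hello World"]

upper_vals = [v.upper() for v in values]       # uppercase each value
lower_vals = [v.lower() for v in values]       # lowercase each value
length_vals = [len(v) for v in values]         # length of each value
split_vals = [v.split(" ") for v in values]    # split on a separator

print(upper_vals, lower_vals, length_vals, split_vals)
```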
- Using the withColumnRenamed method.
- First parameter - Existing Column Name
- Second parameter - New Column Name
df.withColumnRenamed("id", "newId")
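The semantics of withColumnRenamed can be sketched in plain Python: produce a new dataset where one column key is renamed and everything else is unchanged (the row data is hypothetical):

```python
# Plain-Python sketch of withColumnRenamed:
# return new rows with `existing` renamed to `new`, other columns untouched.

def with_column_renamed(rows, existing, new):
    return [{(new if col == existing else col): val for col, val in row.items()}
            for row in rows]

df = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
print(with_column_renamed(df, "id", "newId"))
# → [{'newId': 1, 'name': 'a'}, {'newId': 2, 'name': 'b'}]
```

Like the PySpark method, this returns a new dataset rather than mutating the original.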