Optimize Spark SQL Joins - ignacio-alorre/Spark GitHub Wiki

Spark uses two types of cluster communication strategies:

  • node-node communication strategy: Spark shuffles the data across the cluster
  • per-node communication strategy: Spark performs broadcast joins

The performance of Spark joins depends on the strategy used to tackle each scenario, which in turn relies on the size of the tables. Sort Merge join and Shuffle Hash join are the two major methods. Broadcast joins are the most preferable and efficient option because they are based on the per-node communication strategy, which avoids shuffles, but they are applicable only when one of the datasets is small.
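The per-node idea can be sketched in plain Python (an illustrative model, not Spark internals): the small table is replicated to every node, so each partition of the large table is joined locally, with no shuffle.

```python
# Illustrative sketch of a broadcast join: the small table is copied to
# every executor, and each partition of the large table joins against
# its local copy. All names here are hypothetical.

small_table = {1: "US", 2: "ES", 3: "FR"}  # broadcast to every node

# Two partitions of the large table, as (key, amount) rows
large_partitions = [
    [(1, 100), (3, 250)],
    [(2, 75), (1, 30)],
]

def map_side_join(partition, broadcast):
    # Each partition joins locally; no data moves between partitions
    return [(k, v, broadcast[k]) for k, v in partition if k in broadcast]

joined = [row for part in large_partitions
          for row in map_side_join(part, small_table)]
# joined == [(1, 100, 'US'), (3, 250, 'FR'), (2, 75, 'ES'), (1, 30, 'US')]
```

Because only the small table is copied, this is cheap exactly when that table fits comfortably in each node's memory.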

Sort-Merge Join is composed of two steps:

  • 1st: Sort the datasets.
  • 2nd: Merge the sorted data within each partition by iterating over the elements and, according to the join key, joining the rows which have the same value.
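The two steps above can be sketched in plain Python (a simplified single-partition model, not Spark's implementation):

```python
def sort_merge_join(left, right):
    """Join two lists of (key, value) pairs on their key."""
    # 1st step: sort both datasets by the join key
    left, right = sorted(left), sorted(right)
    out, j = [], 0
    # 2nd step: merge by advancing through both sorted lists in lockstep
    for key, lval in left:
        # Skip right-side rows with smaller keys
        while j < len(right) and right[j][0] < key:
            j += 1
        # Emit one row per right-side match for this key
        k = j
        while k < len(right) and right[k][0] == key:
            out.append((key, lval, right[k][1]))
            k += 1
    return out

result = sort_merge_join([(2, "b"), (1, "a"), (3, "c")],
                         [(1, "x"), (2, "y"), (2, "z")])
# result == [(1, 'a', 'x'), (2, 'b', 'y'), (2, 'b', 'z')]
```

The merge phase only ever moves forward through the sorted inputs, which is why the sorted data can be streamed (and spilled to disk) rather than held entirely in memory.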

To accomplish ideal performance in Sort Merge Join:

  • Make sure the partitions have been co-located. Otherwise, shuffle operations will be required to co-locate the data, since the join has the pre-requirement that all rows having the same value for the join key be stored in the same partition
  • The DataFrame should be distributed uniformly on the joining columns
  • To leverage parallelism the DataFrame should have an adequate number of unique keys
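The co-location requirement can be illustrated with a simple hash partitioner (an assumed model; Spark uses its own partitioners): every row with the same join key lands in the same partition, and the number of distinct keys bounds how many partitions can receive work in parallel.

```python
# Hypothetical hash partitioner: co-locates rows by join key.
def hash_partition(rows, num_partitions):
    parts = [[] for _ in range(num_partitions)]
    for key, value in rows:
        # Same key => same hash => same partition, always
        parts[hash(key) % num_partitions].append((key, value))
    return parts

rows = [(k, i) for i, k in enumerate(["a", "b", "a", "c", "b", "a"])]
parts = hash_partition(rows, 4)
# Only 3 distinct keys exist, so at most 3 of the 4 partitions get rows,
# and all rows sharing a key sit together in one partition.
```

With only three unique keys, a fourth partition can never receive data, which is why a DataFrame needs an adequate number of unique, uniformly distributed keys to keep all tasks busy.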

Shuffle Hash Join works on the concept of map-reduce: map over the DataFrames using the values of the join column as output keys, shuffle the DataFrames based on those keys, and join the DataFrames in the reduce phase, as rows from the different DataFrames with the same key end up on the same machine.

Creating a hash table is costly, and it is feasible only when the average size of a single partition is small enough to build a hash table from it in memory.
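A minimal sketch of the whole sequence, again in plain Python rather than Spark internals: both sides are co-partitioned by key (the shuffle), then within each partition a hash table is built from the smaller (build) side and probed with the larger (stream) side.

```python
def shuffle_hash_join(build, stream, num_partitions=2):
    """Join two lists of (key, value) pairs; `build` is the smaller side."""
    # Shuffle phase: co-partition both sides with the same hash function
    def partition(rows):
        parts = [[] for _ in range(num_partitions)]
        for key, value in rows:
            parts[hash(key) % num_partitions].append((key, value))
        return parts

    out = []
    for bpart, spart in zip(partition(build), partition(stream)):
        # Reduce phase: build a hash table from the build side...
        table = {}
        for key, value in bpart:
            table.setdefault(key, []).append(value)
        # ...and probe it with the stream side
        for key, value in spart:
            for bval in table.get(key, []):
                out.append((key, bval, value))
    return out

result = shuffle_hash_join([(1, "a"), (2, "b")],
                           [(1, "x"), (2, "y"), (3, "z")])
```

Note that only the build side's partition must fit in memory as a hash table; the stream side is merely iterated, which is why the strategy pays off when the build side is clearly the smaller one.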

Sort merge join is a very good candidate most of the time, as it can spill data to disk and doesn't need to hold the data in memory like its counterpart, Shuffle Hash join. However, when the build size is smaller than the stream size, Shuffle Hash join will outperform Sort Merge join.

The internal parameter that controls this preference is:

spark.sql.join.preferSortMergeJoin = false
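A minimal PySpark sketch of setting this parameter at session creation (the application name is hypothetical); with the preference disabled, Spark may choose Shuffle Hash join when its other conditions are met:

```python
from pyspark.sql import SparkSession

# Disable the default preference for sort-merge join
spark = (SparkSession.builder
         .appName("join-strategy-demo")  # hypothetical app name
         .config("spark.sql.join.preferSortMergeJoin", "false")
         .getOrCreate())
```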

Sources