postgres join - ghdrako/doc_snipets GitHub Wiki

Nested Loop Join

  • no parallel-aware mode
 for x in table1:
 for y in table2:
 if x.field == y.field
   issue row
 else
   keep doing

Nested loops are often used if one of the sides is very small and contains only a limited set of data. A nested loop is generally O(n*n), so it’s only efficient if one side of the join is very small.

Merge Join

If both sides of the join are sorted, the system can just take the rows from the top, see whether they match, and return them. The main requirement here is that the lists are sorted.

 Merge join
 Sort table 1
 Sequential scan table 1
 Sort table 2
 Sequential scan table 2

To join these two tables (table 1 and table 2), data must be provided in sorted order. In many cases, PostgreSQL will just sort the data. However, there are other options we can use to provide the join with sorted data. One way is to consult an index, as shown in the following example:

 Merge join
 Index scan table 1
 Index scan table 2

One side of the join, or both sides, can use sorted data coming from lower levels of the plan. If the table is accessed directly, an index is an obvious choice for this, but only if the returned result set is significantly smaller than the entire table. Otherwise, we’ll encounter almost double the overhead because we have to read the entire index and then the entire table. If the result set is a large portion of the table, a sequential scan is more efficient, especially if it’s accessed in the primary key order.

The beauty of a merge join is that it can handle a lot of data. The downside is that data has to be sorted or taken from an index at some point. Sorting is O(n * log(n)). Therefore, sorting 300 million rows to perform the join isn’t attractive either.

hash join

Think of Hash Join as a kind of Nested Loop Join that builds its own index up front every time, which makes it good for joins against things you don't have an index for or joins against whole tables where sequential access beats random access.

 Hash join
 Sequential scan table 1
 Sequential scan table 2
 Both sides can be hashed, and the hash keys can be compared, leaving us with the result of the join. 

The problem here is that all of the values have to be hashed and stored somewhere.

parallel hash join 11+
enable_parallel_hash = [on|off]