postgres join - ghdrako/doc_snipets GitHub Wiki
Nested Loop Join
- no parallel-aware mode
for x in table1:
for y in table2:
if x.field == y.field
issue row
else
keep doing
Nested loops are often used if one of the sides is very small and contains only a limited set of data. A nested loop is generally O(n*n)
, so it’s only efficient if one side of the join is very small.
Merge Join
If both sides of the join are sorted, the system can just take the rows from the top, see whether they match, and return them. The main requirement here is that the lists are sorted.
Merge join
Sort table 1
Sequential scan table 1
Sort table 2
Sequential scan table 2
To join these two tables (table 1 and table 2), data must be provided in sorted order. In many cases, PostgreSQL will just sort the data. However, there are other options we can use to provide the join with sorted data. One way is to consult an index, as shown in the following example:
Merge join
Index scan table 1
Index scan table 2
One side of the join, or both sides, can use sorted data coming from lower levels of the plan. If the table is accessed directly, an index is an obvious choice for this, but only if the returned result set is significantly smaller than the entire table. Otherwise, we’ll encounter almost double the overhead because we have to read the entire index and then the entire table. If the result set is a large portion of the table, a sequential scan is more efficient, especially if it’s accessed in the primary key order.
The beauty of a merge join is that it can handle a lot of data. The downside is that data has to be sorted
or taken from an index at some point.
Sorting is O(n * log(n))
. Therefore, sorting 300 million rows to perform the join isn’t attractive either.
hash join
Think of Hash Join as a kind of Nested Loop Join that builds its own index up front every time, which makes it good for joins against things you don't have an index for or joins against whole tables where sequential access beats random access.
Hash join
Sequential scan table 1
Sequential scan table 2
Both sides can be hashed, and the hash keys can be compared, leaving us with the result of the join.
The problem here is that all of the values have to be hashed and stored somewhere.
parallel hash join 11+
- https://www.enterprisedb.com/postgres-tutorials/parallel-hash-joins-postgresql-explained
- Parallel Join Algorithms (Hashing) (or just the slides)
enable_parallel_hash = [on|off]