14 join algorithms - mergen-sergio/DBest GitHub Wiki
The Nested Loop Join algorithm evaluates each tuple from the outer side by searching for matching tuples on the inner side.
- Naive Strategy: Scans all tuples on the inner side for each outer-side tuple, making it inefficient.
-
Optimized Usage:
- Leverage an operator on the inner side to enhance lookup performance.
- Examples:
- Using a data node indexed by the lookup column.
- Applying a materialized operation that reorders data for efficient access.
The illustration below demonstrates two query trees.
-
Left Tree: The
movie_cast
data node is on the outer side, while theperson
data node, indexed byperson_id
, is on the inner side. This setup is efficient since the index facilitates fast lookups for each tuple.

-
Right Tree: The sides are reversed, with
movie_cast
as the inner side. This is inefficient because:- The
movie_cast
node is indexed by bothmovie_id
andperson_id
, butperson_id
is the secondary key. - Each lookup requires scanning the entire index for matches, resulting in high overhead.
- The
-
Switch the Sides: Configure the
person
data node as the inner side to leverage its index (as in the left tree). -
Add a Materialized Operation: Connect
movie_cast
to a materialized operation that offers efficient lookups (as in the next example). - Use a Hash Join: Replace the Nested Loop Join with a Hash Join for improved performance (as in the next example).
The Hash Join algorithm builds a hash table from the inner side using the join column(s) as the key. The outer side tuples are then matched against this hash table.
- Efficiency: Performs better than Nested Loop Join for larger datasets.
- Memory Usage: Requires memory to store the hash table.
-
Equivalent Process:
- Combines elements of a Nested Loop Join, Hash operator, and Projection operator.
The figure below compares two equivalent approaches to Hash Join:
- Left Tree: Directly builds a hash table from the inner side and uses it to match tuples from the outer side, avoiding repeated scans.

- Right Tree: Projects relevant columns from the inner side, builds a hash table using these columns, and uses a Nested Loop Join for lookups. The Hash operator is a materialized operation that dynamically aligns its keys with the parent node (the join operator). Both trees achieve the same result, but the left tree is a simplified representation.
The Merge Join algorithm processes both sides of the join simultaneously in a single, forward-only scan, identifying matches as they are encountered.
-
Prerequisite: Both sides must be sorted by the join column(s).
- A Sort operator can be added to ensure the correct order.
- Pitfalls: Unsuitable for unsorted data, as matches may be missed.
-
Efficiency:
- When pre-sorted:
- It is the most efficient join algorithm.
- Requires no additional memory.
- Processes each side exactly once.
- When pre-sorted:
The image below compares two query trees using Merge Join:
-
Left Tree: Joins
movie
andmovie_cast
nodes, both already sorted by the join column (movie_id
). This is an optimal setup for Merge Join.

-
Right Tree: Joins
movie_cast
andperson
nodes, but requires a Sort operator for the inner side (person
) to align tuples by the join column (person_id
). When sorting is needed, alternative join strategies may offer better performance.
By understanding the nuances of these join algorithms, their optimal scenarios, and trade-offs, you can design efficient query plans to maximize relational query performance.