# Effect of removing is_full dispatch on add node path
Previously, the internal node add path dispatched twice: once for the `is_full` call, then for the `add` call. Avoid this by moving the node fullness check into `add` and making it return whether the add was successful or the node was full.
Benchmarks filtered with `(grow|add).*unodb::db`, except for Node4, filtered with `insert.*unodb::db`:
- `micro_benchmark_node4`: 4% slowdown (`minimal_node4_sequential_insert<unodb::db>/16`) to 3% speedup (`full_node4_sequential_insert<unodb::db>/32768`)
- `micro_benchmark_node16`: 1% slowdown (`grow_node4_to_node16_sequentially<unodb::db>/64`) to 3% speedup (`grow_node4_to_node16_randomly<unodb::db>/64`)
- `micro_benchmark_node48`: 1% slowdown (`grow_node16_to_node48_randomly<unodb::db>/64`) to 4% speedup (`node48_random_add<unodb::db>/8`)
- `micro_benchmark_node256`: 4% slowdown (`grow_node48_to_node256_randomly<unodb::db>/64`) to 6% speedup (`node256_sequential_add<unodb::db>/1`)
Valgrind Cachegrind does not support the idea that the patch actually executes fewer branches. Baseline:

```
node256_sequential_add<unodb::db>/1 1542 us 1541 us 454 items_per_second=134.309k/s size=30.4229k
...
==1566345== Branches: 16,661,392 (14,966,919 cond + 1,694,473 ind)
```
Patch:

```
node256_sequential_add<unodb::db>/1 1556 us 1556 us 450 items_per_second=133.066k/s size=30.4229k
...
==1566348== Branches: 16,549,815 (14,866,745 cond + 1,683,070 ind)
```
With the patch we did ~99% of baseline iterations using ~99% of baseline branch instructions.
Disassembly shows that the baseline had already merged the `is_full` dispatch with the `add` dispatch, although the codegen is different.

I had already established by disassembly that the patch does achieve its original intent, but it nevertheless perturbs codegen. So let's look at that.
Baseline now includes the Node256 `is_full` optimization. Baseline, patch.
Benchmarks filtered with `(grow|add).*unodb::db`, except for Node4, filtered with `insert.*unodb::db`:
- `micro_benchmark_node4`: 5% slowdown (`minimal_node4_sequential_insert<unodb::db>/64`) to 3% speedup (`full_node4_sequential_insert<unodb::db>/32768`)
- `micro_benchmark_node16`: 6% slowdown (`node16_sequential_add<unodb::db>/4096`) to 4% speedup (`grow_node4_to_node16_randomly<unodb::db>/20`)
- `micro_benchmark_node48`: 1% slowdown (`grow_node16_to_node48_sequentially<unodb::db>/8`) to 4% speedup (`node48_random_add<unodb::db>/8`)
- `micro_benchmark_node256`: 3% slowdown (`grow_node48_to_node256_randomly<unodb::db>/64`) to 2% speedup (`node256_sequential_add<unodb::db>/1`)
Same benchmarks, with the filter adjusted for `olc_db`:
- `micro_benchmark_node4`: 2% slowdown (`full_node4_sequential_insert<unodb::olc_db>/100`) to 10% speedup (`full_node4_random_insert<unodb::olc_db>/512`)
- `micro_benchmark_node16`: 5% slowdown (`node16_random_add<unodb::olc_db>/512`) to 4% speedup (`node16_sequential_add<unodb::olc_db>/4096`)
- `micro_benchmark_node48`: 3% slowdown (`node48_random_add<unodb::olc_db>/8`) to 2% speedup (`grow_node16_to_node48_sequentially<unodb::olc_db>/8`)
- `micro_benchmark_node256`: 3% slowdown (`grow_node48_to_node256_sequentially<unodb::olc_db>/64`) to 1% speedup (`grow_node48_to_node256_sequentially<unodb::olc_db>/2`)
Disassembly comparison shows that, in the OLC case, one set of node type comparisons is indeed removed by the patch.