Effect of removing is_full dispatch on add node path

Previously, the internal node add path dispatched on the node type twice: once for the is_full call, then for the add call. Avoid this by moving the node fullness check into add, and making add return whether the child was added or the node was full.
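A minimal sketch of the intended change, using hypothetical simplified names and a virtual-dispatch stand-in for unodb's actual node type dispatch:

```cpp
// Hypothetical, simplified sketch; the real unodb types, names and dispatch
// mechanism differ.
#include <cstdint>

struct inode {
  virtual ~inode() = default;
  // Before the patch: the insertion path called is_full() and then add(),
  // dispatching on the node type twice.
  virtual bool is_full() const noexcept = 0;
  // After the patch: add() performs the fullness check itself and reports
  // whether the child was added (true) or the node was full (false).
  virtual bool add(std::uint8_t key_byte, void *child) = 0;
};

// Hypothetical helper: replace the node with the next larger node type.
inode *grow(inode *node);

void add_child(inode *&node, std::uint8_t key_byte, void *child) {
  // Old path, two dispatches:
  //   if (node->is_full()) node = grow(node);
  //   node->add(key_byte, child);
  // New path, one dispatch in the common, non-full case:
  if (!node->add(key_byte, child)) {
    node = grow(node);            // the node was full: grow it and retry
    node->add(key_byte, child);
  }
}
```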

1st try

Baseline commit, patch

Benchmarks filtered with (grow|add).*unodb::db, except for Node4 filtered with insert.*unodb::db.

  • micro_benchmark_node4: 4% slowdown (minimal_node4_sequential_insert<unodb::db>/16) to 3% speedup (full_node4_sequential_insert<unodb::db>/32768)
  • micro_benchmark_node16: 1% slowdown (grow_node4_to_node16_sequentially<unodb::db>/64) to 3% speedup (grow_node4_to_node16_randomly<unodb::db>/64)
  • micro_benchmark_node48: 1% slowdown (grow_node16_to_node48_randomly<unodb::db>/64) to 4% speedup (node48_random_add<unodb::db>/8)
  • micro_benchmark_node256: 4% slowdown (grow_node48_to_node256_randomly<unodb::db>/64) to 6% speedup (node256_sequential_add<unodb::db>/1)

Valgrind Cachegrind does not support the idea that the patch actually executes fewer branches. Baseline:

node256_sequential_add<unodb::db>/1       1542 us         1541 us          454 items_per_second=134.309k/s size=30.4229k
...
==1566345== Branches:       16,661,392  (14,966,919 cond +  1,694,473 ind)

Patch:

node256_sequential_add<unodb::db>/1       1556 us         1556 us          450 items_per_second=133.066k/s size=30.4229k
...
==1566348== Branches:       16,549,815  (14,866,745 cond +  1,683,070 ind)

With the patch we did ~99% of the baseline iterations (450 vs. 454) while executing ~99% of the baseline branch instructions (16,549,815 vs. 16,661,392), i.e. essentially the same ~36.7k branches per benchmark iteration in both runs.

Disassembly shows that in the baseline the compiler had already merged the is_full dispatch with the add dispatch, although the codegen differs from the patched version.
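What "merged" means here, sketched with a hypothetical tag-based dispatch (not the actual unodb code): once is_full and add are inlined into the insertion path, the optimizer sees two branches over the same unchanged node type tag on the not-full path and can fold them into one set of comparisons.

```cpp
// Hypothetical sketch; node_type, node_header and the helpers are
// illustrative only, not the actual unodb code.
#include <cstdint>

enum class node_type : std::uint8_t { N4, N16, N48, N256 };
struct node_header { node_type type; };

bool is_full(const node_header *n);          // switches on n->type internally
void add(node_header *n, std::uint8_t key);  // switches on n->type internally

void add_child(node_header *n, std::uint8_t key) {
  // At the source level this is two dispatches on n->type.  After inlining,
  // the not-full path contains two switches over the same tag, which the
  // optimizer can merge into a single set of node type comparisons -- which
  // is what the baseline disassembly already shows.
  if (!is_full(n)) add(n, key);
}
```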

2nd try

I already established by disassembly that the patch achieves its original intent, but it nevertheless perturbs codegen, so let's look at that.

The baseline now includes the Node256 is_full optimization. Baseline, patch.
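For context, a Node256 has a child slot for every possible key byte, so the add path (which only ever inserts a key byte that is not yet present) can never find it full, and there is no larger node type to grow into. A plausible form of the optimization, as a hedged sketch (the actual unodb change may differ):

```cpp
// Hypothetical sketch of a Node256 is_full optimization; the actual unodb
// change may differ.
struct node256 {
  // ... 256 child slots, one per possible key byte ...

  // On the add path a Node256 can never be full: the key byte being added
  // is not present yet, so its slot is free, and there is no larger node
  // type to grow into anyway.  The check can therefore be a constant,
  // letting the compiler drop the corresponding branch entirely.
  static constexpr bool is_full() noexcept { return false; }
};
```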

Benchmarks filtered with (grow|add).*unodb::db, except for Node4 filtered with insert.*unodb::db:

  • micro_benchmark_node4: 5% slowdown (minimal_node4_sequential_insert<unodb::db>/64) to 3% speedup (full_node4_sequential_insert<unodb::db>/32768)
  • micro_benchmark_node16: 6% slowdown (node16_sequential_add<unodb::db>/4096) to 4% speedup (grow_node4_to_node16_randomly<unodb::db>/20)
  • micro_benchmark_node48: 1% slowdown (grow_node16_to_node48_sequentially<unodb::db>/8) to 4% speedup (node48_random_add<unodb::db>/8)
  • micro_benchmark_node256: 3% slowdown (grow_node48_to_node256_randomly<unodb::db>/64) to 2% speedup (node256_sequential_add<unodb::db>/1)

Same benchmarks, with filter adjusted for olc_db:

  • micro_benchmark_node4: 2% slowdown (full_node4_sequential_insert<unodb::olc_db>/100) to 10% speedup (full_node4_random_insert<unodb::olc_db>/512)
  • micro_benchmark_node16: 5% slowdown (node16_random_add<unodb::olc_db>/512) to 4% speedup (node16_sequential_add<unodb::olc_db>/4096)
  • micro_benchmark_node48: 3% slowdown (node48_random_add<unodb::olc_db>/8) to 2% speedup (grow_node16_to_node48_sequentially<unodb::olc_db>/8)
  • micro_benchmark_node256: 3% slowdown (grow_node48_to_node256_sequentially<unodb::olc_db>/64) to 1% speedup (grow_node48_to_node256_sequentially<unodb::olc_db>/2)

Comparing the disassembly shows that, in the OLC case, the patch indeed removes one set of node type comparisons.
