Effect of replacing RMW atomics with load non atomic modify store in OL critical sections - laurynas-biveinis/unodb GitHub Wiki
The initial implementation used relaxed_atomic in optimistic lock critical sections, including writer sections. This resulted in increments and decrements compiling to atomic RMW operations, where a non-atomic equivalent would suffice. Fixed by along the lines of replacing value.fetch_add(1, std::memory_order_relaxed)
with value.store(value.load(std::memory_order_relaxed) + 1, std::memory_order_relaxed)
.
The effect of this change (baseline commit, change commit):
$ python3 ../../3rd_party/benchmark/tools/compare.py benchmarks ../../../unodb/build.release/benchmark/micro_benchmark_key_prefix ./micro_benchmark_key_prefix --benchmark_filter=olc_db --benchmark_repetitions=9
...
unpredictable_get_shared_length<unodb::olc_db>_pvalue 0.0004 0.0004 U Test, Repetitions: 9 vs 9
unpredictable_get_shared_length<unodb::olc_db>_mean -0.0101 -0.0086 2 2 2 2
unpredictable_leaf_key_prefix_split<unodb::olc_db>_pvalue 0.0004 0.0004 U Test, Repetitions: 9 vs 9
unpredictable_leaf_key_prefix_split<unodb::olc_db>_mean -0.0373 -0.0374 52 50 52 50
unpredictable_cut_key_prefix<unodb::olc_db>_pvalue 0.0004 0.0004 U Test, Repetitions: 9 vs 9
unpredictable_cut_key_prefix<unodb::olc_db>_mean +0.0079 +0.0075 51 52 51 52
unpredictable_prepend_key_prefix<unodb::olc_db>_pvalue 0.0004 0.0004 U Test, Repetitions: 9 vs 9
unpredictable_prepend_key_prefix<unodb::olc_db>_mean -0.0044 -0.0047 63 63 63 63
perf stat baseline:
Performance counter stats for '../../../unodb/build.release/benchmark/micro_benchmark_key_prefix --benchmark_filter=olc_db':
9,944.99 msec task-clock # 0.998 CPUs utilized
13 context-switches # 0.001 K/sec
0 cpu-migrations # 0.000 K/sec
198 page-faults # 0.020 K/sec
37,845,016,384 cycles # 3.805 GHz (83.33%)
13,808,512,281 stalled-cycles-frontend # 36.49% frontend cycles idle (83.31%)
6,469,997,393 stalled-cycles-backend # 17.10% backend cycles idle (66.68%)
62,745,130,075 instructions # 1.66 insn per cycle
# 0.22 stalled cycles per insn (83.35%)
12,097,596,564 branches # 1216.451 M/sec (83.35%)
37,257,659 branch-misses # 0.31% of all branches (83.34%)
9.963432182 seconds time elapsed
9.673211000 seconds user
0.272146000 seconds sys
With the patch:
Performance counter stats for './micro_benchmark_key_prefix --benchmark_filter=olc_db':
9,943.62 msec task-clock # 0.998 CPUs utilized
12 context-switches # 0.001 K/sec
0 cpu-migrations # 0.000 K/sec
197 page-faults # 0.020 K/sec
37,840,303,986 cycles # 3.805 GHz (83.33%)
13,430,158,776 stalled-cycles-frontend # 35.49% frontend cycles idle (83.31%)
6,317,067,892 stalled-cycles-backend # 16.69% backend cycles idle (66.68%)
63,558,791,901 instructions # 1.68 insn per cycle
# 0.21 stalled cycles per insn (83.35%)
12,259,000,270 branches # 1232.851 M/sec (83.35%)
39,956,699 branch-misses # 0.33% of all branches (83.34%)
9.962069399 seconds time elapsed
9.615824000 seconds user
0.328130000 seconds sys
Node microbenchmarks vary from 7% slowdown (Node4 ops, Node16 -> Node48 expansion) to 7% speedup (Node48 ops). micro_benchmark_olc varies from 9% slowdown (parallel get large tree 8 threads) to 32% speedup (parallel delete large tree 4 threads)
Generated assembly confirms that LOCK prefix is no longer used:
@@ -584,26 +555,28 @@
mov %rcx,0x90(%rsp)
movzbl 0x90(%rsp,%r14,1),%ecx
mov %rax,0x18(%rbp,%rcx,8)
-lock addb $0x1,(%rdx)
+movzbl (%rdx),%eax
+add $0x1,%eax
+mov %al,(%rdx)
mov 0xa8(%rsp),%rsi
test %rsi,%rsi