Effect of replacing RMW atomics with load non atomic modify store in OL critical sections - laurynas-biveinis/unodb GitHub Wiki

The initial implementation used relaxed_atomic in optimistic lock critical sections, including writer sections. This resulted in increments and decrements compiling to atomic RMW operations, where a non-atomic equivalent would suffice. Fixed by along the lines of replacing value.fetch_add(1, std::memory_order_relaxed) with value.store(value.load(std::memory_order_relaxed) + 1, std::memory_order_relaxed).

The effect of this change (baseline commit, change commit):

$ python3 ../../3rd_party/benchmark/tools/compare.py benchmarks ../../../unodb/build.release/benchmark/micro_benchmark_key_prefix ./micro_benchmark_key_prefix --benchmark_filter=olc_db --benchmark_repetitions=9
...
unpredictable_get_shared_length<unodb::olc_db>_pvalue                     0.0004          0.0004      U Test, Repetitions: 9 vs 9
unpredictable_get_shared_length<unodb::olc_db>_mean                      -0.0101         -0.0086             2             2             2             2
unpredictable_leaf_key_prefix_split<unodb::olc_db>_pvalue                 0.0004          0.0004      U Test, Repetitions: 9 vs 9
unpredictable_leaf_key_prefix_split<unodb::olc_db>_mean                  -0.0373         -0.0374            52            50            52            50
unpredictable_cut_key_prefix<unodb::olc_db>_pvalue                        0.0004          0.0004      U Test, Repetitions: 9 vs 9
unpredictable_cut_key_prefix<unodb::olc_db>_mean                         +0.0079         +0.0075            51            52            51            52
unpredictable_prepend_key_prefix<unodb::olc_db>_pvalue                    0.0004          0.0004      U Test, Repetitions: 9 vs 9
unpredictable_prepend_key_prefix<unodb::olc_db>_mean                     -0.0044         -0.0047            63            63            63            63

perf stat baseline:

 Performance counter stats for '../../../unodb/build.release/benchmark/micro_benchmark_key_prefix --benchmark_filter=olc_db':

          9,944.99 msec task-clock                #    0.998 CPUs utilized          
                13      context-switches          #    0.001 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
               198      page-faults               #    0.020 K/sec                  
    37,845,016,384      cycles                    #    3.805 GHz                      (83.33%)
    13,808,512,281      stalled-cycles-frontend   #   36.49% frontend cycles idle     (83.31%)
     6,469,997,393      stalled-cycles-backend    #   17.10% backend cycles idle      (66.68%)
    62,745,130,075      instructions              #    1.66  insn per cycle         
                                                  #    0.22  stalled cycles per insn  (83.35%)
    12,097,596,564      branches                  # 1216.451 M/sec                    (83.35%)
        37,257,659      branch-misses             #    0.31% of all branches          (83.34%)

       9.963432182 seconds time elapsed

       9.673211000 seconds user
       0.272146000 seconds sys

With the patch:

 Performance counter stats for './micro_benchmark_key_prefix --benchmark_filter=olc_db':

          9,943.62 msec task-clock                #    0.998 CPUs utilized          
                12      context-switches          #    0.001 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
               197      page-faults               #    0.020 K/sec                  
    37,840,303,986      cycles                    #    3.805 GHz                      (83.33%)
    13,430,158,776      stalled-cycles-frontend   #   35.49% frontend cycles idle     (83.31%)
     6,317,067,892      stalled-cycles-backend    #   16.69% backend cycles idle      (66.68%)
    63,558,791,901      instructions              #    1.68  insn per cycle         
                                                  #    0.21  stalled cycles per insn  (83.35%)
    12,259,000,270      branches                  # 1232.851 M/sec                    (83.35%)
        39,956,699      branch-misses             #    0.33% of all branches          (83.34%)

       9.962069399 seconds time elapsed

       9.615824000 seconds user
       0.328130000 seconds sys

Node microbenchmarks vary from 7% slowdown (Node4 ops, Node16 -> Node48 expansion) to 7% speedup (Node48 ops). micro_benchmark_olc varies from 9% slowdown (parallel get large tree 8 threads) to 32% speedup (parallel delete large tree 4 threads)

Generated assembly confirms that LOCK prefix is no longer used:

@@ -584,26 +555,28 @@
 mov    %rcx,0x90(%rsp)
 movzbl 0x90(%rsp,%r14,1),%ecx
 mov    %rax,0x18(%rbp,%rcx,8)
-lock addb $0x1,(%rdx)
+movzbl (%rdx),%eax
+add    $0x1,%eax
+mov    %al,(%rdx)
 mov    0xa8(%rsp),%rsi
 test   %rsi,%rsi