Effect of removing branching on children_count in N4 insert_position_search - laurynas-biveinis/unodb GitHub Wiki

baseline commit, patch

Benchmark micro_benchmark_n4 --benchmark_filter='(insert|delete).*unodb::db' shows variance of 5% slowdown to 3% speedup. Disassembly confirms that comparison is removed, together with a lot of code reshuffling.

Perf stat for the same, baseline:

 Performance counter stats for './micro_benchmark_n4 --benchmark_filter=(insert|delete).*unodb::db':

         64,525.63 msec task-clock                #    1.000 CPUs utilized          
                82      context-switches          #    0.001 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
           760,108      page-faults               #    0.012 M/sec                  
   161,483,904,044      cycles                    #    2.503 GHz                      (83.33%)
    55,131,401,371      stalled-cycles-frontend   #   34.14% frontend cycles idle     (83.33%)
    27,545,377,239      stalled-cycles-backend    #   17.06% backend cycles idle      (66.67%)
   320,310,195,225      instructions              #    1.98  insn per cycle         
                                                  #    0.17  stalled cycles per insn  (83.34%)
    58,357,417,306      branches                  #  904.407 M/sec                    (83.34%)
       354,373,579      branch-misses             #    0.61% of all branches          (83.33%)

      64.542931348 seconds time elapsed

      61.789857000 seconds user
       2.736259000 seconds sys

Patch:

 Performance counter stats for './micro_benchmark_n4 --benchmark_filter=(insert|delete).*unodb::db':

         64,861.68 msec task-clock                #    1.000 CPUs utilized          
               102      context-switches          #    0.002 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
           623,909      page-faults               #    0.010 M/sec                  
   162,326,540,442      cycles                    #    2.503 GHz                      (83.33%)
    54,952,251,076      stalled-cycles-frontend   #   33.85% frontend cycles idle     (83.33%)
    27,718,666,852      stalled-cycles-backend    #   17.08% backend cycles idle      (66.67%)
   319,216,689,125      instructions              #    1.97  insn per cycle         
                                                  #    0.17  stalled cycles per insn  (83.34%)
    58,015,343,373      branches                  #  894.447 M/sec                    (83.34%)
       362,515,162      branch-misses             #    0.62% of all branches          (83.33%)

      64.879208397 seconds time elapsed

      62.094087000 seconds user
       2.768271000 seconds sys