Performance of LLVM 12 vs LLVM 11 - laurynas-biveinis/unodb GitHub Wiki

The baseline and the patch are essentially the same commit, compiled twice with different LLVM versions (LLVM 11 for the baseline, LLVM 12 for the patch).

Filtered for unodb::db:

  • micro_benchmark_key_prefix: 3% slowdown (unpredictable_leaf_key_prefix_split) to 5% speedup (unpredictable_prepend_key_prefix)
  • micro_benchmark_n4: 12% slowdown (full_n4_to_minimal_random_delete<unodb::db>/4096) to 5% speedup (shrink_node16_to_n4_randomly<unodb::db>/25)
  • micro_benchmark_n16: 4% slowdown (full_n16_tree_sequential_delete<unodb::db>/32768) to 8% speedup (shrink_n48_to_n16_randomly<unodb::db>/64)
  • micro_benchmark_n48: 6% slowdown (n48_random_add<unodb::db>/8) to 4% speedup (grow_n16_to_n48_sequentially<unodb::db>/64)
  • micro_benchmark_n256: 57% slowdown (full_n256_tree_full_scan<unodb::db>/512) to 2% speedup (grow_n48_to_n256_sequentially<unodb::db>/2)

The 57% regression on full_n256_tree_full_scan<unodb::db>/512 is interesting. Baseline perf stat:

 Performance counter stats for '../../../unodb/build.llvm-11-release/benchmark/micro_benchmark_n256 --benchmark_filter=full_n256_tree_full_scan<unodb::db>/512 --benchmark_repetitions=9':

          6,415.96 msec task-clock                #    0.997 CPUs utilized          
                11      context-switches          #    0.002 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
               188      page-faults               #    0.029 K/sec                  
    16,062,708,519      cycles                    #    2.504 GHz                      (83.31%)
     6,642,424,020      stalled-cycles-frontend   #   41.35% frontend cycles idle     (83.30%)
       685,241,251      stalled-cycles-backend    #    4.27% backend cycles idle      (66.71%)
    34,055,727,883      instructions              #    2.12  insn per cycle         
                                                  #    0.20  stalled cycles per insn  (83.35%)
     5,988,864,746      branches                  #  933.432 M/sec                    (83.35%)
           792,289      branch-misses             #    0.01% of all branches          (83.32%)

LLVM 12:

 Performance counter stats for './micro_benchmark_n256 --benchmark_filter=full_n256_tree_full_scan<unodb::db>/512 --benchmark_repetitions=9':

          6,744.40 msec task-clock                #    0.997 CPUs utilized          
                 9      context-switches          #    0.001 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
               188      page-faults               #    0.028 K/sec                  
    16,885,018,121      cycles                    #    2.504 GHz                      (83.35%)
     6,222,242,905      stalled-cycles-frontend   #   36.85% frontend cycles idle     (83.33%)
     2,455,810,872      stalled-cycles-backend    #   14.54% backend cycles idle      (66.67%)
    22,552,792,360      instructions              #    1.34  insn per cycle         
                                                  #    0.28  stalled cycles per insn  (83.33%)
     3,975,874,795      branches                  #  589.507 M/sec                    (83.33%)
       249,002,721      branch-misses             #    6.26% of all branches          (83.31%)

This shows a roughly 300x increase in branch mispredictions (792,289 to 249,002,721, or 0.01% to 6.26% of all branches). The assembly diff for unodb::db::get shows almost no differences apart from jump target addresses and one SHR/MOV swap, so the likely cause is an unfortunate distribution of branch target addresses for the branch predictor.
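As a sanity check on the figures above, the following minimal Python sketch recomputes the derived metrics from the raw counters in the two perf stat outputs. The dictionary and function names are illustrative only and are not part of unodb; the counter values are copied from the runs above. Note that the two runs executed different numbers of instructions and branches, so the raw misprediction count ratio and the per-branch miss rate ratio differ.

```python
# Counter values copied verbatim from the two perf stat runs above.
# The names "baseline" and "patch" follow the wording of this page.
baseline = {
    "cycles": 16_062_708_519,
    "instructions": 34_055_727_883,
    "branches": 5_988_864_746,
    "branch_misses": 792_289,
}
patch = {
    "cycles": 16_885_018_121,
    "instructions": 22_552_792_360,
    "branches": 3_975_874_795,
    "branch_misses": 249_002_721,
}


def derived(counters):
    """Return instructions per cycle and branch miss rate for one run."""
    return {
        "ipc": counters["instructions"] / counters["cycles"],
        "miss_rate": counters["branch_misses"] / counters["branches"],
    }


b = derived(baseline)
p = derived(patch)

# Ratio of raw misprediction counts between the two runs.
count_ratio = patch["branch_misses"] / baseline["branch_misses"]
# Ratio of per-branch miss rates, which normalizes for the differing branch counts.
rate_ratio = p["miss_rate"] / b["miss_rate"]

print(f"IPC: {b['ipc']:.2f} -> {p['ipc']:.2f}")
print(f"branch miss rate: {b['miss_rate']:.4%} -> {p['miss_rate']:.4%}")
print(f"misprediction count ratio: {count_ratio:.0f}x")
print(f"miss rate ratio: {rate_ratio:.0f}x")
```

The raw count ratio comes out a little over 300x, and the per-branch rate ratio is larger still, because the second run retired fewer branches in a similar amount of time.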

Filtered for unodb::olc_db:

  • micro_benchmark_n4: 2% slowdown (shrink_node16_to_n4_randomly<unodb::olc_db>/16383) to 3% speedup (full_n4_sequential_delete<unodb::olc_db>/4096)
  • micro_benchmark_n16: 6% slowdown (full_n16_tree_sequential_delete<unodb::olc_db>/246000) to 3% speedup (shrink_n48_to_n16_randomly<unodb::olc_db>/4)
  • micro_benchmark_n48: 26% slowdown (n48_sequential_add<unodb::olc_db>/64) to 10% speedup (full_n48_tree_random_delete<unodb::olc_db>/4096)
  • micro_benchmark_n256: 3% slowdown (grow_n48_to_n256_sequentially<unodb::olc_db>/512) to 5% speedup (full_n256_tree_sequential_delete<unodb::olc_db>/196608)

For n48_sequential_add<unodb::olc_db>/64, we again see an increase in branch mispredictions, this time about 5x.
