Performance of LLVM 12 vs LLVM 11 - laurynas-biveinis/unodb GitHub Wiki
The baseline and the patch are essentially the same commit, compiled twice with different compiler versions (LLVM 11 for the baseline, LLVM 12 for the patch).
Filtered for unodb::db:
- micro_benchmark_key_prefix: 3% slowdown (unpredictable_leaf_key_prefix_split) to 5% speedup (unpredictable_prepend_key_prefix)
- micro_benchmark_n4: 12% slowdown (full_n4_to_minimal_random_delete<unodb::db>/4096) to 5% speedup (shrink_node16_to_n4_randomly<unodb::db>/25)
- micro_benchmark_n16: 4% slowdown (full_n16_tree_sequential_delete<unodb::db>/32768) to 8% speedup (shrink_n48_to_n16_randomly<unodb::db>/64)
- micro_benchmark_n48: 6% slowdown (n48_random_add<unodb::db>/8) to 4% speedup (grow_n16_to_n48_sequentially<unodb::db>/64)
- micro_benchmark_n256: 57% slowdown (full_n256_tree_full_scan<unodb::db>/512) to 2% speedup (grow_n48_to_n256_sequentially<unodb::db>/2)
The 57% regression on full_n256_tree_full_scan<unodb::db>/512 is interesting. Baseline perf stat:
Performance counter stats for '../../../unodb/build.llvm-11-release/benchmark/micro_benchmark_n256 --benchmark_filter=full_n256_tree_full_scan<unodb::db>/512 --benchmark_repetitions=9':
6,415.96 msec task-clock # 0.997 CPUs utilized
11 context-switches # 0.002 K/sec
0 cpu-migrations # 0.000 K/sec
188 page-faults # 0.029 K/sec
16,062,708,519 cycles # 2.504 GHz (83.31%)
6,642,424,020 stalled-cycles-frontend # 41.35% frontend cycles idle (83.30%)
685,241,251 stalled-cycles-backend # 4.27% backend cycles idle (66.71%)
34,055,727,883 instructions # 2.12 insn per cycle
# 0.20 stalled cycles per insn (83.35%)
5,988,864,746 branches # 933.432 M/sec (83.35%)
792,289 branch-misses # 0.01% of all branches (83.32%)
LLVM 12:
Performance counter stats for './micro_benchmark_n256 --benchmark_filter=full_n256_tree_full_scan<unodb::db>/512 --benchmark_repetitions=9':
6,744.40 msec task-clock # 0.997 CPUs utilized
9 context-switches # 0.001 K/sec
0 cpu-migrations # 0.000 K/sec
188 page-faults # 0.028 K/sec
16,885,018,121 cycles # 2.504 GHz (83.35%)
6,222,242,905 stalled-cycles-frontend # 36.85% frontend cycles idle (83.33%)
2,455,810,872 stalled-cycles-backend # 14.54% backend cycles idle (66.67%)
22,552,792,360 instructions # 1.34 insn per cycle
# 0.28 stalled cycles per insn (83.33%)
3,975,874,795 branches # 589.507 M/sec (83.33%)
249,002,721 branch-misses # 6.26% of all branches (83.31%)
This shows a roughly 300x increase in branch mispredictions (792,289 vs 249,002,721). The assembly diff for unodb::db::get shows nearly no differences except for jump target addresses and one SHR/MOV swap. The likely cause is therefore a jump target address distribution that is unfortunate for the branch predictor.
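As a rough cross-check of the figures above, the following sketch recomputes a few derived metrics from the two perf stat runs. The counter values are copied from the output above; the dictionary and function names are made up for illustration. The two runs retired different numbers of instructions, so rates normalized per branch or per instruction are presumably a fairer comparison than raw counts.

```python
# Counter values copied from the two perf stat runs above.
# The names "baseline", "second_run", and "derived" are illustrative only.
baseline = {
    "cycles": 16_062_708_519,
    "instructions": 34_055_727_883,
    "branches": 5_988_864_746,
    "branch_misses": 792_289,
}
second_run = {
    "cycles": 16_885_018_121,
    "instructions": 22_552_792_360,
    "branches": 3_975_874_795,
    "branch_misses": 249_002_721,
}


def derived(counters):
    """Return metrics normalized by the amount of work done in the run."""
    return {
        "ipc": counters["instructions"] / counters["cycles"],
        "miss_rate": counters["branch_misses"] / counters["branches"],
        "misses_per_1k_insn": 1000 * counters["branch_misses"]
        / counters["instructions"],
    }


base = derived(baseline)
other = derived(second_run)

# Ratio of raw misprediction counts, not normalized by work done.
raw_miss_ratio = second_run["branch_misses"] / baseline["branch_misses"]
print(f"raw branch-miss count ratio: {raw_miss_ratio:.0f}x")

# Ratio of each normalized metric between the two runs.
for name in base:
    ratio = other[name] / base[name]
    print(f"{name}: {base[name]:.4g} -> {other[name]:.4g} ({ratio:.2f}x)")
```

The raw count ratio corresponds to the roughly 300x figure quoted above, and the computed instructions-per-cycle values should agree with the "insn per cycle" column that perf stat reports.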
Filtered for unodb::olc_db:
- micro_benchmark_n4: 2% slowdown (shrink_node16_to_n4_randomly<unodb::olc_db>/16383) to 3% speedup (full_n4_sequential_delete<unodb::olc_db>/4096)
- micro_benchmark_n16: 6% slowdown (full_n16_tree_sequential_delete<unodb::olc_db>/246000) to 3% speedup (shrink_n48_to_n16_randomly<unodb::olc_db>/4)
- micro_benchmark_n48: 26% slowdown (n48_sequential_add<unodb::olc_db>/64) to 10% speedup (full_n48_tree_random_delete<unodb::olc_db>/4096)
- micro_benchmark_n256: 3% slowdown (grow_n48_to_n256_sequentially<unodb::olc_db>/512) to 5% speedup (full_n256_tree_sequential_delete<unodb::olc_db>/196608)
For n48_sequential_add<unodb::olc_db>/64, again we get a 5x increase in branch mispredictions.