The Art of Writing Efficient Programs - pykello/pykello.github.com GitHub Wiki

Examples from elesewhere

Branch misprediction

Chapter 2, Performance Measurements

Tools

  • perf profiler
    • sudo apt install linux-perf
    • sudo yum install perf
  • google-perftools
    • sudo apt install google-perftools
    • sudo yum install google-perftools
  • Google benchmark
sudo ln -s /lib64/libprofiler.so.0.4.16 /lib64/libprofiler.so

Examples

Repo: https://github.com/PacktPublishing/The-Art-of-Writing-Efficient-Programs

First example:

$ clang++ -g -O3 -mavx2 -Wall -pedantic 01_substring_sort_a.C 01_substring_sort.C -o example && ./example

Compile & run with profiling data:

$ clang++ -g -O3 -mavx2 -Wall -pedantic 01_substring_sort_a.C 01_substring_sort.C -lprofiler -o example
$ CPUPROFILE=prof.data ./example
$ pprof --text ./example prof.data

Performance Benchmarking

Chrono timers

#include <chrono>
using std::chrono::duration_cast;
using std::chrono::milliseconds;
using std::chrono::system_clock;
  …
auto t0 = system_clock::now();
  … do some work …
auto t1 = system_clock::now();
auto delta_t = duration_cast<milliseconds>(t1 – t0);
cout << "Time: " << delta_t.count() << endl;

High resolution timers

double duration(timespec a, timespec b) {
  return a.tv_sec - b.tv_sec + 1e-9*(a.tv_nsec - b.tv_nsec);
}
   …
{
  timespec rt0, ct0, tt0;
  clock_gettime(CLOCK_REALTIME, &rt0);
  clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ct0);
  clock_gettime(CLOCK_THREAD_CPUTIME_ID, &tt0);
  constexpr double X = 1e6;
  double s = 0;
  for (double x = 0; x < X; x += 0.1) s += sin(x);
  timespec rt1, ct1, tt1;
  clock_gettime(CLOCK_REALTIME, &rt1);
  clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ct1);
  clock_gettime(CLOCK_THREAD_CPUTIME_ID, &tt1);
  cout << "Real time: " << duration(rt1, rt0) << "s, "
            "CPU time: " << duration(ct1, ct0) << "s, "
            "Thread time: " << duration(tt1, tt0) << "s" <<
              endl;
}

Profilers

Linux perf profiler

Uses hardware performance counters & time-based sampling. It doesn't require any instrumentation of the code.

$ perf stat ./example
Sort time: 87ms (292894 comparisons)

 Performance counter stats for './example':

             89.66 msec task-clock:u              #    0.995 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
               210      page-faults:u             #    0.002 M/sec
         282136715      cycles:u                  #    3.147 GHz
        1279044290      instructions:u            #    4.53  insn per cycle
         255987321      branches:u                # 2855.090 M/sec
            361680      branch-misses:u           #    0.14% of all branches

       0.090112799 seconds time elapsed

       0.088907000 seconds user
       0.001001000 seconds sys

To get a list of events that can be counted:

$ perf list

To collect specific events:

$ perf stat -e cycles,instructions,branches,branch-misses,cache-references,cache-misses ./example

Detailed report:

$ perf record -c 10000 ./example
Sort time: 103ms (292894 comparisons)
[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 0.334 MB perf.data (10583 samples) ]

$ perf report

The Google Performance profiler

Uses hardware performance counters. Requires link-time instrumentation.

$ clang++ -g -O3 -mavx2 -Wall -pedantic 01_substring_sort_a.C 01_substring_sort.C -lprofiler -o example
$ CPUPROFILE=prof.data CPUPROFILE_FREQUENCY=1000 ./example
$ pprof --text ./example prof.data

Produce call graph:

$ google-pprof --pdf ./example prof.data > prof.pdf

Micro-benchmarking

Google Benchmark

#include "benchmark/benchmark.h"
...
void BM_loop_int(benchmark::State& state) {
  /* setup is free ... */
  const unsigned int N = state.range(0);
  ...
  /* ... what is measured goes into the body of benchmarking loop */
  for (auto _ : state) {
     benchmark::DoNotOptimize(compare_int(s1, s2));
  }
  state.SetItemsProcessed(N*state.iterations());
}
BENCHMARK(BM_loop_int)->Arg(1<<20);
BENCHMARK_MAIN();
$ GBENCH_DIR=/usr/local
$ clang++ -g -O3 -mavx2 -Wall -pedantic -I$GBENCH_DIR/include 10_compare_mbm.C  $GBENCH_DIR/lib64/libbenchmark.a -pthread -lrt -lm -o benchmark
$ ./benchmark --benchmark_repetitions=10 --benchmark_report_aggregates_only=true
2022-05-10T00:22:26-04:00
Running ./benchmark
Run on (8 X 3400 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 256 KiB (x4)
  L3 Unified 6144 KiB (x1)
Load Average: 0.00, 0.00, 0.00
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------------------------
Benchmark                           Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------
BM_loop_int/1048576_mean       329182 ns       328729 ns           10 items_per_second=3.18979G/s
BM_loop_int/1048576_median     329079 ns       328632 ns           10 items_per_second=3.19073G/s
BM_loop_int/1048576_stddev        328 ns          323 ns           10 items_per_second=3.13026M/s
BM_loop_int/1048576_cv           0.10 %          0.10 %            10 items_per_second=0.10%

Chapter 3: CPU Architecture, Resources, and Performance

Instruction Level Parallelism

CPU can execute multiple instructions at a time:

LLVM-MCA

#define MCA_START __asm volatile("# LLVM-MCA-BEGIN");
#define MCA_END __asm volatile("# LLVM-MCA-END");
           …
         for (size_t i = 0; i < N; ++i) {
MCA_START
                 a1 += p1[i] + p2[i];
MCA_END
           }

The analyzer assumes that the selected code fragment runs in a loop.

$ clang++-11 01_superscalar.C -g -O3 -mavx2 --std=c++17 -mllvm -x86-asm-syntax=intel -S -o - | llvm-mca-11 -mcpu=core-avx2 --timeline -o 1.txt

Data dependencies & pipelining

Pipelining

Register renaming: The register names you see in the program, are not the real register names, they are mapped by the CPU to the actual physical registers. The same name, can be mapped to different registers that all have the same size and functionality.

loop unrolling: usually done by compiler, but can be done in hardware to deal with data dependencies.

Out-of-order execution

Pipelining and branches

Many processors have some sort of conditional add or conditional move (cmove) which can make the program sequential.

⚠️ **GitHub.com Fallback** ⚠️