Timing software

I've been updating some of my earlier work on timing software - i.e. taking timestamps in code to determine the time duration between two points. I want a nanosecond resolution timer module that is portable across OSes.

For now, this is just random notes that might someday be made into a blog post.

https://github.com/fordsfords/nstm

Table of Contents

  • Conclusions
  • Basic Timings
  • Histogram
  • Series

Conclusions

  • Microsoft wants you to use QueryPerformanceCounter(), which has 100 ns resolution. (See the sketch after this list for how it and the POSIX clocks are read.)
  • CLOCK_MONOTONIC_RAW totally sucks on CentOS (over 10x the execution time of CLOCK_MONOTONIC), and probably on other Linuxes too (Bill Torpey agrees).
    • Which is too bad since _RAW is the only one on Mac that resolves to the nanosecond.
  • CLOCK_REALTIME is fast on Linux and Mac, but on Mac it only resolves to the microsecond. It's good for long-duration timings since it responds to NTP, but a non-starter for short-duration timings.
  • MacOS is not a great OS for real-time, jitter-free operation. For example, although it's hard to get the latest info, it looks like CPU affinity is not a thing with MacOS.
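
For reference, here's a minimal sketch of reading these clocks directly. This is not the nstm API (see the repo for that), and get_ns() is just an illustrative name; it's simply the raw OS calls the bullets above are talking about:

    /* Sketch only - not the nstm code.  Returns a raw nanosecond timestamp
     * using the OS facilities discussed in the conclusions above. */
    #include <stdint.h>

    #ifdef _WIN32
    #include <windows.h>

    static uint64_t get_ns(void)
    {
      LARGE_INTEGER freq, ticks;
      QueryPerformanceFrequency(&freq);  /* 10,000,000 on the Win10 box below */
      QueryPerformanceCounter(&ticks);
      /* Convert ticks to ns without overflowing 64 bits. */
      uint64_t sec = (uint64_t)ticks.QuadPart / (uint64_t)freq.QuadPart;
      uint64_t rem = (uint64_t)ticks.QuadPart % (uint64_t)freq.QuadPart;
      return sec * UINT64_C(1000000000)
          + rem * UINT64_C(1000000000) / (uint64_t)freq.QuadPart;
    }

    #else  /* Linux, MacOS */
    #include <time.h>

    static uint64_t get_ns(void)
    {
      struct timespec ts;
      /* Swap in CLOCK_MONOTONIC_RAW or CLOCK_REALTIME to see the "raw" and
       * "rt" behaviors measured below. */
      clock_gettime(CLOCK_MONOTONIC, &ts);
      return (uint64_t)ts.tv_sec * UINT64_C(1000000000) + (uint64_t)ts.tv_nsec;
    }
    #endif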

Basic Timings

Macbook Pro, 2017, 3.1 GHz Dual-Core Intel Core i5

(Big Sur, 11.4)

./tst.sh
start_ns=463050112063000
t1==463050112063000
usleep(100)=136000, raw=135567, rt=136000
null=0, raw=184, rt=0
10M nstm_get calls=758895000, raw=758881516, rt=758895000        --- uses monotonic
10M nstm_get raw calls=396588000, raw=396580244, rt=396588000    --- uses monotonic raw
10M nstm_get rt calls=309383000, raw=309377980, rt=309383000     --- uses realtime

Macbook Air, 2015, 1.6 GHz Dual-Core Intel Core i5

(Catalina, 10.15.7)

./tst.sh
start_ns=4927455088923000
t1==4927455088924000
usleep(100)=185000, raw=185738, rt=186000
null=11000, raw=10370, rt=11000
10M nstm_get calls=819944000, raw=819919143, rt=819943000
10M nstm_get raw calls=496523000, raw=496508132, rt=496523000
10M nstm_get rt calls=394989000, raw=394977923, rt=394990000

Smash, Dell, 1.38 GHz Intel

(/etc/centos-release: CentOS Linux release 8.0.1905 (Core))

./tst.sh
start_ns=0, start_ts.tv_sec=1456936
t1==870168008
usleep(100)=155356, raw=155353, rt=155327
null=386, raw=351, rt=363
10M nstm_get calls=196773627, raw=196772185, rt=196773879
10M nstm_get raw calls=2630356848, raw=2630335694, rt=2630356590
10M nstm_get rt calls=196032770, raw=196031272, rt=196032889

Crush, Dell, 1.38 GHz Intel

(/etc/centos-release: CentOS Linux release 7.3.1611 (Core))

./tst.sh
start_ns=0, start_ts.tv_sec=1456911
t1==223829810
usleep(100)=160586, raw=160512, rt=160388
null=875, raw=848, rt=829
10M nstm_get calls=295498507, raw=295495649, rt=295498291
10M nstm_get raw calls=2485423913, raw=2485399966, rt=2485423714
10M nstm_get rt calls=239707258, raw=239705043, rt=239707424

Windows 10 v 20H2, 2.90 GHz Intel Core i7-3520M

frequency=10000000, start_ticks=3674900702943
t1==1600, t1_raw=1200, t1_rt=900
Sleep(1)=8072500, raw=8072400, rt=8072600
null=500, raw=400, rt=300
10M nstm_get calls=777035200, raw=777035100, rt=777035200
10M nstm_get raw calls=763979100, raw=763979300, rt=763979200
10M nstm_get rt calls=786151200, raw=786151100, rt=786151200
10M nstm_get best calls=771959900, raw=771960000, rt=771960000

Histogram

I added a histogram for timing of 1,000,000 calls. The timing code looks like this:

      t1 = nstm_get(nstm_best);
      (void)nstm_get(nstm_best);
      t3 = nstm_get(nstm_best);
      diff = t3-t1;
Note that "diff" includes execution time for the middle "nstm_get" but also the second ~half of the first "nstm_get" and the first ~half of the third "nstm_get". Most of the time, all three "nstm_get" calls will take the same amount of time, so that you can get the execution time of a single call by dividing "diff" by 2. But the whole point of a histogram is to measure the outliers, at which point the calculation becomes less obvious. For example, on my Macbook Air, the average execution time for "nstm_get" is 49 ns. But the maximum value for "diff" during one run was 32,917. If we divide by 2, we get 16458. Is that the maximum execution time for one call? Probably not - you typically don't get two big outliers in a row (see #series). More likely, two of the three calls took about 50-ish ns and the other took 32,867 ns, although which of the three did is not knowable. But we can't know for sure; there might have been two extra-long calls in rapid succession, although probably not equally extra long.

So when you look at the printout of the whole histogram, you should subtract the average run-time of "nstm_get" from each histogram bucket to get the probable run-time of the measured "nstm_get". Near the bottom, that happens to cut the value about in half, but you *don't* want to think of it as a "divide by 2" operation. Think of it as a "probably subtract the average" operation.
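
To make that concrete, here's a small illustration (probable_call_ns() is a made-up helper, not part of nstm; avg_ns is the average "nstm_get" run time, about 49 ns on the Macbook Air):

    #include <stdint.h>

    /* Illustration only (not nstm code): convert a histogram "diff" into the
     * probable run time of the one measured call. */
    static uint64_t probable_call_ns(uint64_t diff_ns, uint64_t avg_ns)
    {
      if (diff_ns > avg_ns) {
        /* The other two calls probably contributed about one average call's
         * worth of time (two halves), so subtract the average. */
        return diff_ns - avg_ns;
      }
      return diff_ns / 2;  /* near the bottom, half is about right anyway */
    }

For the 32,917 ns example above, that gives roughly 32,868 ns rather than 16,458.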

The histogramming part looks like this:

  #define HISTO_SZ 10000
  int histo[HISTO_SZ];
...
      if (diff > max_diff) {
        max_diff = diff;
      }
      if (diff < min_diff) {
        min_diff = diff;
      }
      diff = diff / UINT64_C(10);
      if (diff >= HISTO_SZ) {
        diff = (HISTO_SZ - 1);
      }
      histo[diff]++;

Note the division by 10: each histogram bucket is 10 ns wide. The last bucket (HISTO_SZ - 1) accumulates everything that would otherwise run off the end of the histogram, i.e. all diffs of 99,990 ns or more.

When printing the histogram, I don't print the zero buckets.
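
The printing loop isn't shown above; it's just something like this (a sketch, continuing the fragment above and assuming <stdio.h>), skipping the zero buckets:

      /* Print only the non-empty buckets, matching the "histo[N]=count"
       * lines in the output below. */
      for (int i = 0; i < HISTO_SZ; i++) {
        if (histo[i] != 0) {
          printf("histo[%d]=%d\n", i, histo[i]);
        }
      }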

Macbook Air, 2015, 1.6 GHz Dual-Core Intel Core i5

(Catalina, 10.15.7)

$ ./tst.sh
start_ns=5513003162797000
t1==5513003162798000, t1_raw=5512916866717057, t1_rt=1628427798329699000
Sleep(1)=1421000, raw=1421504, rt=1422000
null=1000, raw=443, rt=0
10M nstm_get calls=818184000, raw=818159806, rt=818185000
10M nstm_get raw calls=492469000, raw=492454184, rt=492469000
10M nstm_get rt calls=401485000, raw=401472622, rt=401484000
10M nstm_get best calls=491937000, raw=491922356, rt=491937000
histo[9]=530486
histo[10]=424741
histo[11]=10387
histo[12]=31368
histo[13]=1543
histo[14]=677
histo[15]=226
histo[16]=260
histo[17]=55
histo[18]=16
histo[19]=25
histo[20]=36
histo[21]=14
histo[22]=15
histo[23]=8
histo[24]=8
histo[25]=6
histo[26]=13
histo[27]=10
histo[28]=8
histo[29]=4
histo[30]=5
histo[32]=1
...
histo[3291]=1
max_diff=32917, min_diff=93

Series

This just repeatedly records 100 individual calls to "nstm_get" and looks at the diffs between adjacent timestamps. When it finds a "big outlier" (defined as a diff of at least 1/3 of the largest timing seen by the histogram), it prints the whole series of diffs.
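
The exact code is in the repo; here's a rough sketch of the idea (the names are just for illustration, and it assumes the same nstm_get()/nstm_best as above, <stdio.h>, and that max_diff holds the largest diff seen by the histogram run):

    /* Sketch only - not the exact repo code.  Record 100 back-to-back
     * timestamps, then print the adjacent diffs if any of them is a
     * "big outlier". */
    #define SERIES_SZ 100
    uint64_t series_ts[SERIES_SZ];

    for (int i = 0; i < SERIES_SZ; i++) {
      series_ts[i] = nstm_get(nstm_best);
    }

    uint64_t threshold = max_diff / 3;  /* 1/3 of largest histogram timing */
    int big_outlier = 0;
    for (int i = 1; i < SERIES_SZ; i++) {
      if (series_ts[i] - series_ts[i-1] >= threshold) {
        big_outlier = 1;
      }
    }

    if (big_outlier) {
      for (int i = 1; i < SERIES_SZ; i++) {
        printf("series diff %d=%llu\n", i,
               (unsigned long long)(series_ts[i] - series_ts[i-1]));
      }
    }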

Macbook Air, 2015, 1.6 GHz Dual-Core Intel Core i5

(Catalina, 10.15.7)

Here's a sample excerpt of the series portion of the output:

series diff 1=58
series diff 2=50
series diff 3=50
series diff 4=60
series diff 5=50
series diff 6=50
series diff 7=59
series diff 8=51
series diff 9=50
series diff 10=59
series diff 11=50
...
series diff 32=51
series diff 33=50
series diff 34=8581
series diff 35=53
series diff 36=61
series diff 37=196
series diff 38=48
...

Here's a shorter excerpt from another run:

series diff 27=65
series diff 28=74
series diff 29=34207
series diff 30=284
series diff 31=78
series diff 32=65

As you can see, it is *not* correct to simply assume that an outlier is isolated, sandwiched between two average values.

Here's another:

series diff 61=50
series diff 62=50
series diff 63=60
series diff 64=50
series diff 65=50
series diff 66=58844
series diff 67=481
series diff 68=103
series diff 69=80
series diff 70=83
series diff 71=72
series diff 72=91
series diff 73=65

The values before the outlier seem to be noticeably lower than the values after the outlier (this is true over the entire series). One possible (speculative) explanation: maybe the thread switched CPUs at that point? (I've seen a different run that did the opposite: it started out at the higher values, then after the big outlier settled to the smaller values.)