Timing software
I've been updating some of my earlier work on timing software - i.e. taking timestamps in code to determine the time duration between two points. I want a nanosecond resolution timer module that is portable across OSes.
For now, this is just random notes that might someday be made into a blog post.
https://github.com/fordsfords/nstm
- Microsoft wants you to use QueryPerformanceCounter(), which has 100 ns resolution.
- CLOCK_MONOTONIC_RAW totally sucks on CentOS (over 10x the execution time of CLOCK_MONOTONIC), and probably on other Linuxes too (Bill Torpey agrees).
- Which is too bad since _RAW is the only one on Mac that resolves to the nanosecond.
- CLOCK_REALTIME is fast on Linux and Mac, but on Mac it only resolves to the microsecond. Good for long-duration timings since it responds to NTP corrections, but a non-starter for short-duration timings. (A minimal wrapper sketch follows this list.)
- MacOS is not a great OS for real-time, jitter-free operation. For example, although it's hard to get the latest info, it looks like CPU affinity is not a thing with MacOS.
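Here's that minimal wrapper sketch. It is not the actual nstm code (see the repo above for that); it just shows the two underlying APIs: clock_gettime() on Linux/Mac and QueryPerformanceCounter() on Windows.

```c
#include <stdint.h>
#ifdef _WIN32
#include <windows.h>
#else
#include <time.h>
#endif

/* Return a timestamp in nanoseconds (sketch only; error checking omitted). */
static uint64_t my_get_ns(void)
{
#ifdef _WIN32
  LARGE_INTEGER freq, ticks;
  QueryPerformanceFrequency(&freq);  /* typically 10 MHz, i.e. 100 ns per tick */
  QueryPerformanceCounter(&ticks);
  return (uint64_t)((double)ticks.QuadPart * (1000000000.0 / (double)freq.QuadPart));
#else
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);  /* or CLOCK_MONOTONIC_RAW / CLOCK_REALTIME */
  return (uint64_t)ts.tv_sec * UINT64_C(1000000000) + (uint64_t)ts.tv_nsec;
#endif
}
```

A real module would cache the QueryPerformanceFrequency() result and let the caller select the clock (which is what nstm's "nstm_get" selectors like "nstm_best" do).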
(Big Sur, 11.4)
```
./tst.sh
start_ns=463050112063000
t1==463050112063000
usleep(100)=136000, raw=135567, rt=136000
null=0, raw=184, rt=0
10M nstm_get calls=758895000, raw=758881516, rt=758895000 --- uses monotonic
10M nstm_get raw calls=396588000, raw=396580244, rt=396588000 --- uses monotonic raw
10M nstm_get rt calls=309383000, raw=309377980, rt=309383000 --- uses monotonic realtime
```
(Catalina, 10.15.7)
```
./tst.sh
start_ns=4927455088923000
t1==4927455088924000
usleep(100)=185000, raw=185738, rt=186000
null=11000, raw=10370, rt=11000
10M nstm_get calls=819944000, raw=819919143, rt=819943000
10M nstm_get raw calls=496523000, raw=496508132, rt=496523000
10M nstm_get rt calls=394989000, raw=394977923, rt=394990000
```
(/etc/centos-release: CentOS Linux release 8.0.1905 (Core))
```
./tst.sh
start_ns=0, start_ts.tv_sec=1456936
t1==870168008
usleep(100)=155356, raw=155353, rt=155327
null=386, raw=351, rt=363
10M nstm_get calls=196773627, raw=196772185, rt=196773879
10M nstm_get raw calls=2630356848, raw=2630335694, rt=2630356590
10M nstm_get rt calls=196032770, raw=196031272, rt=196032889
```
(/etc/centos-release: CentOS Linux release 7.3.1611 (Core))
```
./tst.sh
start_ns=0, start_ts.tv_sec=1456911
t1==223829810
usleep(100)=160586, raw=160512, rt=160388
null=875, raw=848, rt=829
10M nstm_get calls=295498507, raw=295495649, rt=295498291
10M nstm_get raw calls=2485423913, raw=2485399966, rt=2485423714
10M nstm_get rt calls=239707258, raw=239705043, rt=239707424
```
(Windows)
```
frequency=10000000, start_ticks=3674900702943
t1==1600, t1_raw=1200, t1_rt=900
Sleep(1)=8072500, raw=8072400, rt=8072600
null=500, raw=400, rt=300
10M nstm_get calls=777035200, raw=777035100, rt=777035200
10M nstm_get raw calls=763979100, raw=763979300, rt=763979200
10M nstm_get rt calls=786151200, raw=786151100, rt=786151200
10M nstm_get best calls=771959900, raw=771960000, rt=771960000
```
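For context on reading these numbers: the "10M nstm_get ... calls=" values are the total elapsed nanoseconds for 10 million back-to-back calls (divide by 10,000,000 for the per-call average), and the "raw="/"rt=" fields appear to be the same interval re-measured with the raw and realtime clocks. Here's a sketch of the kind of loop that produces such a number; it is not the actual tst.c, and it assumes the "nstm_get"/"nstm_best" API shown in the histogram code below.

```c
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include "nstm.h"  /* assumed header name for the nstm module in the repo above */

/* Sketch: total elapsed ns for 10 million back-to-back nstm_get() calls. */
void time_10m_calls(void)
{
  uint64_t t_start, t_end;
  int i;

  t_start = nstm_get(nstm_best);
  for (i = 0; i < 10000000; i++) {
    (void)nstm_get(nstm_best);
  }
  t_end = nstm_get(nstm_best);

  printf("10M nstm_get calls=%" PRIu64 "\n", t_end - t_start);
}
```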
I added a histogram of the timings of 1,000,000 individual calls. The timing code looks like this:
```c
t1 = nstm_get(nstm_best);
(void)nstm_get(nstm_best);
t3 = nstm_get(nstm_best);
diff = t3 - t1;
```
Note that "diff" includes the execution time of the middle "nstm_get", but also the second ~half of the first "nstm_get" and the first ~half of the third "nstm_get". Most of the time, all three "nstm_get" calls will take the same amount of time, so you can get the execution time of a single call by dividing "diff" by 2. But the whole point of a histogram is to measure the outliers, at which point the calculation becomes less obvious. For example, on my Macbook Air, the average execution time for "nstm_get" is 49 ns. But the maximum value for "diff" during one run was 32,917. If we divide by 2, we get 16,458. Is that the maximum execution time for one call? Probably not - you typically don't get two big outliers in a row (see #series). More likely, two of the three calls took about 50-ish ns and the other took 32,867 ns, although which of the three it was is not knowable. But we can't know for sure; there might have been two extra-long calls in rapid succession, although probably not equally extra long.
So when you look at the printout of the whole histogram, you should subtract the average run-time of "nstm_get" from each histogram bucket to get the probable run-time of the measured "nstm_get". Near the bottom of the histogram, that will cut the value about in half, but you *don't* want to think of it as a "divide by 2" operation. Think of it as a "probably subtract the average" operation.
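A quick worked example of that "subtract the average" operation, using the ~49 ns average from the Macbook Air run (the buckets chosen here are just for illustration):

```c
/* Average nstm_get() execution time on this machine: ~49 ns. */
uint64_t avg_ns = 49;

/* A diff in histo[12] covers 120..129 ns.  The probable run time of the
 * measured call is about 120 - 49 = 71 ns (not 120 / 2 = 60 ns). */
uint64_t bucket12_probable_ns = 12 * 10 - avg_ns;

/* For the big outlier the two interpretations diverge badly:
 * 32917 - 49 = 32868 ns, versus 32917 / 2 = 16458 ns. */
uint64_t max_probable_ns = 32917 - avg_ns;
```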
The histogramming part looks like this:
```c
#define HISTO_SZ 10000
int histo[HISTO_SZ];
...
if (diff > max_diff) { max_diff = diff; }
if (diff < min_diff) { min_diff = diff; }
diff = diff / UINT64_C(10);
if (diff >= HISTO_SZ) { diff = (HISTO_SZ - 1); }
histo[diff]++;
```
Note the division by 10: each histogram bucket is 10 ns wide. And the last bucket (HISTO_SZ - 1) collects everything that would otherwise land at or beyond it (i.e. all diffs of 99,990 ns and up).
When printing the histogram, I don't print the zero buckets.
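The printing loop is roughly like this (a sketch of what such a loop can look like, not necessarily the exact code in the repo):

```c
/* Sketch: print only the non-empty buckets; bucket i covers diffs of
 * i*10 .. i*10+9 ns.  Assumes max_diff and min_diff are uint64_t. */
int i;
for (i = 0; i < HISTO_SZ; i++) {
  if (histo[i] != 0) {
    printf("histo[%d]=%d\n", i, histo[i]);
  }
}
printf("max_diff=%" PRIu64 ", min_diff=%" PRIu64 "\n", max_diff, min_diff);
```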
(Catalina, 10.15.7)
```
$ ./tst.sh
start_ns=5513003162797000
t1==5513003162798000, t1_raw=5512916866717057, t1_rt=1628427798329699000
Sleep(1)=1421000, raw=1421504, rt=1422000
null=1000, raw=443, rt=0
10M nstm_get calls=818184000, raw=818159806, rt=818185000
10M nstm_get raw calls=492469000, raw=492454184, rt=492469000
10M nstm_get rt calls=401485000, raw=401472622, rt=401484000
10M nstm_get best calls=491937000, raw=491922356, rt=491937000
histo[9]=530486
histo[10]=424741
histo[11]=10387
histo[12]=31368
histo[13]=1543
histo[14]=677
histo[15]=226
histo[16]=260
histo[17]=55
histo[18]=16
histo[19]=25
histo[20]=36
histo[21]=14
histo[22]=15
histo[23]=8
histo[24]=8
histo[25]=6
histo[26]=13
histo[27]=10
histo[28]=8
histo[29]=4
histo[30]=5
histo[32]=1
...
histo[3291]=1
max_diff=32917, min_diff=93
```
The "series" part (referenced above) just repeatedly records timings across 100 individual calls to "nstm_get" and looks at the diffs. When it finds a "big outlier" (defined to be at least 1/3 of the largest histogram timing), it prints the series of diffs.
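In code terms, the series logic is roughly like this (a sketch under the same assumptions as above, not the exact code from the repo):

```c
#define SERIES_SZ 100

uint64_t series[SERIES_SZ + 1];
int i, found_outlier;

/* Record 101 back-to-back timestamps, giving 100 consecutive diffs. */
for (i = 0; i <= SERIES_SZ; i++) {
  series[i] = nstm_get(nstm_best);
}

/* If any diff is a "big outlier" (here: at least 1/3 of the largest diff
 * seen by the histogram pass), print the whole series of diffs. */
found_outlier = 0;
for (i = 0; i < SERIES_SZ; i++) {
  if (series[i + 1] - series[i] >= max_diff / 3) {
    found_outlier = 1;
  }
}
if (found_outlier) {
  for (i = 0; i < SERIES_SZ; i++) {
    printf("series diff %d=%" PRIu64 "\n", i + 1, series[i + 1] - series[i]);
  }
}
```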
(Catalina, 10.15.7)
Here's a sample excerpt of the series portion of the output:
```
series diff 1=58
series diff 2=50
series diff 3=50
series diff 4=60
series diff 5=50
series diff 6=50
series diff 7=59
series diff 8=51
series diff 9=50
series diff 10=59
series diff 11=50
...
series diff 32=51
series diff 33=50
series diff 34=8581
series diff 35=53
series diff 36=61
series diff 37=196
series diff 38=48
...
```
Here's a shorter excerpt from another run:
```
series diff 27=65
series diff 28=74
series diff 29=34207
series diff 30=284
series diff 31=78
series diff 32=65
```
As you can see, it is *not* correct to simply assume that an outlier is isolated, sandwiched between two average values.
Here's another:
```
series diff 61=50
series diff 62=50
series diff 63=60
series diff 64=50
series diff 65=50
series diff 66=58844
series diff 67=481
series diff 68=103
series diff 69=80
series diff 70=83
series diff 71=72
series diff 72=91
series diff 73=65
```
The values before the outlier seem to be noticeably lower than the values after the outlier (this is true over the entire series). One possible (speculative) explanation: maybe the thread switched CPUs at that point? (I've seen a different run that did the opposite: it started out at the higher values, then after the big outlier settled to the smaller values.)