How does fio keep time - vincentkfu/fio-blog GitHub Wiki
Adam Manzanares @nmtadam provided helpful feedback on an earlier draft of this post. All remaining errors and omissions are mine of course.
Fio collects multiple timestamps for every single I/O (see latency blog
post).
Fio also keeps track of time for many other features (runtime, bandwidth/IOPS
logging, steady state detection, etc). A simple approach would query the
operating system for each timestamp, but system calls are expensive and the
cumulative cost could be enormous given how frequently fio records timestamps.
So when possible, fio uses the CPU clock for time keeping. For modern x86-64
CPUs this is done via the RDTSC
instruction which returns the number of CPU cycles elapsed since the processor
was reset.
How does fio actually use CPU clock ticks for timing? This blog post provides technical documentation describing the steps in fio 3.32. The process is quite an ordeal but does include some nice engineering.
The first step is to relate CPU clock ticks to wall time. Fio uses system calls
to query wall clock time and counts the number of CPU clock ticks in a given
duration in gettime.c:get_cycles_per_msec()
:
get_cycles_per_msec()
static unsigned long get_cycles_per_msec(void)
{
struct timespec s, e;
uint64_t c_s, c_e;
uint64_t elapsed;
fio_get_mono_time(&s);
c_s = get_cpu_clock();
do {
fio_get_mono_time(&e);
c_e = get_cpu_clock();
elapsed = ntime_since(&s, &e);
if (elapsed >= 1280000)
break;
} while (1);
return (c_e - c_s) * 1000000 / elapsed;
}
Fio first records a starting wall clock timestamp and a starting CPU clock tick count. Then, the current wall clock time and CPU clock tick count are repeatedly requested until at least 1.28ms has elapsed. At that point the difference between the start and end CPU clock tick counts is divided by the elapsed wall clock time in order to calculate the number of clock ticks per millisecond.
This process is actually repeated 50 times in order to obtain a reliable measurement. Outliers are discarded and a trimmed mean is ultimately used to measure the rate at which the CPU clock ticks.
How does fio actually use this measurement to convert a clock tick count to nanoseconds? The straightforward approach would simply use integer division to divide a CPU clock tick count by the number of ticks per nanosecond to obtain the number of elapsed nanoseconds. However, this naive approach has two drawbacks:
First, integer division is slow. From Agner Fog's instruction tables we see that Intel's Tiger Lake microarchitecture has a minimum latency of 15 clock cycles for 64-bit integer division, whereas the minimum latency for bit shift operations is one cycle and for 64-bit integer multiplication three cycles. For AMD's Zen 3 microarchitecture the minimum latency is 9-17 clock cycles for 64-bit integer division. This is much slower than what is required for bit shift operations (0.5 cycle minimum latency) and 64-bit integer multiplication (one cycle minimum latency).
Second, dividing clock ticks by ticks per nanosecond can also be inaccurate.
Suppose our CPU has a timestamp counter that ticks at a rate of 3 GHz. We would
expect
Fortunately, fio has a solution that avoids integer division and also improves accuracy. Integer division in general is slow, but integer division via bit-shift operations is fast. If we could do some arithmetic to convert the divisor to a power of two then we would be able to take advantage of division via bit shift operations. The equation below describes what fio ultimately does for a CPU clock that ticks at 3.33 GHz:
Instead of naively dividing clock ticks by three (when the divisor should be
3.33) we first multiply the numerator and denominator by
Even this nifty trick does not take us to the finish line, however. The problem
is that for greatest accuracy we want
ticks per msec | ticks per ns | (int) |
conversion of one second's worth of ticks (ns) | error (ns) | ||
---|---|---|---|---|---|---|
3,330,000 | 3.33 | 16 | 4.80480 | 4 | 832,500,000 | 167,500,000 |
314887.687687 | 314887 | 999,997,816 | 2184 | |||
1259550.750 | 1259550 | 999,999,403 | 597 |
So how does fio resolve this problem? Fio's solution is to carry out the
conversion from ticks to nanoseconds in two stages. The first stage converts
large chunks of ticks to nanoseconds and then the remaining ticks that are
smaller than the chunk size can be converted using the multiplication and shift
strategy described above. The product in the multiply and shift procedure is
thus capped at the chunk size to avoid overflow. The two stages are carried out
in gettime.c:__fio_gettime()
:
multiples = t >> max_cycles_shift;
nsecs = multiples * nsecs_for_max_cycles;
nsecs += ((t & max_cycles_mask) * clock_mult) >> clock_shift;
t
is the number of CPU clock ticks. Fio first uses a bit shift operation to
compute multiples
which is the number of chunks of ticks that can be removed.
Then nsecs
is assigned the number of nanoseconds corresponding to those
chunks using the pre-calculated multiplier nsecs_for_max_cycles
. Finally, the
remaining ticks are isolated using a bitmask and then converted to nanoseconds
using the multiplication and shift procedure described above. Ultimately fio
replaces integer division with two bit shift operations, two integer
multiplication operations, a bitmask operation and one addition operation.
Currently the initial chunk size that fio uses is arbitrarily defined to be one
hour via this macro definition in gettime.c
:
#define MAX_CLOCK_SEC 60*60
However, to avoid integer division the chunk size must be a power of two ticks.
Thus, fio uses its ticks per millisecond measurement to find the greatest power
of two ticks that is less than or equal to the number of ticks that would occur
in one hour. For a timestamp counter that ticks at a rate of 3 GHz, one hour of
wall clock time would correspond to
Now that we have covered the multi-stage ticks to nanosecond conversion process
let us go through a concrete example. We can see the calculations that fio uses
to determine the components used for the conversion when we run fio with the
--debug=time
option. For simplicity this example will just have fio verify
the consistency of the CPU clock using the --cpuclok-test
option instead of
running an actual workload. I will divide the debug output into three panels.
The first panel displays the result of fifty calls to get_cycles_per_msec()
.
The final lines display summary statistics for these measurements and finally
the trimmed mean that will ultimately be used for the conversion. In this case
fio measured that 2600001 clock ticks occur in each millisecond and this
value was calculated from the mean of 40 out of the 50 individual measurements.
cycles per msec
vincent@ubuntu$ ./fio --cpuclock-test --debug=time
cs: reliable_tsc: yes
time 301463 cycles[0]=2600003
time 301463 cycles[1]=2600003
time 301463 cycles[2]=2600002
time 301463 cycles[3]=2600001
time 301463 cycles[4]=2600001
time 301463 cycles[5]=2600002
time 301463 cycles[6]=2600001
time 301463 cycles[7]=2600001
time 301463 cycles[8]=2600001
time 301463 cycles[9]=2600000
time 301463 cycles[10]=2600001
time 301463 cycles[11]=2600000
time 301463 cycles[12]=2600003
time 301463 cycles[13]=2600003
time 301463 cycles[14]=2600000
time 301463 cycles[15]=2600002
time 301463 cycles[16]=2600004
time 301463 cycles[17]=2600002
time 301463 cycles[18]=2600003
time 301463 cycles[19]=2600003
time 301463 cycles[20]=2600003
time 301463 cycles[21]=2600004
time 301463 cycles[22]=2600001
time 301463 cycles[23]=2600003
time 301463 cycles[24]=2600001
time 301463 cycles[25]=2600001
time 301463 cycles[26]=2600001
time 301463 cycles[27]=2600004
time 301463 cycles[28]=2600003
time 301463 cycles[29]=2600003
time 301463 cycles[30]=2600001
time 301463 cycles[31]=2600005
time 301463 cycles[32]=2599999
time 301463 cycles[33]=2600003
time 301463 cycles[34]=2600002
time 301463 cycles[35]=2600003
time 301463 cycles[36]=2600002
time 301463 cycles[37]=2600002
time 301463 cycles[38]=2600005
time 301463 cycles[39]=2600003
time 301463 cycles[40]=2600001
time 301463 cycles[41]=2600003
time 301463 cycles[42]=2600002
time 301463 cycles[43]=2600002
time 301463 cycles[44]=2600001
time 301463 cycles[45]=2600001
time 301463 cycles[46]=2600001
time 301463 cycles[47]=2600002
time 301463 cycles[48]=2600001
time 301463 cycles[49]=2600004
time 301463 min=2599999, max=2600005, mean=2600002.060000, S=0.026314, N=50
time 301463 trimmed mean=2600001, N=40
The second set of debug output lines are below and they describe the process of
finding the largest max_ticks
is the number of times
the CPU clock would tick for the largest possible chunk size. max_mult
is the
largest possible value of the multiplier for the multiply and shift conversion.
We then use max_mult
to find the corresponding value for
multiply and shift
time 301463
max_ticks=9360003600000, __builtin_clzll=20, max_mult=1970805
time 301463 tmp=2562047, sft=1
time 301463 tmp=1281023, sft=2
time 301463 tmp=640511, sft=3
time 301463 tmp=320255, sft=4
time 301463 tmp=160127, sft=5
time 301463 tmp=80063, sft=6
time 301463 tmp=40031, sft=7
time 301463 tmp=20015, sft=8
time 301463 tmp=10007, sft=9
time 301463 tmp=5003, sft=10
time 301463 tmp=2501, sft=11
time 301463 tmp=1250, sft=12
time 301463 tmp=625, sft=13
time 301463 tmp=312, sft=14
time 301463 tmp=156, sft=15
time 301463 tmp=78, sft=16
time 301463 tmp=39, sft=17
time 301463 tmp=19, sft=18
time 301463 tmp=9, sft=19
time 301463 tmp=4, sft=20
time 301463 tmp=2, sft=21
time 301463 tmp=1, sft=22
time 301463 clock_shift=22, clock_mult=1613193
Finally the third output panel describes how fio finds a power of two ticks for
the large chunks of time. Fio begins by calculating the number of times the CPU
clock ticks in one hour (9360003600000). Then this value is repeatedly shifted
to the right to find the highest bit that is set to one. In this
case 9360003600000 has its 44th bit set to 1. So the number of ticks in a
chunk is
chunk size
time 301463 tmp=9360003600000, max_cycles_shift=0
time 301463 tmp=4680001800000, max_cycles_shift=1
time 301463 tmp=2340000900000, max_cycles_shift=2
time 301463 tmp=1170000450000, max_cycles_shift=3
time 301463 tmp=585000225000, max_cycles_shift=4
time 301463 tmp=292500112500, max_cycles_shift=5
time 301463 tmp=146250056250, max_cycles_shift=6
time 301463 tmp=73125028125, max_cycles_shift=7
time 301463 tmp=36562514062, max_cycles_shift=8
time 301463 tmp=18281257031, max_cycles_shift=9
time 301463 tmp=9140628515, max_cycles_shift=10
time 301463 tmp=4570314257, max_cycles_shift=11
time 301463 tmp=2285157128, max_cycles_shift=12
time 301463 tmp=1142578564, max_cycles_shift=13
time 301463 tmp=571289282, max_cycles_shift=14
time 301463 tmp=285644641, max_cycles_shift=15
time 301463 tmp=142822320, max_cycles_shift=16
time 301463 tmp=71411160, max_cycles_shift=17
time 301463 tmp=35705580, max_cycles_shift=18
time 301463 tmp=17852790, max_cycles_shift=19
time 301463 tmp=8926395, max_cycles_shift=20
time 301463 tmp=4463197, max_cycles_shift=21
time 301463 tmp=2231598, max_cycles_shift=22
time 301463 tmp=1115799, max_cycles_shift=23
time 301463 tmp=557899, max_cycles_shift=24
time 301463 tmp=278949, max_cycles_shift=25
time 301463 tmp=139474, max_cycles_shift=26
time 301463 tmp=69737, max_cycles_shift=27
time 301463 tmp=34868, max_cycles_shift=28
time 301463 tmp=17434, max_cycles_shift=29
time 301463 tmp=8717, max_cycles_shift=30
time 301463 tmp=4358, max_cycles_shift=31
time 301463 tmp=2179, max_cycles_shift=32
time 301463 tmp=1089, max_cycles_shift=33
time 301463 tmp=544, max_cycles_shift=34
time 301463 tmp=272, max_cycles_shift=35
time 301463 tmp=136, max_cycles_shift=36
time 301463 tmp=68, max_cycles_shift=37
time 301463 tmp=34, max_cycles_shift=38
time 301463 tmp=17, max_cycles_shift=39
time 301463 tmp=8, max_cycles_shift=40
time 301463 tmp=4, max_cycles_shift=41
time 301463 tmp=2, max_cycles_shift=42
time 301463 tmp=1, max_cycles_shift=43
time 301463 max_cycles_shift=43, 2^max_cycles_shift=8796093022208, nsecs_for_max_cycles=3383110926336, max_cycles_mask=000007ffffffffff
time 301463 cycles_start=24488928610874798
Below are the details of the calculation using the parameters selected by fio
for one hour's worth of ticks on this platform. At a rate of 2600001 ticks per
millisecond one hour would be
Step | Description | Value | running total (ns) | Calculation | Notes |
---|---|---|---|---|---|
1 | Calculate how many chunks of |
1 | 0 | 9360003600000 >> 43 = 1 | This is the first stage and is needed to avoid overflow in the multiplication when we multiply and shift to replace integer division. |
2 | Multiply number of chunks by nsec per chunk (3383110926336) | 3,383,110,926,336 | 3,383,110,926,336 | 1 * 3383110926336 = 3383110926336 | This is the nanoseconds corresponding to the large chunks of ticks. |
3 | Use a bitmask to calculate the remaining ticks after chunks have been removed | 563,910,577,792 | 3,383,110,926,336 | 9360003600000 & 0x000007ffffffffff = 563910577792 | We use multiplication and shift to divide these ticks by ticks per nanosecond. |
4 | Multiply value from Step 3 by clock_mult (1613193) |
909,696,596,720,009,856 | 3,383,110,926,336 | 563910577792 * 1613193 = 909696596720009856 | This is the multiplication in the multiply and shift procedure. |
5 | Right-shift value from Step 4 by clock_shift (22) |
216,888,570,003 | 3,383,110,926,336 | 909696596720009856 >> 22 = 216888570003 | The is the shift in the multiply and shift procedure. |
6 | Add value from Step 5 to nsec running total | 3,599,999,496,339 | 3,599,999,496,339 | 3383110926336 + 216888570003 = 3599999496339 | Add together the nanoseconds from the chunks and the multiply and shift procedure to obtain the total. |
Since we started with one hour's worth of ticks the result from the steps above should be 3 600 000 000 000 ns but we underestimated the time by 503,661 ns, an error of 0.0000714 percent
- Fio records a CPU clock tick count at job start (the final
cycles_start
line in the third debug output panel above) and for every timestamp request, the current CPU clock tick count is queried and the difference between the current count and the starting count is used to create a timestamp. The timestamp is stored as the difference between the given time and the start time in a struct timespec:
struct timespec {
time_t tv_sec; /* elapsed time in whole seconds */
long tv_nsec; /* remaining elapsed time in nanoseconds */
};
- The two-stage multiply and shift procedure produces a value for time in
nanoseconds. Converting to
struct timespec
still requires an integer division operation and a modulus operation to separate the whole seconds from the fractional seconds:
tp->tv_sec = nsecs / 1000000000ULL;
tp->tv_nsec = nsecs % 1000000000ULL;
-
When fio moved from microsecond to nanosecond measurement we considered a few options to replace the original single stage multiply and shift procedure. These included using 128-bit integers, emulating 128-bit arithmetic with 64-bit integers, and using a lock to deal with overflow. Ultimately we settled on the two-stage conversion procedure detailed here.
-
The fio repository includes a test program to evaluate various ticks to time conversion strategies: link
-
Instead of repeatedly dividing by two in order to find the most significant bit, it would involve less code to simply use
__builtin_clzll
which gives the number of leading zeroes. Changing the type fromunsigned long long
touint64_t
and subtracting the number of leading zeros from 64 gives us the position of the most significant bit. Nevertheless, the current code works correctly (as far as I can tell) and this change would require significant testing to ensure no regressions were introduced. Here is a discussion that includes example code for identifying the most significant bit using shift operations and GCC's built-in functions: https://www.geeksforgeeks.org/find-significant-set-bit-number/. -
Fio's two-stage conversion procedure was adapted for use in a timing library: https://github.com/AndreyNevolin/wtmlib
-
The author of the above timing library opened a GitHub issue for improving fio's timing code (https://github.com/axboe/fio/issues/695) which includes these suggestions:
- record cycles_per_second instead of cycles_per_millisecond for better accuracy: still open
- adjust the size for MAX_CLOCK_SEC to improve accuracy: still open
- improve
get_cycles_per_msec
: implemented by https://github.com/axboe/fio/commit/152529ae26d1167779138b6cd30d4de10623da1b
-
The debug print for
max_ticks
starts with two newlines. I have no idea why I did that originally. Fixed here. -
Instead of busy waiting in
get_cycles_per_msec()
perhaps fio shouldusleep(1280)
before starting to query the system clock. -
When fio is run for a real job with
--debug=time
it actually goes through the clock tick rate estimation process and computes the conversion parameters twice. This is probably not necessary but seems to do no harm. -
Because integer division is used everywhere, we throw away any remainder and do not round up if the remainder is more than half of the divisor. Thus fio's times likely underestimate the true time. Perhaps integer division with rounding would help mitigate this bias.
-
Integer division latency has improved significantly in recent years. Based on Agner Fog's instruction tables, for Intel's Coffee Lake (2017) microarchitecture the minimum latency for 64-bit integer division is 42-95 instruction cycles. Contrast this with the corresponding values for today's Intel's Tiger Lake microarchitecture (15 clock cycles) and AMD's Zen 3 microarchitecture (9-17 clock cycles).
-
When estimating clock ticks per millisecond, fio counts the number of clock ticks in 1280us and repeats this process 50 times. Both of these choices are essentially arbitrary, although the sample size of 50 is a commonly used rule of thumb.
-
Using the CPU clock tick counter is expected to be faster than a system call for obtaining a timestamp. However, there may be circumstances when the CPU clock tick counter is unavailable and system calls must be used. One way to speed up system calls is to use a virtual dynamic shared object (vDSO) that allows the system call code to execute in user space. This may speed up
clock_gettime
calls although in practice it does not always work out. For more details see https://berthub.eu/articles/posts/on-linux-vdso-and-clockgettime/. -
Out of order execution is a concern with the
RDTSC
instruction. We should consider the alternativeRDTSCP
possibly with the addition of barrier instructions before or after. On the 64-bit ARM architecture we had a recent fix for speculative execution: link. -
Consider replacing the inline assembly code we have for
RDTSC
with the__rdtsc()
intrinsic. -
On the x86 platform fio checks for the invariant TSC flag and only uses the CPU timestamp counter if this flag indicates that this counter is not affected by frequency scaling. The code checking this flag is here. Documentation for the CPUID call is here.