Adam Manzanares @nmtadam provided helpful feedback on an earlier draft of this post. All remaining errors and omissions are mine of course.

Fio collects multiple timestamps for every single I/O (see latency blog post). Fio also keeps track of time for many other features (runtime, bandwidth/IOPS logging, steady state detection, etc). A simple approach would query the operating system for each timestamp, but system calls are expensive and the cumulative cost could be enormous given how frequently fio records timestamps. So when possible, fio uses the CPU clock for time keeping. For modern x86-64 CPUs this is done via the RDTSC instruction which returns the number of CPU cycles elapsed since the processor was reset.

How does fio actually use CPU clock ticks for timing? This blog post provides technical documentation describing the steps in fio 3.32. The process is quite an ordeal but does include some nice engineering.

Cycles per millisecond

The first step is to relate CPU clock ticks to wall time. Fio uses system calls to query wall clock time and counts the number of CPU clock ticks in a given duration in gettime.c:get_cycles_per_msec():

get_cycles_per_msec()

static unsigned long get_cycles_per_msec(void)
{
        struct timespec s, e;
        uint64_t c_s, c_e;
        uint64_t elapsed;

        fio_get_mono_time(&s);

        c_s = get_cpu_clock();
        do {
                fio_get_mono_time(&e);
                c_e = get_cpu_clock();

                elapsed = ntime_since(&s, &e);
                if (elapsed >= 1280000)
                        break;
        } while (1);

        return (c_e - c_s) * 1000000 / elapsed;
}

Fio first records a starting wall clock timestamp and a starting CPU clock tick count. Then, the current wall clock time and CPU clock tick count are repeatedly requested until at least 1.28ms has elapsed. At that point the difference between the start and end CPU clock tick counts is divided by the elapsed wall clock time in order to calculate the number of clock ticks per millisecond.

This process is actually repeated 50 times in order to obtain a reliable measurement. Outliers are discarded and a trimmed mean is ultimately used to measure the rate at which the CPU clock ticks.

Multiply and shift

How does fio actually use this measurement to convert a clock tick count to nanoseconds? The straightforward approach would simply use integer division to divide a CPU clock tick count by the number of ticks per nanosecond to obtain the number of elapsed nanoseconds. However, this naive approach has two drawbacks:

Integer division is slow

First, integer division is slow. From Agner Fog's instruction tables we see that Intel's Tiger Lake microarchitecture has a minimum latency of 15 clock cycles for 64-bit integer division, whereas the minimum latency for bit shift operations is one cycle and for 64-bit integer multiplication three cycles. For AMD's Zen 3 microarchitecture the minimum latency is 9-17 clock cycles for 64-bit integer division. This is much slower than what is required for bit shift operations (0.5 cycle minimum latency) and 64-bit integer multiplication (one cycle minimum latency).

Integer division is inaccurate

Second, dividing clock ticks by ticks per nanosecond can also be inaccurate. Suppose our CPU has a timestamp counter that ticks at a rate of 3 GHz. We would expect $3 \cdot 10^9$ clock ticks per second or three ticks per nanosecond. So to use simple integer division for the conversion we simply divide the CPU clock tick count by three. However, what if we had a CPU clock that ticked at a rate of 2.6 GHz or 3.33 GHz? At 3.33 GHz simple integer division would still have fio divide clock ticks by three. This would be inaccurate of course. For every 30 clock ticks, fio would divide by three and report that 10ns had passed when in fact only 9.01ns had passed.

Fio's solution

Fortunately, fio has a solution that avoids integer division and also improves accuracy. Integer division in general is slow, but integer division via bit-shift operations is fast. If we could do some arithmetic to convert the divisor to a power of two then we would be able to take advantage of division via bit shift operations. The equation below describes what fio ultimately does for a CPU clock that ticks at 3.33 GHz:

$$ns = \frac{ticks}{3.33} = \frac{ticks \cdot FACTOR}{3.33 \cdot FACTOR} = \frac{ticks \cdot \frac{FACTOR}{3.33}}{FACTOR}$$

Instead of naively dividing clock ticks by three (when the divisor should be 3.33) we first multiply the numerator and denominator by $FACTOR$. Then, we rearrange the terms in order to compute elapsed nanoseconds as the product of the number of ticks and $\frac{FACTOR}{3.33}$ divided by $FACTOR$. It is easy enough to pre-compute $\frac{FACTOR}{3.33}$ and then as long as $FACTOR$ is a power of two we can convert the time consuming integer division operation into one integer multiplication and one bit shift operation.

Getting to the finish line: chunks

Even this nifty trick does not take us to the finish line, however. The problem is that for greatest accuracy we want $FACTOR$ to be as large as possible but if we choose too large a $FACTOR$ the product in the numerator can overflow. In the example above where the clock ticks 3.33 times every nanosecond we could set $FACTOR$ to 16 and then the (integer) multiplier in the numerator would be $\frac{16}{3.33} = 4$. The table below shows the outcome using 16 for $FACTOR$: $3.33 \cdot 10^9$ ticks would be converted to 832.5 ms instead of one second, an error of 17 percent. If we instead set $FACTOR$ to be $2^{20}$, then $3.33 \cdot 10^9$ ticks would be converted to 999,998 us, an error of 0.0002 percent. With $FACTOR$ set to $2^{22}$ the error is 0.00006 percent. Thus, it is clear that if we increase $FACTOR$ we obtain better measurements. However, the problem of numerical overflow in the numerator remains. With larger values for $FACTOR$ we gain accuracy but become more vulnerable to overflow as the number of ticks increases.

ticks per msec	ticks per ns	$FACTOR$	$\frac{FACTOR}{ticks\ per\ ns}$	(int) $\frac{FACTOR}{ticks\ per\ ns}$	conversion of one second's worth of ticks (ns)	error (ns)
3,330,000	3.33	16	4.80480	4	832,500,000	167,500,000
		$2^{20}$	314887.687687	314887	999,997,816	2184
		$2^{22}$	1259550.750	1259550	999,999,403	597

So how does fio resolve this problem? Fio's solution is to carry out the conversion from ticks to nanoseconds in two stages. The first stage converts large chunks of ticks to nanoseconds and then the remaining ticks that are smaller than the chunk size can be converted using the multiplication and shift strategy described above. The product in the multiply and shift procedure is thus capped at the chunk size to avoid overflow. The two stages are carried out in gettime.c:__fio_gettime():

                multiples = t >> max_cycles_shift;
                nsecs = multiples * nsecs_for_max_cycles;
                nsecs += ((t & max_cycles_mask) * clock_mult) >> clock_shift;

t is the number of CPU clock ticks. Fio first uses a bit shift operation to compute multiples which is the number of chunks of ticks that can be removed. Then nsecs is assigned the number of nanoseconds corresponding to those chunks using the pre-calculated multiplier nsecs_for_max_cycles. Finally, the remaining ticks are isolated using a bitmask and then converted to nanoseconds using the multiplication and shift procedure described above. Ultimately fio replaces integer division with two bit shift operations, two integer multiplication operations, a bitmask operation and one addition operation.

Currently the initial chunk size that fio uses is arbitrarily defined to be one hour via this macro definition in gettime.c:

#define MAX_CLOCK_SEC 60*60

However, to avoid integer division the chunk size must be a power of two ticks. Thus, fio uses its ticks per millisecond measurement to find the greatest power of two ticks that is less than or equal to the number of ticks that would occur in one hour. For a timestamp counter that ticks at a rate of 3 GHz, one hour of wall clock time would correspond to $108 \cdot 10^{11}$ ticks. The greatest power of two less than $108 \cdot 10^{11}$ is 8796093022208 or $2^{43}$. This corresponds to approximately 2932 seconds. Thus, for each tick count fio would first use a bit shift operation to divide by $2^{43}$. This quotient would be multiplied by $2932 \cdot 10^9$ to obtain the number of nanoseconds corresponding to these large chunks of time. Finally the multiply and shift procedure would be carried out on the remaining tick count and added to the nanoseconds calculated from the large chunks to complete the conversion of CPU clock ticks to nanoseconds.

Example

Now that we have covered the multi-stage ticks to nanosecond conversion process let us go through a concrete example. We can see the calculations that fio uses to determine the components used for the conversion when we run fio with the --debug=time option. For simplicity this example will just have fio verify the consistency of the CPU clock using the --cpuclok-test option instead of running an actual workload. I will divide the debug output into three panels.

`get_cycles_per_msec()`

The first panel displays the result of fifty calls to get_cycles_per_msec(). The final lines display summary statistics for these measurements and finally the trimmed mean that will ultimately be used for the conversion. In this case fio measured that 2600001 clock ticks occur in each millisecond and this value was calculated from the mean of 40 out of the 50 individual measurements.

cycles per msec

vincent@ubuntu$ ./fio --cpuclock-test --debug=time
cs: reliable_tsc: yes
time     301463 cycles[0]=2600003
time     301463 cycles[1]=2600003
time     301463 cycles[2]=2600002
time     301463 cycles[3]=2600001
time     301463 cycles[4]=2600001
time     301463 cycles[5]=2600002
time     301463 cycles[6]=2600001
time     301463 cycles[7]=2600001
time     301463 cycles[8]=2600001
time     301463 cycles[9]=2600000
time     301463 cycles[10]=2600001
time     301463 cycles[11]=2600000
time     301463 cycles[12]=2600003
time     301463 cycles[13]=2600003
time     301463 cycles[14]=2600000
time     301463 cycles[15]=2600002
time     301463 cycles[16]=2600004
time     301463 cycles[17]=2600002
time     301463 cycles[18]=2600003
time     301463 cycles[19]=2600003
time     301463 cycles[20]=2600003
time     301463 cycles[21]=2600004
time     301463 cycles[22]=2600001
time     301463 cycles[23]=2600003
time     301463 cycles[24]=2600001
time     301463 cycles[25]=2600001
time     301463 cycles[26]=2600001
time     301463 cycles[27]=2600004
time     301463 cycles[28]=2600003
time     301463 cycles[29]=2600003
time     301463 cycles[30]=2600001
time     301463 cycles[31]=2600005
time     301463 cycles[32]=2599999
time     301463 cycles[33]=2600003
time     301463 cycles[34]=2600002
time     301463 cycles[35]=2600003
time     301463 cycles[36]=2600002
time     301463 cycles[37]=2600002
time     301463 cycles[38]=2600005
time     301463 cycles[39]=2600003
time     301463 cycles[40]=2600001
time     301463 cycles[41]=2600003
time     301463 cycles[42]=2600002
time     301463 cycles[43]=2600002
time     301463 cycles[44]=2600001
time     301463 cycles[45]=2600001
time     301463 cycles[46]=2600001
time     301463 cycles[47]=2600002
time     301463 cycles[48]=2600001
time     301463 cycles[49]=2600004
time     301463 min=2599999, max=2600005, mean=2600002.060000, S=0.026314, N=50
time     301463 trimmed mean=2600001, N=40

Multiply and shift

The second set of debug output lines are below and they describe the process of finding the largest $FACTOR$ that will not cause overflow when the multiplication and shift operation is used for converting ticks to nanoseconds for a period shorter than the chunk size. max_ticks is the number of times the CPU clock would tick for the largest possible chunk size. max_mult is the largest possible value of the multiplier for the multiply and shift conversion. We then use max_mult to find the corresponding value for $FACTOR$ and repeatedly divide this value by two to find the most significant bit. We loop 22 times to determine the highest significant bit and identify $2^{22}$ as the largest value for $FACTOR$ that will avoid overflow in the product when we carry out our multiply and shift procedure. The multiplier we actually use is also calculated as $\frac{FACTOR}{ticks\ per\ ns}$ or 1613193. Thus, for the multiplication and shift procedure, we will multiply the CPU clock tick count by 1613193 and then shift right by 22 bits. The result will be time in nanoseconds.

multiply and shift

time     301463

max_ticks=9360003600000, __builtin_clzll=20, max_mult=1970805
time     301463 tmp=2562047, sft=1
time     301463 tmp=1281023, sft=2
time     301463 tmp=640511, sft=3
time     301463 tmp=320255, sft=4
time     301463 tmp=160127, sft=5
time     301463 tmp=80063, sft=6
time     301463 tmp=40031, sft=7
time     301463 tmp=20015, sft=8
time     301463 tmp=10007, sft=9
time     301463 tmp=5003, sft=10
time     301463 tmp=2501, sft=11
time     301463 tmp=1250, sft=12
time     301463 tmp=625, sft=13
time     301463 tmp=312, sft=14
time     301463 tmp=156, sft=15
time     301463 tmp=78, sft=16
time     301463 tmp=39, sft=17
time     301463 tmp=19, sft=18
time     301463 tmp=9, sft=19
time     301463 tmp=4, sft=20
time     301463 tmp=2, sft=21
time     301463 tmp=1, sft=22
time     301463 clock_shift=22, clock_mult=1613193

Chunk size

Finally the third output panel describes how fio finds a power of two ticks for the large chunks of time. Fio begins by calculating the number of times the CPU clock ticks in one hour (9360003600000). Then this value is repeatedly shifted to the right to find the highest bit that is set to one. In this case 9360003600000 has its 44th bit set to 1. So the number of ticks in a chunk is $2^{43}$ which corresponds to 3383110926336 ns. The bitmask we use is 0x000007ffffffffff which will zero out bits 44 and higher in order to eliminate all the chunks of ticks before using our multiply and shift conversion procedure.

chunk size

time     301463 tmp=9360003600000, max_cycles_shift=0
time     301463 tmp=4680001800000, max_cycles_shift=1
time     301463 tmp=2340000900000, max_cycles_shift=2
time     301463 tmp=1170000450000, max_cycles_shift=3
time     301463 tmp=585000225000, max_cycles_shift=4
time     301463 tmp=292500112500, max_cycles_shift=5
time     301463 tmp=146250056250, max_cycles_shift=6
time     301463 tmp=73125028125, max_cycles_shift=7
time     301463 tmp=36562514062, max_cycles_shift=8
time     301463 tmp=18281257031, max_cycles_shift=9
time     301463 tmp=9140628515, max_cycles_shift=10
time     301463 tmp=4570314257, max_cycles_shift=11
time     301463 tmp=2285157128, max_cycles_shift=12
time     301463 tmp=1142578564, max_cycles_shift=13
time     301463 tmp=571289282, max_cycles_shift=14
time     301463 tmp=285644641, max_cycles_shift=15
time     301463 tmp=142822320, max_cycles_shift=16
time     301463 tmp=71411160, max_cycles_shift=17
time     301463 tmp=35705580, max_cycles_shift=18
time     301463 tmp=17852790, max_cycles_shift=19
time     301463 tmp=8926395, max_cycles_shift=20
time     301463 tmp=4463197, max_cycles_shift=21
time     301463 tmp=2231598, max_cycles_shift=22
time     301463 tmp=1115799, max_cycles_shift=23
time     301463 tmp=557899, max_cycles_shift=24
time     301463 tmp=278949, max_cycles_shift=25
time     301463 tmp=139474, max_cycles_shift=26
time     301463 tmp=69737, max_cycles_shift=27
time     301463 tmp=34868, max_cycles_shift=28
time     301463 tmp=17434, max_cycles_shift=29
time     301463 tmp=8717, max_cycles_shift=30
time     301463 tmp=4358, max_cycles_shift=31
time     301463 tmp=2179, max_cycles_shift=32
time     301463 tmp=1089, max_cycles_shift=33
time     301463 tmp=544, max_cycles_shift=34
time     301463 tmp=272, max_cycles_shift=35
time     301463 tmp=136, max_cycles_shift=36
time     301463 tmp=68, max_cycles_shift=37
time     301463 tmp=34, max_cycles_shift=38
time     301463 tmp=17, max_cycles_shift=39
time     301463 tmp=8, max_cycles_shift=40
time     301463 tmp=4, max_cycles_shift=41
time     301463 tmp=2, max_cycles_shift=42
time     301463 tmp=1, max_cycles_shift=43
time     301463 max_cycles_shift=43, 2^max_cycles_shift=8796093022208, nsecs_for_max_cycles=3383110926336, max_cycles_mask=000007ffffffffff
time     301463 cycles_start=24488928610874798

One hour's worth of ticks

Below are the details of the calculation using the parameters selected by fio for one hour's worth of ticks on this platform. At a rate of 2600001 ticks per millisecond one hour would be $60 \cdot 60 \cdot 2600001 \cdot 1000 = 9360003600000$ ticks. Here are the steps:

Step	Description	Value	running total (ns)	Calculation	Notes
1	Calculate how many chunks of $2^{43}$ ticks are in 9,360,003,600,000 ticks	1	0	9360003600000 >> 43 = 1	This is the first stage and is needed to avoid overflow in the multiplication when we multiply and shift to replace integer division.
2	Multiply number of chunks by nsec per chunk (3383110926336)	3,383,110,926,336	3,383,110,926,336	1 * 3383110926336 = 3383110926336	This is the nanoseconds corresponding to the large chunks of ticks.
3	Use a bitmask to calculate the remaining ticks after chunks have been removed	563,910,577,792	3,383,110,926,336	9360003600000 & 0x000007ffffffffff = 563910577792	We use multiplication and shift to divide these ticks by ticks per nanosecond.
4	Multiply value from Step 3 by `clock_mult` (1613193)	909,696,596,720,009,856	3,383,110,926,336	563910577792 * 1613193 = 909696596720009856	This is the multiplication in the multiply and shift procedure.
5	Right-shift value from Step 4 by `clock_shift` (22)	216,888,570,003	3,383,110,926,336	909696596720009856 >> 22 = 216888570003	The is the shift in the multiply and shift procedure.
6	Add value from Step 5 to nsec running total	3,599,999,496,339	3,599,999,496,339	3383110926336 + 216888570003 = 3599999496339	Add together the nanoseconds from the chunks and the multiply and shift procedure to obtain the total.

Since we started with one hour's worth of ticks the result from the steps above should be 3 600 000 000 000 ns but we underestimated the time by 503,661 ns, an error of 0.0000714 percent

Notes

Fio records a CPU clock tick count at job start (the final cycles_start line in the third debug output panel above) and for every timestamp request, the current CPU clock tick count is queried and the difference between the current count and the starting count is used to create a timestamp. The timestamp is stored as the difference between the given time and the start time in a struct timespec:

struct timespec {
                         time_t          tv_sec;	/* elapsed time in whole seconds */
                         long            tv_nsec;	/* remaining elapsed time in nanoseconds */
                 };

The two-stage multiply and shift procedure produces a value for time in nanoseconds. Converting to struct timespec still requires an integer division operation and a modulus operation to separate the whole seconds from the fractional seconds:

                tp->tv_sec = nsecs / 1000000000ULL;
                tp->tv_nsec = nsecs % 1000000000ULL;

When fio moved from microsecond to nanosecond measurement we considered a few options to replace the original single stage multiply and shift procedure. These included using 128-bit integers, emulating 128-bit arithmetic with 64-bit integers, and using a lock to deal with overflow. Ultimately we settled on the two-stage conversion procedure detailed here.
The fio repository includes a test program to evaluate various ticks to time conversion strategies: link
Instead of repeatedly dividing by two in order to find the most significant bit, it would involve less code to simply use __builtin_clzll which gives the number of leading zeroes. Changing the type from unsigned long long to uint64_t and subtracting the number of leading zeros from 64 gives us the position of the most significant bit. Nevertheless, the current code works correctly (as far as I can tell) and this change would require significant testing to ensure no regressions were introduced. Here is a discussion that includes example code for identifying the most significant bit using shift operations and GCC's built-in functions: https://www.geeksforgeeks.org/find-significant-set-bit-number/.
Fio's two-stage conversion procedure was adapted for use in a timing library: https://github.com/AndreyNevolin/wtmlib
The author of the above timing library opened a GitHub issue for improving fio's timing code (https://github.com/axboe/fio/issues/695) which includes these suggestions:
- record cycles_per_second instead of cycles_per_millisecond for better accuracy: still open
- adjust the size for MAX_CLOCK_SEC to improve accuracy: still open
- improve get_cycles_per_msec: implemented by https://github.com/axboe/fio/commit/152529ae26d1167779138b6cd30d4de10623da1b
The debug print for max_ticks starts with two newlines. I have no idea why I did that originally. Fixed here.
Instead of busy waiting in get_cycles_per_msec() perhaps fio should usleep(1280) before starting to query the system clock.
When fio is run for a real job with --debug=time it actually goes through the clock tick rate estimation process and computes the conversion parameters twice. This is probably not necessary but seems to do no harm.
Because integer division is used everywhere, we throw away any remainder and do not round up if the remainder is more than half of the divisor. Thus fio's times likely underestimate the true time. Perhaps integer division with rounding would help mitigate this bias.
Integer division latency has improved significantly in recent years. Based on Agner Fog's instruction tables, for Intel's Coffee Lake (2017) microarchitecture the minimum latency for 64-bit integer division is 42-95 instruction cycles. Contrast this with the corresponding values for today's Intel's Tiger Lake microarchitecture (15 clock cycles) and AMD's Zen 3 microarchitecture (9-17 clock cycles).
When estimating clock ticks per millisecond, fio counts the number of clock ticks in 1280us and repeats this process 50 times. Both of these choices are essentially arbitrary, although the sample size of 50 is a commonly used rule of thumb.
Using the CPU clock tick counter is expected to be faster than a system call for obtaining a timestamp. However, there may be circumstances when the CPU clock tick counter is unavailable and system calls must be used. One way to speed up system calls is to use a virtual dynamic shared object (vDSO) that allows the system call code to execute in user space. This may speed up clock_gettime calls although in practice it does not always work out. For more details see https://berthub.eu/articles/posts/on-linux-vdso-and-clockgettime/.
Out of order execution is a concern with the RDTSC instruction. We should consider the alternative RDTSCP possibly with the addition of barrier instructions before or after. On the 64-bit ARM architecture we had a recent fix for speculative execution: link.
Consider replacing the inline assembly code we have for RDTSC with the __rdtsc() intrinsic.
On the x86 platform fio checks for the invariant TSC flag and only uses the CPU timestamp counter if this flag indicates that this counter is not affected by frequency scaling. The code checking this flag is here. Documentation for the CPUID call is here.

How does fio keep time - vincentkfu/fio-blog GitHub Wiki

Cycles per millisecond

Multiply and shift

Integer division is slow

Integer division is inaccurate

Fio's solution

Getting to the finish line: chunks

Example

`get_cycles_per_msec()`

Multiply and shift

Chunk size

One hour's worth of ticks

Notes

References

⚠️ GitHub.com Fallback ⚠️

How does fio keep time - vincentkfu/fio-blog GitHub Wiki

Cycles per millisecond

Multiply and shift

Integer division is slow

Integer division is inaccurate

Fio's solution

Getting to the finish line: chunks

Example

get_cycles_per_msec()

Multiply and shift

Chunk size

One hour's worth of ticks

Notes

References

⚠️ **GitHub.com Fallback** ⚠️

`get_cycles_per_msec()`

⚠️ GitHub.com Fallback ⚠️