FPGA Timing Notes - barawn/verilog-library-barawn GitHub Wiki

Overview of various timing details.

MMCM input period specifications

Generally, slowing an FPGA design's clock down doesn't cause serious problems because they're designed for zero hold, so slowing the input clock down just gains you setup slack. However, an important note is that the MMCM/PLLs inside FPGAs have a limited VCO range and so they cannot just take a much slower clock and Just Work.

In addition, both the PLL and MMCM have a minimum input clock frequency (that is, a maximum input period), so there are 2 constraints to consider: the input clock and the VCO. Obviously there's also a max input clock but it's psycho high so that doesn't usually matter.

The specs are generally in the DC and AC Switching Characteristics portion of the datasheet, but I think they're universal between the families within each generation:

7 series MMCM: min clock 10 MHz, min VCO 600 MHz, max VCO 1200/1440/1600 MHz depending on speed grade
7 series PLL: min clock 19 MHz, min VCO 800 MHz, max VCO 1600/1866/2133 MHz depending on speed grade
UltraScale+ MMCM: min clock 10 MHz, VCO 800-1600 MHz regardless of grade
UltraScale+ PLL: min clock 70 MHz, VCO 750-1500 MHz regardless of grade

Figuring out the VCO frequency directly from the MMCM device instantiation is just:

VCO = clock input freq * CLKFBOUT_MULT_F / DIVCLK_DIVIDE
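
As a quick sanity check, the formula can be wrapped in a few lines of Python. This is only a sketch: the function names and the `VCO_LIMITS_MHZ` table are made up for illustration, and the table uses the slowest-speed-grade maximum from the list above wherever the maximum depends on grade.

```python
# Hypothetical helper, not part of any vendor tool.
# Limits are (min VCO, max VCO) in MHz, taken from the list above;
# where the max depends on speed grade, the slowest grade is assumed.
VCO_LIMITS_MHZ = {
    "7series_mmcm": (600.0, 1200.0),
    "7series_pll": (800.0, 1600.0),
    "usplus_mmcm": (800.0, 1600.0),
    "usplus_pll": (750.0, 1500.0),
}


def vco_mhz(clkin_mhz, clkfbout_mult_f, divclk_divide):
    """VCO = clock input freq * CLKFBOUT_MULT_F / DIVCLK_DIVIDE."""
    return clkin_mhz * clkfbout_mult_f / divclk_divide


def vco_in_range(primitive, clkin_mhz, clkfbout_mult_f, divclk_divide):
    """True if the resulting VCO frequency sits inside the assumed limits."""
    lo, hi = VCO_LIMITS_MHZ[primitive]
    return lo <= vco_mhz(clkin_mhz, clkfbout_mult_f, divclk_divide) <= hi


# A 100 MHz input with MULT=10, DIVIDE=1 gives a 1000 MHz VCO: fine.
print(vco_mhz(100.0, 10.0, 1), vco_in_range("7series_mmcm", 100.0, 10.0, 1))
# Slow the same design's input to 25 MHz and the VCO drops to 250 MHz: not fine.
print(vco_mhz(25.0, 10.0, 1), vco_in_range("7series_mmcm", 25.0, 10.0, 1))
```

The second call is the "just slow the clock down" case: the input is still above the 10 MHz minimum, but the VCO falls out of range unless the multiplier is changed too.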

Phase tracking registers between synchronous domains

In order to do proper timing between two synchronous domains of different frequencies, you have to know the phase relationship between the two domains - you have to create phase tracking registers to ensure that the two clock domains know what the other domains are doing.

Note that these phase tracking registers don't have to be global, you can create local copies of them wherever you cross domains. It's only a few registers so it might make sense to do this rather than rely on Vivado replicating them.

Consider a simple case: clock A, and clock B which is "N" times clock A. Transferring data between them could just be done by capturing directly, but going from B to A (fast to slow) you would have to hold the data in the B (fast) domain for N clocks to ensure it's captured in the A (slow) domain, and for the reverse (slow to fast) you would not know exactly when "new" data shows up, leading to a clock uncertainty. This can be avoided if clock B knows "when" clock A is about to clock and when it has just clocked.

How do we do this? Simple.

Create a toggling flipflop in the slower clock domain. If the two clocks do not have an integer relationship, create a new domain which has a period that is the LCM of the 2 clock domain periods, and create a toggling flipflop in that domain.
Capture the toggling flipflop in the fast clock domain(s). The change in this toggle tells you when the slower domain changed.
Delay the detection of that toggle by the appropriate number of clocks to be a flag indicating when the slow domain either has changed or is about to change. Generally it's better to do the latter (indicate that it's about to change) because "has changed" is always only 1 clock later.
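
The three steps above can be checked with a small cycle-level model. This is a behavioral sketch in Python, not RTL: it assumes the fast clock is exactly N times the slow clock with aligned edges, that every flop samples the value present just before its clock edge, and that N is at least 2. The function name and the choice of an (N - 2)-deep delay are assumptions made for this model.

```python
def about_to_change_flags(n, fast_ticks):
    """Model the phase tracking registers for a fast clock that is n x the slow clock.

    Returns, for each fast clock edge, the value of the "slow domain is about
    to change" flag after that edge.
    """
    assert n >= 2
    toggle = 0             # toggling flipflop in the slow domain
    q = qq = 0             # toggle captured in the fast domain, and its delayed copy
    delay = [0] * (n - 2)  # delay line applied to the change detect
    flags = []
    for t in range(fast_ticks):
        # Everything below samples pre-edge values, like real flops.
        detect_pre = q ^ qq
        new_q, new_qq = toggle, q
        new_delay = ([detect_pre] + delay[:-1]) if delay else []
        if t % n == 0:     # slow clock edge coincides with this fast edge
            toggle ^= 1
        q, qq, delay = new_q, new_qq, new_delay
        flags.append(delay[-1] if delay else q ^ qq)
    return flags


# With n = 4 the slow clock rises on fast ticks 0, 4, 8, ...
# and the flag is high after ticks 3, 7, 11: one fast clock before each slow edge.
print(about_to_change_flags(4, 12))
```

Used as a clock enable, that flag makes a fast-domain register update on the same edge as the slow clock. Registering it once more gives the "has changed" version.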

Crossing between synchronous domains of different frequencies

FPGAs have lots of clocks, so you shouldn't be afraid of hopping clock domains if it makes resource usage easier. But in order to pass data between synchronous domains without FIFOs, you need to have a "hop buffer" along with phase tracking.

Consider 2 clocks at a 4:3 relationship: say clock A is 100 MHz (10 ns period), and clock B is 133.33 MHz (7.5 ns period). If you have 8x data/clock in A, you'll have 6x data in clock B (hence the resource usage savings). In order to transfer that data over, you need a hop buffer of width data_width*(LCM(tA,tB)/tA), where tA/tB are the clock periods. So here, 24x data (since data width is 8, and the LCM of the two periods is 30 ns, divided by 10 is 3). And you also need to know the phase of each clock: here, the common period is 30 ns, so there are 3 phases in A (PA=0,1,2) and 4 phases in B (PB=0,1,2,3).

The timings of those phases (in our example) occur at PA = 0, 10, 20 ns, and PB=0, 7.5, 15, 22.5 ns.
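
The same numbers can be reproduced for other ratios with a short Python sketch. The function names are made up for illustration; periods are handled as exact fractions so that 7.5 ns does not pick up rounding error.

```python
from fractions import Fraction
from math import gcd


def common_period(t_a, t_b):
    """LCM of two rational periods: lcm(numerators) / gcd(denominators)."""
    num = t_a.numerator * t_b.numerator // gcd(t_a.numerator, t_b.numerator)
    den = gcd(t_a.denominator, t_b.denominator)
    return Fraction(num, den)


def hop_buffer(t_a, t_b, data_width):
    """Return (common period, hop buffer width, PA edge times, PB edge times)."""
    t_a, t_b = Fraction(t_a), Fraction(t_b)
    period = common_period(t_a, t_b)
    phases_a = int(period / t_a)
    phases_b = int(period / t_b)
    width = data_width * phases_a  # data_width * (LCM(tA, tB) / tA)
    pa = [float(k * t_a) for k in range(phases_a)]
    pb = [float(k * t_b) for k in range(phases_b)]
    return float(period), width, pa, pb


# The 100 MHz / 133.33 MHz example: periods of 10 ns and 7.5 ns, 8x data in A.
print(hop_buffer("10", "7.5", 8))
```

For the example this gives a 30 ns common period, a 24-wide hop buffer, and the phase times listed above.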

You use those phase tracking registers as clock enables to define the launch and capture clocks for the data: for instance, in this example, you launch data at PA = 1, and capture at PB = 5 (the next sequence's PB=1). Each launch/capture has slightly different timing but because the overall period is 3tA (or 4tB), the setup/hold windows are very generous.

To get Vivado to do this, the key is to understand the utter bullcrap that is "set_multicycle_path." Ignore everything that you think from the name of that command. What "set_multicycle_path" does is move the launch and capture clocks. Which is exactly what we want.

However, you need to remember that Vivado aligns clocks at the worst-case alignment. The worst case setup alignment going from A to B is "launch at 20 ns" and "capture at 22.5 ns". Vivado will realign that to "launch at 0" and "capture at 2.5". The worst case hold alignment going from A to B is "launch at 0 ns and capture at 0 ns."
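
To see where the worst-case setup alignment comes from, here is a small Python sketch that brute-forces the closest launch-then-capture edge pair over the common period. It assumes both clocks have a rising edge at t=0 and only models the "smallest positive gap" idea described above, not Vivado's actual timing engine. The function name is hypothetical.

```python
from fractions import Fraction


def worst_case_setup(t_launch, t_capture, period):
    """Find the launch/capture edge pair with the smallest positive gap.

    Launch edges are taken over one common period, capture edges over two,
    so a capture in the next sequence is also considered.
    """
    t_launch, t_capture, period = map(Fraction, (t_launch, t_capture, period))
    launches = [k * t_launch for k in range(int(period / t_launch))]
    captures = [k * t_capture for k in range(2 * int(period / t_capture) + 1)]
    pairs = [(c - l, l, c) for l in launches for c in captures if c > l]
    delta, launch, capture = min(pairs)
    return float(launch), float(capture), float(delta)


# A (10 ns) to B (7.5 ns) over the 30 ns common period.
launch, capture, delta = worst_case_setup("10", "7.5", "30")
print(launch, capture, delta)
# Realigned so the launch edge is t=0, the capture edge lands at t=delta.
print(0.0, delta)
```

For A to B this picks the 20 ns launch and 22.5 ns capture edges, a 2.5 ns gap, which is the default setup relationship that the multicycle constraints then have to move.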

So for instance, in the full example we have a timing diagram of (here A=ACLK, B=MEMCLK):

[Figure: sequenced_buffer timing diagram]

Clock A launches 8x data at PA 0, 1, and 2 - call the three data words A[7:0], B[7:0], C[7:0].
Clock B captures 6x data at PB 0, 1, 2, 3. It captures B[5:0] at 0, B[7:6] and C[3:0] at 1, A[1:0] and C[7:4] at 2, and A[7:2] at 3.

To deal with the "Vivado alignment", note that what Vivado calls "t=0" occurs at PA 2 (and the next edge of B is PB 3). So if we call these PA 0S/1S/2S and PB 0S/1S/2S/3S, we just have to "add 1" (modulo 3 and 4) from our phases to get "Vivado setup phases." Because the worst-case hold alignment is the same as our normal alignment, we don't have to do anything there.

So now if we consider B[5:0], this is launched at PA 1 (PA 2S) and captured at PB 0 (PB 1S). We therefore need to "move ahead" the launch clock by 2, and "move ahead" the capture clock by 5 (because we need the capture clock after the launch clock, so we move it into the next 4 clock sequence).

However, this movement now screws up the hold check, because in the hold check alignment (which is identical to our alignment) it now launches at PA 2 and captures at PB 5. We therefore need to move the launch edge back to PA 1, and capture edge back to PB 0.

# move setup launch forward 2 source clocks
set_multicycle_path -setup -start -from [get_cells $sync_xfr_srcB0] -to [get_cells $sync_xfr_dst] 2
# move setup capture forward 5 destination clocks
set_multicycle_path -setup -end -from [get_cells $sync_xfr_srcB0] -to [get_cells $sync_xfr_dst] 5
# move hold launch back 1 source clock
set_multicycle_path -hold -start -from [get_cells $sync_xfr_srcB0] -to [get_cells $sync_xfr_dst] 1
# move hold capture back 5
set_multicycle_path -hold -end -from [get_cells $sync_xfr_srcB0] -to [get_cells $sync_xfr_dst] 5

The math here is:

We start with launch at t=0, capture at t=2.5 ns (closest alignment) for a setup check, and launch at t=0, capture at t=0 for a hold check.
We move the launch clock forward 20 ns for both setup and hold
We move the capture clock forward 37.5 ns for both setup and hold. Setup is now launch at 20 ns, capture at 40 ns (correct), but hold is launch at 20 ns, capture at 37.5 ns (very wrong).
We move launch back 10 ns and capture clock back 37.5 ns. Hold is now launch at 10 ns, and capture at 0 ns (correct).
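
The bookkeeping in those steps is easy to get wrong by hand, so here it is as a few lines of Python. This only replays the arithmetic as described in these notes (each multiplier moves an edge by that many source or destination clock periods). It is not a model of how Vivado internally derives its checks, and the `move` helper is made up for illustration.

```python
T_A = 10.0  # source clock (A) period, ns
T_B = 7.5   # destination clock (B) period, ns


def move(edges, launch_clocks, capture_clocks):
    """Shift a (launch, capture) edge pair by whole source/destination clocks."""
    launch, capture = edges
    return (launch + launch_clocks * T_A, capture + capture_clocks * T_B)


# Starting points: closest alignment for setup, coincident edges for hold.
setup_start = (0.0, 2.5)
hold_start = (0.0, 0.0)

# The setup constraints (2 source clocks, 5 destination clocks) move both checks.
setup = move(setup_start, 2, 5)
hold_after_setup = move(hold_start, 2, 5)

# The hold constraints then pull the hold check back (1 source, 5 destination).
hold = move(hold_after_setup, -1, -5)

print("setup:", setup)
print("hold after setup moves:", hold_after_setup)
print("hold after hold moves:", hold)
```

Swapping in different multipliers is a quick way to check the constraints for the other slices of the hop buffer before committing them to the XDC.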

Changing LUT INITs post-routing - preventing timing issues

The router is frustratingly smart enough to recognize when LUTs are pointless. So if a LUT's INIT value is all zeros (never set the output) or all ones (always set the output) the timing path between the inputs and outputs goes away. It's even smart enough to recognize when a specific input is unused, too.

The main problem with this is that it often won't hold fix those paths if it thinks they're not timed. So even if you've got plenty of timing margin, the device will still fail because the route needs to be longer.

This is a problem if you plan on changing the INIT values in a post-route script for some reason. So instead, set the INIT value to either 1 (only set if all inputs are zero) or only the high bit (only set if all inputs are one). Then the router will recognize that it has to time all of those inputs, and when you swap the INIT later, the device will still pass timing.
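
To convince yourself which placeholder INIT values keep every input "live", here is a Python sketch that checks which inputs of a LUT can actually affect its output. It assumes the usual truth-table convention where INIT bit k is the output when the inputs, read as a binary number with I0 as the least significant bit, equal k. The function name is hypothetical, and this is only the logical dependence check, not whatever analysis the tools really run.

```python
def used_inputs(init, n_inputs):
    """Return the list of input indices that can change the LUT output."""
    used = []
    for i in range(n_inputs):
        for addr in range(1 << n_inputs):
            # Flip input i and see whether the output differs for any address.
            if ((init >> addr) & 1) != ((init >> (addr ^ (1 << i))) & 1):
                used.append(i)
                break
    return used


LUT6_ALL_ZEROS = 0x0000000000000000
LUT6_ALL_ONES = 0xFFFFFFFFFFFFFFFF
LUT6_ONLY_LOW_BIT = 0x0000000000000001   # set only if all inputs are zero
LUT6_ONLY_HIGH_BIT = 0x8000000000000000  # set only if all inputs are one
LUT6_JUST_I0 = 0xAAAAAAAAAAAAAAAA        # output follows I0, other inputs unused

for name, init in [
    ("all zeros", LUT6_ALL_ZEROS),
    ("all ones", LUT6_ALL_ONES),
    ("low bit", LUT6_ONLY_LOW_BIT),
    ("high bit", LUT6_ONLY_HIGH_BIT),
    ("just I0", LUT6_JUST_I0),
]:
    print(name, used_inputs(init, 6))
```

The constant INITs use no inputs, the low-bit and high-bit INITs use all six, and a pattern like the last one shows the "specific input is unused" case.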