Projects: FFT Formal Proof - StanfordVLSI/Genesis2 GitHub Wiki

Table of Contents

DO NOT READ THIS PAPER!!!

I mean, sure, you can read it if you want to, but it's pretty awful. All the information is there, but it's just not arranged very well (IMO as the author). I'm starting all over with a second attempt here:

Meanwhile, continue reading this paper if you like, but read on at your own risk...paper continues below...

The Problem

A detailed description of the problem we want to solve can be found in the 2013 VLSI-SoC paper

To summarize the problem here: take a standard schedule of operations for a Fast Fourier Transform processor (FFT) based on a radix-2 Cooley-Tukey algorithm [ref], e.g. for D=8 datapoints and one butterfly unit, as shown in Figure 1. (Note: for your convenience, symbols and conventions used in the ensuing discussion are elaborated in the Nomenclature section at the end of the paper). FIGURE 1

That is, suppose we have eight datapoints d0 through d7 that we want to transform from the time domain to the frequency domain. In Stage 0 of operation, a radix-2 butterfly unit reads datapoints d0 and d1, producing two results that then overwrite the original locations of d0 and d1 in memory. Similar operations occur for datapoint pairs (d2,d3), (d4,d5) and (d6,d7) respectively. In Stage 1, the datapoints are processed with stride-two pairings e.g. (d0,d2), (d4,d6), (d1,d3) and (d5,d7). And finally Stage 2 completes the operation, re-walking the data by stride 4. Table 1 summarizes this access pattern.

 TABLE 1
 Stage 0: (d0,d1) (d2,d3) (d4,d5) (d6,d7)
 Stage 1: (d0,d2) (d4,d6) (d1,d3) (d5,d7)
 Stage 2: (d0,d4) (d1,d5) (d2,d6) (d3,d7)

If the butterfly unit is pipelined such that result writebacks occur at the same time as operand reads, then datapoints (d0,d1) get written to memory at the same time as (d2,d3) are being read from memory. For memory to be able to supply all four datapoints at once, it needs to be four-ported. Alternatively, we could try to use two separate banks of two-ported memory or even four banks of single-ported memory. Of these three choices, four banks of single-ported memory is most efficient in terms of area, so we'd like to make that work if possible.

Now: Within any stage, look at any sequential group of four operands starting on a mod-two boundary. We'll call such a group a quad. If the quad happens to start on a mod-four boundary, we'll call this an aligned quad. Stage 0 has three overlapping quads: (d0,d1,d2,d3), (d2,d3,d4,d5) and (d4,d5,d6,d7), the first and last of which are aligned quads. Our goal is to map each datapoint to one of four memory banks (m0,m1,m2,m3) such that every quad contains exactly one of each bank number m0, m1, m2 and m3. For example, we could map datapoints (d0,d1,d2,d3,d4,d5,d6,d7) to memory banks (m0,m1,m2,m3,m0,m1,m2,m3) respectively, and thus fulfill our goal for each of the three quads in Stage 0:

  • (d0,d1,d2,d3) maps to (m0,m1,m2,m3)
  • (d2,d3,d4,d5) maps to (m2,m3,m0,m1)
  • (d4,d5,d6,d7) maps to (m0,m1,m2,m3)
The first quad in Stage 1, however, fails because its datapoints (d0,d2,d4,d6) would map to (m0,m2,m0,m2). Banks m1 and m3 are not represented in the quad, while m0 and m2 each appear twice.

We're going to fulfill our goal by enforcing an order to each aligned quad, such that every aligned quad within a stage has the same memory bank ordering e.g. (m0,m1,m2,m3). This means that every unaligned quad will also have a consistent ordering, e.g. (m2,m3,m1,m0), and thus we meet our goal of uniqueness for all quads. This property will hold whether the transform is operating on eight datapoints, with two aligned quads per stage, or eight thousand datapoints, with two thousand quads per stage.

In theory the four operations in a given stage all happen simultaneously. In practice, it's not always practical to have enough butterfly units for that. So for instance, if we only have one butterfly unit, it will have to be used four times in succession to complete the transformation for a given stage, using the following four operations:

  1. read, process and writeback datapoints d0 and d1;
  2. read, process and writeback datapoints d2 and d3;
  3. read, process and writeback datapoints d4 and d5;
  4. read, process and writeback datapoints d6 and d7.
In pipelined operation, the writeback of a given data pair occurs at the same time as the read of the next pair of datapoints. Thus for example datapoints (d0,d1,d2,d3) will all have to be accessible at the same time within the same cycle, as will the datapoints in each of the other quads. We want to use single-ported memory for the datapoints, and that is why we require that the datapoints in each quad must all live in separate memory banks (and assuming, of course, that each memory bank has its own separate path to the butterfly unit).

Developing the proof for one butterfly unit (B=1)

Key Insights

Insight 1: We can ensure that each of the four memory banks m0, m1, m2 and m3 are represented in every aligned quad, in some order

Once again, here is our schedule for D=8 datapoints:

  • Stage 0:  (d0,d1)  (d2,d3)  (d4,d5)  (d6,d7)
  • Stage 1:  (d0,d2)  (d4,d6)  (d1,d3)  (d5,d7)
  • Stage 2:  (d0,d4)  (d1,d5)  (d2,d6)  (d3,d7)
Looking carefully at the aligned quads, we see that they all share a unique property. For example, consider the quad represented by the four operands in the last two operations of Stage 1: (d1,d3),(d5,d7). Looking at the binary representation of the indices within any such group of four, only two bits change, the bits ds+1 and ds where s is the stage number and bit d0 is the least significant bit, or LSB. [NB] We call these two bits the toggling bits, and they always count 00, 01, 10, 11 within an aligned quad. In our example, the binary indices for (d1,d3),(d5,d7) are (00001,00011),(00101,00111). The toggling bits d2 and d1 count (00,01),(10,11) while the non-toggling bits d4d3d0 are always 001.

Given this information, we can assign banks to operands such that banknum = PodPev where Pod is the parity of ALL odd bits and Pev is the parity of ALL even bits. This means that any given sequential block of four operands will map to four separate banks.

In our example quad [(d1,d3),(d5,d7)], or [(00001,00011),(00101,00111)]

  • d1: Pod(00001) = parity(0,0) = 0;    Pev(00001) = parity(0,0,1) = 1;    PodPev = 01;    banknum m = 1
  • d3: Pod(00011) = parity(0,1) = 1;    Pev(00011) = parity(0,0,1) = 1;    PodPev = 11;    banknum m = 3
  • d5: Pod(00101) = parity(0,0) = 0;    Pev(00101) = parity(0,1,1) = 0;    PodPev = 00;    banknum m = 0
  • d7: Pod(00111) = parity(0,1) = 1;    Pev(00111) = parity(0,1,1) = 0;    PodPev = 10;    banknum m = 2

Insight 2: We can reorder the datapoints such that each quad contains exactly four memory banks m0, m1, m2 and m3 in a consistent order

When the stage number s is even [footnote:], then Pev within any group of four will be either (0,1,0,1) or (1,0,1,0) depending on the (constant 0 or 1) parity of the non-toggling bits. Why? Because we know that bit ds within any group of four always counts (0,1,0,1) while the remaining (non-toggling) even bits remain constant. Similarly, Pod will be either (0,0,1,1) or (1,1,0,0). Bank number PodPev for a quad can thus be one of the following four groups:

  • (00,01,10,11);
  • (01,00,11,10);
  • (10,11,00,01); or
  • (11,10,01,00).
But we want to canonicalize the bank order such that bank numbers always follow the same sequence e.g. (00,01,10,11) for every aligned quad within the stage. To do this, we must conspire to e.g. make Pev count only (0,1,0,1) within each aligned group of four and never (1,0,1,0). Similarly, Pod must always count e.g. (0,0,1,1) and never (1,1,0,0).

Suppose, that, when stage number s is even and when Pev is about to count (1,0,1,0), we change ds (a component of Pev because s is even) such that it counts (1,0,1,0) instead of the usual (0,1,0,1). In other words, instead of d=(d1,d3,d5,d7), where d1 goes (0,1,0,1) we change the sequence to d'=(d3,d1,d7,d5), so that d'1 goes (1,0,1,0). Then P'ev for the new sequence d' will count in the desired order (0,1,0,1).

In general, is it okay to change the sequence like this when needed? Yes it is because 1) within any given operand pair, the order is unimportant, thus as far as the FFT is concerned, (d3,d1,d7,d5) is the same as (d1,d3,d5,d7); and 2) the net effect of changing the polarity of the least-significant toggle bit ds is simply to swap the order of operands in a given pair.

How do we know when Pev is going to count the wrong direction such that ds will need to change? Simple:

 if ds counts   and Pev counts    then do this    or in other words
    0,1,0,1        0,1,0,1          ds' =  ds         ds' = Pev
    0,1,0,1        1,0,1,0          ds' = ~ds         ds' = Pev
    1,0,1,0        0,1,0,1          ds' =  ds         ds' = Pev
    1,0,1,0        1,0,1,0          ds' = ~ds         ds' = Pev

In other words, we simply replace ds with Pev to form the new datapoint sequence d'.    P'ev for the new sequence order will always be (0,1,0,1,   0,1,0,1,   ...)

We also need to canonicalize Pod such that it always counts (0,0,1,1) and never (1,1,0,0). For this, we note when Pod is about to count (1,1,0,0), and then we change ds+1 (a component of Pod because s is even) such that it counts (1,1,0,0) instead of the usual (0,0,1,1). In other words, instead of (d1,d3,d5,d7), where d2 goes (0,0,1,1) we would change the sequence to (d5,d7,d3,d1), so that d2 goes (1,1,0,0).

In general, is it okay to change the sequence like this when needed? Yes it is because 1) within any given stage, the order of the operand pairs is unimportant; as far as the FFT is concerned, (d5,d7,d1,d3) is the same as (d1,d3,d5,d7); and 2) the net effect of changing the polarity of the toggle bit ds+1 is simply to swap the order of operand pairs in a given quad. Following logic similar to that used for Pev, we find that we can canonicalize the order of P'od by replacing ds+1 with Pod.

Using the transformation described above, then, when s is even we can change any data sequence d into an equally-valid data sequence d' such that we achieve a consistent bank ordering (m0,m1,m2,m3,   m0,m1,m2,m3,   ...)    And it turns out that, using the same transformation when s is odd, we achieve a consistent bank ordering (m0,m2,m1,m3,   m0,m2,m1,m3,   ...)

Summary

Somewhat more formally, then, let's define Pi such that

  • Pi = Pev when i is even, and
  • Pi = Pod when i is odd.
Now we see that WE CAN ALWAYS REPLACE ds+1 AND ds WITH Ps+1 AND Ps RESPECTIVELY TO CANONICALIZE THE BANK ORDER. In particular, given this transformation, banks in an aligned even-stage quad will always count (00,01,10,11) and the banks in an aligned odd-stage quad will always count (00,10,01,11).

In our example, (d1,d3,d5,d7) is part of an odd stage; Pod=(0,1,0,1) and Pev=(1,1,0,0). Because Pev counts incorrectly (1,1,0,0) instead of (0,0,1,1), we replace ds+1 with Pev to come up with a new ordering (d5,d7,d1,d3). Now Pod=(0,1,0,1) and Pev=(0,0,1,1) and final bank number assignments read (00,10,01,11), which is the correct canonical order for an odd stage.

This is the core of the proof. However, we still need to place it in a more formal context. Also, we must discuss:

  • the special cases for the boundary condition when s and s+1 wrap around at the end, that is when s = S-1.
  • how the proof extends to any power-of two values for B, D or R maybe;
  • and, for advanced work, we should mention how the proof extends to any pipeline depth.

Boundary condition: non-adjacent toggle bits

The final-stage problem of non-adjacent toggle bits

Note that our insight regarding quads changes slightly in the final stage of operation, when s=S-1. In all previous stages up to that point, the toggling bits are adjacent (see example below).

The Stage-3 problem of non-adjacent toggle bits
Stage 0 Stage 1 Stage 2 Stage 3




Quad 0 d00=0000 d00=0000 d00=0000 d00=0000
d01=0001 d02=0010 d04=0100 d08=1000
d02=0010 d04=0100 d08=1000 d01=0001
d03=0011 d06=0110 d12=1100 d09=1001




Quad 1 d04=0100 d08=1000 d01=0001 d02=0010
d05=0101 d10=1010 d05=0101 d10=1010
d06=0110 d12=1100 d09=1001 d03=0011
d07=0111 d14=1110 d13=1101 d11=1011




... ... ... ... ...
Quad 3 d14=1110 d13=1101 d11=1011 d07=0111
d15=1111 d15=1111 d15=1111 d15=1111

At the final stage s=S-1 (Stage 3 in the example), however, the toggling bits wrap around, such that the toggling bits are the least- and most-significant bits (LSB and MSB), i.e. bits d0 and ds instead of the usual bits ds+1 and ds (because in this case ds+1 does not exist).

This wrap works fine if the number of bits is even, in which case d0 is naturally part of Pev and ds part of Pod, same as in the case where the toggling bits are adjacent. When the number of bits is odd, however, we run into trouble, because now both of the toggling bits in stage S-1 (bits d0 and ds) are even.

Is there an easy way to solve this problem?

Solving the final-stage problem of non-adjacent toggle bits

Suppose we rewrite the Stage 3 addresses in the following way, rotating the lower three bits such that the toggling bits are once again adjacent, e.g. in our example we leave the most-significant bit (bit 3) alone and right-rotate the lower three bits, that is

d3d2d1d0 β†’ d3d0d2d1
where the toggling bits are the MSB d3 and the LSB d0.
Solving the Stage-3 problem of non-adjacent toggle bits.
...................................(Stages 0, 1, 2, same as before)............................. (new/rotated)
Stage 0 Stage 1 Stage 2 Stage 3




Quad 0 d00=0000 d00=0000 d00=0000 d00=0000
d01=0001 d02=0010 d04=0100 d08=1000
d02=0010 d04=0100 d08=1000 d04=0100
d03=0011 d06=0110 d12=1100 d12=1100




Quad 1 d04=0100 d08=1000 d01=0001 d01=0001
d05=0101 d10=1010 d05=0101 d09=1001
d06=0110 d12=1100 d09=1001 d05=0101
d07=0111 d14=1110 d13=1101 d13=1101




... ... ... ... ...
Quad 3 d14=1110 d13=1101 d11=1011 d07=0111
d15=1111 d15=1111 d15=1111 d15=1111

Now we can be sure that the toggling bits are adjacent and thus they will belong to separate parity groups, one as part of parity-even and one as part of parity-odd. But is the rewrite legal?

For the FFT to get a correct result, the only requirement for Stage 3 is that the pairwise operands op1 and op2 be separated by 2s = 8, that is

|op2 - op1| = 2s = 8
By preserving the fastest-toggling bit d3 as the MSB we guarantee that, within each operand pair,
  1. The MSB toggles 0β†’1 or 1β†’0, and
  2. the remaining bits (all of which are toggling more slowly than once per pair) are constant, and thus
  3. the relationship between the two operands op1 and op2 in each pair is such that (|op2 - op1| == 8)

Example for one butterfly unit (B=1)

Below are two examples of the algorithm at work, calculating the schedules for Stages 0 and 3 of a sixteen-point transform using one radix-2 butterfly unit.

Example mapping for stage s=0 of a transform for D=16 datapoints and one butterfly (B=1)

Example mapping for stage s=0 of a transform for D=16 datapoints and one butterfly (B=1)
Datapoint
   d=d3d2d1d0   
Reorder 1 (no change)
d' = d3d2d1d0
Parity
   P'1,P'0   
Reorder 2 (replace)
d" = d'3d'2P'1P'0
Parity
  P"1,P"0  
Memory bank
   m = P"1P"0   
d00= 0 0 0 0
d01= 0 0 0 1
d02= 0 0 1 0
d03= 0 0 1 1
0 0 0 0
0 0 0 1
0 0 1 0
0 0 1 1
0,0
0,1
1,0
1,1
0 0 0 0 = d00
0 0 0 1 = d01
0 0 1 0 = d02
0 0 1 1 = d03
0,0
0,1
1,0
1,1
m0
m1
m2
m3
d04= 0 1 0 0
d05= 0 1 0 1
d06= 0 1 1 0
d07= 0 1 1 1
0 1 0 0
0 1 0 1
0 1 1 0
0 1 1 1
0,1
0,0
1,1
1,0
0 1 0 1 = d05
0 1 0 0 = d04
0 1 1 1 = d07
0 1 1 0 = d06
0,0
0,1
1,0
1,1
m0
m1
m2
m3
d08= 1 0 0 0
d09= 1 0 0 1
d10= 1 0 1 0
d11= 1 0 1 1
1 0 0 0
1 0 0 1
1 0 1 0
1 0 1 1
1,1
1,0
0,1
0,0
1 0 1 1 = d11
1 0 1 0 = d10
1 0 0 1 = d09
1 0 0 0 = d08
0,0
0,1
1,0
1,1
m0
m1
m2
m3
d12= 1 1 0 0
d13= 1 1 0 1
d14= 1 1 1 0
d15= 1 1 1 1
1 1 0 0
1 1 0 1
1 1 1 0
1 1 1 1
1,0
1,1
0,0
0,1
1 1 1 0 = d14
1 1 1 1 = d15
1 1 0 0 = d12
1 1 0 1 = d13
0,0
0,1
1,0
1,1
m0
m1
m2
m3

Example mapping for final stage s=S-1 (Stage 3) of a transform for D=16 datapoints and one butterfly (B=1)

Example mapping for final stage s=S-1 (Stage 3) of a transform
for D=16 datapoints and one butterfly (B=1)
Datapoint
   d=d3d2d1d0   
Reorder 1 (rotate)
d' = d3d0d2d1
Parity
   P'1,P'0   
Reorder 2 (replace)
d" = P'3P'2d'1d'0
Parity
  P"1,P"0  
Memory bank
   m = P"1P"0   
d00= 0 0 0 0
d08= 1 0 0 0
d01= 0 0 0 1
d09= 1 0 0 1
0 0 0 0
1 0 0 0
0 1 0 0
1 1 0 0
0,0
1,0
0,1
1,1
0 0 0 0 =d00
1 0 0 0 =d08
0 1 0 0 =d04
1 1 0 0 =d12
0,0
1,0
0,1
1,1
m0
m1
m2
m3
d02= 0 0 1 0
d10= 1 0 1 0
d03= 0 0 1 1
d11= 1 0 1 1
0 0 0 1
1 0 0 1
0 1 0 1
1 1 0 1
0,1
1,1
0,0
1,0
0 1 0 1 =d05
1 1 0 1 =d13
0 0 0 1 =d01
1 0 0 1 =d09
0,0
1,0
0,1
1,1
m0
m1
m2
m3
d04= 0 1 0 0
d12= 1 1 0 0
d05= 0 1 0 1
d13= 1 1 0 1
0 0 1 0
1 0 1 0
0 1 1 0
1 1 1 0
1,0
0,0
1,1
0,1
1 0 1 0 =d10
0 0 1 0 =d02
1 1 1 0 =d14
0 1 1 0 =d06
0,0
1,0
0,1
1,1
m0
m1
m2
m3
d06= 0 1 1 0
d14= 1 1 1 0
d07= 0 1 1 1
d15= 1 1 1 1
0 0 1 1
1 0 1 1
0 1 1 1
1 1 1 1
1,1
0,1
1,0
0,0
1 1 1 1 =d15
0 1 1 1 =d07
1 0 1 1 =d11
0 0 1 1 =d03
0,0
1,0
0,1
1,1
m0
m1
m2
m3

Algorithm 1: One radix-2 butterfly (R=2 and B=1)

Given a transform with D datapoints such that D is a power of two and D β‰₯ 4; [SR:] if the algorithm is to work correctly for one butterfly (B=1), we should have four memory banks (M=4) numbered 0 to 3, or 002 to 112 i.e. m1m0={002,012,102,112} i.e. m0=002, m1=012, m2=102 and m3=112.

What we're gonna do.

The following equation(s) will show how to transform a given stage's datapoint sequence

(dp(D-1) dp(D-2)...dp1,dp0)


into an equivalent sequence

(dp'(D-1) dp'(D-2)...dp'1,dp'0)
such that each datapoint dp'i maps to a memory bank mbi where the corresponding memory bank sequence
(mb'(D-1) mb'(D-2)...mb'0,mb'1)
is the repeating series
(m0,m1,m2,m3, m0,m1,m2,m3 ... m0,m1,m2,m3),
and m0 = memory bank 002, m1 = memory bank 012, m2 = memory bank 102 and m3 = memory bank 112.

So:

Step 1: dp β†’ dp'

Each datapoint dps,i in a given transform stage s βˆˆ (0 .. S-1) is an S-bit number

dps,i = dS-1dS-2...d1d0
where i βˆˆ (0..D-1).

If s = S-1 (final stage), we will replace each datapoint

dps,i = dS-1dS-2...d1d0
with a new datapoint
dp's,i = dS-1d0dS-2...d1
such that the lower S-2 bits of the original datapoint have been right-rotated by one bit.

For all other stages dp' is the same as dp. In other words:

     dp's,i = dps,i βˆ€ s ∈ (0 .. S - 2) and i ∈ (0 .. D-1)
     dp's,i = dS-1d0dS-2...d1 βˆ€ s = (S - 1) and i ∈ (0 .. D-1)

Step 2: dp' β†’ dp"

Having made this first transformation

dp's,i = Ζ’(dps,i),
in Step 1 above, we can now redefine each dp's,i as a new datapoint dp"s,i
dp"s,i = d"S-1d"S-2...d"1d"0
where
    if s ∈ (0 .. S-2) then      d"s = P's    and d"(s+1) = P'(s+1)    and all other d"j = d'j , where j ∈ (0..S-1) and j βˆ‰ {s,s+1}
    else (final stage s = S-1)      d"s = P's    and d"s-1 = P's-1    and all other d"j = d'j , where j ∈ (1..S-2)

and

P'0 = P'even = Parity(d'S-2,d'S-4...d'2,d'0)
is the parity of the even bits of dp', that is all those bits d'j such that j mod 2 = 0; and
P'1 = P'odd = Parity(d'S-1,d'S-3...d'3,d'1)
is the parity of the odd bits of dp', that is all those bits d'j such that j mod 2 = 1; and
    P'k = P'0 = P'even    if k is even and
    P'k = P'1 = P'odd    if k is odd

Step 3?: m = P"1P"0 ?

OOPS maybe need to add step 3 where Memory bank m = P"1P"0   

Extension of the proof to more than one butterfly unit (B>1)

Having solved the problem for B=1 and M=4, the extension to larger values of B is fairly straightforward.

For our algorithm to work, there must always be four memory banks for each butterfly unit. For one butterfly (B=1) we had four memory banks (M=4) and thus two toggling bits. When there are two butterfly units (B=2) we have eight memory banks (M=8) and thus three toggling bits, and so on, adding one more toggling bit for each doubling of B.

Following the logic we used to develop the algorithm for B=1 above, we can continue to define d' = d for those stages where the toggling bits are adjacent and do not wrap around the high end of the address, as in e.g. stages 0 and 1 of a two-butterfly implementation of a 16-point transform with three toggling bits.

<caption>With three toggle bits, non-adjacency begins in Stage 2</caption>
Stage 0 Stage 1 Stage 2 Stage 3




Oct 0 d00=0000 d00=0000 d00=0000 d00=0000
d01=0001 d02=0010 d04=0100 d08=1000
d02=0010 d04=0100 d08=1000 d01=0001
d03=0011 d06=0110 d12=1100 d09=1001
d04=0100 d08=1000 d01=0001 d02=0010
d05=0101 d10=1010 d05=0101 d10=1010
d06=0110 d12=1100 d09=1001 d03=0011
d07=0111 d14=1110 d13=1101 d11=1011




... ... ... ... ...
Oct 1 d14=1110 d13=1101 d11=1011 d07=0111
d15=1111 d15=1111 d15=1111 d15=1111

Note, extending the introductory explanation to larger values of B, because we have eight memory banks we now have two "octs" (groups of eight) with three toggle bits each instead of four "quads" (groups of four) with two toggle bits.

Stages 2 and 3 now both have non-adjacent toggle bits, and in each of these stages we have to wrap the lower address bits such that the low toggling bits adjoin the high toggling bits, as in the example below.

<caption>With three toggle bits, non-adjacency must be corrected in Stages 2 and 3</caption>
(Stages 0 and 1 same as before) (New/rotated Stages 2 and 3)
Stage 0 Stage 1 Stage 2 Stage 3




Oct 0 d00=0000 d00=0000 d00=0000 d00=0000
d01=0001 d02=0010 d04=0100 d08=1000
d02=0010 d04=0100 d08=1000 d02=0010
d03=0011 d06=0110 d12=1100 d10=1010
d04=0100 d08=0001 d02=0010 d04=0100
d05=0101 d10=0011 d06=0110 d12=1100
d06=0110 d12=0101 d10=1010 d06=0110
d07=0111 d14=0111 d14=1110 d14=1110




... ... ... ... ...
Oct 1 d14=1110 d13=1101 d11=1011 d07=0111
d15=1111 d15=1111 d15=1111 d15=1111

Our original discussion for a one-butterfly FFT talked about quads mapping to four memory banks. For more than one butterfly we update this discussion with octs that map to eight memory banks, in the case of two butterfly units, or sixteens mapping to sixteen banks when B=4, etc. The algorithm works so long as there are four memory banks for every butterfly unit, i.e. M=4B.

Similarly, when B=2 (two butterflies) we have three toggle bits instead of two, and when B=4 we have four toggle bits, and so on. And instead of simply having odd and even parity, as in the original case where B=1, we need to define some new terms for parity, that is

  • P30 = the parity of every third bit including bit b0, i.e. P30 = XOR(b0, b3, b6, ...)
  • P31 = the parity of every third bit including bit b1, i.e. P31 = XOR(b1, b4, b7, ...)
or, more generally,
  • Pni = the parity of every nth bit bj such that (j mod n) = (i mod n)
Note that, by this definition,
  • Peven = P20 = P22 = P24 = ...
  • Podd = P21 = P23 = P25 = ...
  • P30 = P33 = P36 = P39 = ...
  • and so on
In the case where there was one butterfly unit (B=1), we could canonicalize the bank order by replacing the two toggle bits ti and ti+1 with parity bits P2i and P2i+1.

In the case where there are two butterfly units (B=2), there are now three toggle bits ti, ti+1 and ti+2; to canonicalize the bank order (eight banks now instead of four), we replace these toggle bits with parity bits P3i, P2i+1 and P2i+2

In general, then, when B=2n, there will be 4B memory banks and n+1 toggle bits titi+1ti+2..ti+n which we replace with parity bits PniPni+1Pni+2..Pni+n in order to remap the operand addresses such that each sequential group of M operands in a given stage have the exact same memory-bank sequence.

Example for more than one butterfly unit (B=2)

Below are two examples of the algorithm at work, calculating the schedules for Stages 0 and 3 of a sixteen-point transform using two radix-2 butterfly units.

Example mapping for stage s=0 of a transform for D=16 datapoints and two butterflies (B=2)

<caption>Example mapping for stage s=0 of a transform for D=16 datapoints and two butterflies (B=2)</caption>
Datapoint
   d=d3d2d1d0   
Reorder 1 (no change)
d' = d3d2d1d0
Parity (P=P3)
   P'2,P'1,P'0   
Reorder 2 (replace)
d" = d'3P'2P'1P'0
Parity (P=P3)
  P"2,P"1,P"0   ! Memory bank
   m = P"2P"1P"0   
d00= 0 0 0 0
d01= 0 0 0 1
d02= 0 0 1 0
d03= 0 0 1 1
0 0 0 0
0 0 0 1
0 0 1 0
0 0 1 1
0,0,0
0,0,1
0,1,0
0,1,1
0 0 0 0 = d00
0 0 0 1 = d01
0 0 1 0 = d02
0 0 1 1 = d03
0,0,0
0,0,1
0,1,0
0,1,1
m0
m1
m2
m3
d04= 0 1 0 0
d05= 0 1 0 1
d06= 0 1 1 0
d07= 0 1 1 1
0 1 0 0
0 1 0 1
0 1 1 0
0 1 1 1
1,0,0
1,0,1
1,1,0
1,1,1
0 1 0 0 = d04
0 1 0 1 = d05
0 1 1 0 = d06
0 1 1 1 = d07
1,0,0
1,0,1
1,1,0
1,1,1
m4
m5
m6
m7
d08= 1 0 0 0
d09= 1 0 0 1
d10= 1 0 1 0
d11= 1 0 1 1
1 0 0 0
1 0 0 1
1 0 1 0
1 0 1 1
0,0,1
0,0,0
0,1,1
0,1,0
1 0 0 1 = d09
1 0 0 0 = d08
1 0 1 1 = d11
1 0 1 0 = d10
0,0,0
0,0,1
0,1,0
0,1,1
m0
m1
m2
m3
d12= 1 1 0 0
d13= 1 1 0 1
d14= 1 1 1 0
d15= 1 1 1 1
1 1 0 0
1 1 0 1
1 1 1 0
1 1 1 1
1,0,1
1,0,0
1,1,1
1,1,0
1 1 0 1 = d13
1 1 0 0 = d12
1 1 1 1 = d15
1 1 1 0 = d14
1,0,0
1,0,1
1,1,0
1,1,1
m4
m5
m6
m7

Example mapping for final stage s=S-1 (Stage 3) of a transform for D=16 datapoints and two butterflies (B=2)

<caption>Example mapping for final stage s=S-1 (Stage 3) of a transform<br /> for D=16 datapoints and two butterflies (B=2)</caption>
Datapoint
   d=d3d2d1d0   
Reorder 1 (rotate)
d' = d3d1d0d2
Parity (P=P3)
   P'2,P'1,P'0   
Reorder 2 (replace)
d" = P'3P'2P'1d'0
Parity (P=P3)
  P"2,P"1,P"0   ! Memory bank
   m = P"2P"1P"0   
d00= 0 0 0 0
d08= 1 0 0 0
d01= 0 0 0 1
d09= 1 0 0 1
0 0 0 0
1 0 0 0
0 0 1 0
1 0 1 0
0,0,0
0,0,1
0,1,0
0,1,1
0 0 0 0 =d00
1 0 0 0 =d08
0 0 1 0 =d02
1 0 1 0 =d10
0,0,0
0,0,1
0,1,0
0,1,1
m0
m1
m2
m3
d02= 0 0 1 0
d10= 1 0 1 0
d03= 0 0 1 1
d11= 1 0 1 1
0 1 0 0
1 1 0 0
0 1 1 0
1 1 1 0
1,0,0
1,0,1
1,1,0
1,1,1
0 1 0 0 =d04
1 1 0 0 =d12
0 1 1 0 =d06
1 1 1 0 =d14
1,0,0
1,0,1
1,1,0
1,1,1
m4
m5
m6
m7
d04= 0 1 0 0
d12= 1 1 0 0
d05= 0 1 0 1
d13= 1 1 0 1
0 0 0 1
1 0 0 1
0 0 1 1
1 0 1 1
0,0,1
0,0,0
0,1,1
0,1,0
1 0 0 1 =d09
0 0 0 1 =d01
1 0 1 1 =d11
0 0 1 1 =d03
0,0,0
0,0,1
0,1,0
0,1,1
m0
m1
m2
m3
d06= 0 1 1 0
d14= 1 1 1 0
d07= 0 1 1 1
d15= 1 1 1 1
0 1 0 1
1 1 0 1
0 1 1 1
1 1 1 1
1,0,1
1,0,0
1,1,1
1,1,0
1 1 0 1 =d13
0 1 0 1 =d05
1 1 1 1 =d15
0 1 1 1 =d07
1,0,0
1,0,1
1,1,0
1,1,1
m4
m5
m6
m7

Algorithm 2: More than one radix-2 butterfly (R=2 and B>1)

Given a transform with D datapoints such that D is a power of two and D β‰₯ 4; [SR:] if the algorithm is to work correctly for any power-of-two number of butterfly units B=2n, we should have 4B memory banks (M=4B) numbered 0 to (4B-1), or 00...002 to 11...112 i.e.

  Bank Name  
 
  Bank Number  
m
Bank Bits
mn+1mnmn-1...m0
m0 (m=0)    00...002   =   0
m1 (m=1)    00...012   =   1
↓ ↓    ↓   
mn (m=4B-2)    11...102   =   4B-2
m(n+1) (m=4B-1)    11...112   =   4B-1

Each operand address dps,i in a given stage s has T toggle bits where T=log2M

What we're gonna do (B>1)

The following equation(s) will show how to transform a given stage's datapoint sequence

(dp(D-1) dp(D-2)...dp1,dp0)


into an equivalent sequence

(dp'(D-1) dp'(D-2)...dp'1,dp'0)
such that each datapoint dp'i maps to a memory bank mbi where the corresponding memory bank sequence
(mb'(D-1) mb'(D-2)...mb'0,mb'1)
is the repeating series
(
(m0,m1,m2 ... m(4B-1)),
(m0,m1,m2 ... m(4B-1)),
(m0,m1,m2 ... m(4B-1)),
...)
and m0 = memory bank 00...002, m1 = memory bank 00...012, m(4B-2) = memory bank 11...102 and m(4B-1) = memory bank 11...112.

So:

Step 1: dp β†’ dp' (B>1)

Each datapoint dps,i in a given transform stage s βˆˆ (0 .. S-1) is an S-bit number

dps,i = dS-1dS-2...d1d0
where i βˆˆ (0..D-1).

If s ∈ (S-(T-1)..S-1) (i.e. final T-1 stages), we will replace each datapoint

dps,i = (dS-1dS-2...ds)(ds-1ds-2...dT-1)(dT-2dT-3...d0)
with a new datapoint
dp's,i = (dS-1dS-2...ds)(dT-2dT-3...d0)(ds-1ds-2...dT-1)
such that the lower s-1 bits of the original datapoint have been right-rotated by (T + (s-S)) bits.

For all other stages dp' is the same as dp. In other words:

     dp's,i = dps,i βˆ€ s ∈ (0 .. S-T) and i ∈ (0 .. D-1)
     dp's,i = (dS-1dS-2...ds)(dT-2dT-3...d0)(ds-1ds-2...dT-1) βˆ€ s ∈ (S - (T-1) .. S-1) and i ∈ (0 .. D-1)

Step 2: dp' β†’ dp" (B>1)

Having made this first transformation

dp's,i = Ζ’(dps,i),
in Step 1 above, we can now redefine each dp's,i as a new datapoint dp"s,i
dp"s,i = d"S-1d"S-2...d"1d"0
where
    if s ∈ (0 .. S-T) then      d"sd"s+1d"s+2..d"s+(T-1) = P'sP's+1P's+2..P's+(T-1)    and all other d"j = d'j
    else (final T-1 stages)      d"sd"s -1d"s -2..d"s -(T-1) = P'sP's -1P's -2..P's -(T-1)    and all other d"j = d'j

and

P'i = PTi = the parity of every T th bit d'j such that (j mod T) = (i mod T)
e.g. when T=3 then
P'0 = P30 = XOR(d'0, d'3, d'6, ...)
P'1 = P31 = XOR(d'1, d'4, d'7, ...)
P'2 = P32 = XOR(d'2, d'5, d'8, ...)
P'3 = P30 = XOR(d'0, d'3, d'6, ...)
etc.

Extension to radix R>2, longer pipelines

Extension to R>2

I think the case for radix R>2 is exactly analogous to the case where B>1. That is, for e.g. a one-butterfly radix-four operation (B=1 and R=4), the operands are grouped by four in exactly the way that they would be grouped for a two-butterfly radix-two operation (B=2 and R=2). In other words, the grouping for [TODO/FIXME]

For radix-2 operation we needed M=4B memory banks for groups of 4B operands. In the general case, where R can be any power of 2 greater than or equal to 2, we need M=4B(R-2) memory banks. Note also that S and s must change appropriately; the remainder of the proof for radix R>2 is left as an exercise to the reader.

Extension to longer pipelines.

Oops still need to write something here maybe.

Nomenclature and Terminology

Nomenclature (Symbols)

We will try to use symbols consistently throughout this writeup, as described by this alphabetically-ordered reference table.

Italicized symbols (B, D, d, M) represent numbers, non-italicized symbols (d0, m3) are names.

lower-case symbols (d, m) are variable, UPPER-CASE symbols (D, M) are constant (for a given FFT).

Symbol Meaning Notes
B How many butterfly units A power of two, e.g. 8, 16, 1024 or 8096. To complete a transform, an FFT with e.g. one butterfly unit (B=1) takes about four times as long as an FFT with four butterfly units (B=4).
D How many datapoints Number of datapoints to transform. A power of two, e.g. 8, 16, 1024 or 8096.
d Datapoint An eight-point transform has datapoints d=0 through d=7.
di A specific datapoint A specific datapoint i. For example, d4 is the datapoint d such that d == 4.
An eight-point transform has datapoints d0 through d7 (see sched(s) below).
di Datapoint bit number i d0 is always the least-significant bit (LSB). Example: datapoint d13 has four bits d3d2d1d0 = 11012   
dps,i Datapoint i in stage s Same as sched(s)i (see below). Each datapoint dps,i in a given transform stage s βˆˆ (0 .. S-1) is an S-bit number
dps,i = dS-1dS-2...d1d0, where i βˆˆ (0..D-1).
dpi dps,i with implied s Shorthand for dps,i where exact stage s is implied by surrounding context.
i, j, k, n Unknown integer We usually start numbering from i=0
M How many memory banks For purposes of this proof M == 4B always.
m Memory bank Generally need four memory banks per butterfly e.g. m=0 through m=7 for two butterflies (B=2).
mi A specific memory bank Generally need four memory banks per butterfly e.g. m0 through m7 for two butterflies (B=2).
mi Memory bank bit number i Generally need four memory banks per butterfly e.g. m2m1m0={000,001 ... 111} for two butterflies (B=2).
Mbits Bits per bank number How many bits in each mem bank number m; Mbits = log2(M)
mbs,i Memory bank of dps,i The memory bank corresponding to datapoint i of stage s.
mbi mbs,i with implied s Shorthand for mbs,i where exact stage s is implied by surrounding context. For a one-butterfly eight-point transform, our algorithm results in three stages
Stage 0: (mb0,mb1,mb2,mb3,mb4,mb5,mb6,mb7) = (m0,m1,m2,m3, m0,m1,m2,m3)
Stage 1: (mb0,mb1,mb2,mb3,mb4,mb5,mb6,mb7) = (m0,m2,m1,m3, m0,m2,m1,m3)
Stage 2: (mb0,mb1,mb2,mb3,mb4,mb5,mb6,mb7) = (m0,m1,m2,m3, m0,m1,m2,m3)
Pod,
Podd
Parity of odd bits Example: decimal 13 has four bits d3d2d1d0 = 1101; thus Pod is parity(d3,d1) = parity(1,0) = 1
Note Podd is the same as P21 (below).
Pev,
Peven
Parity of even bits Example: decimal 13 has four bits d3d2d1d0 = 1101; thus Pev is parity(d2,d0) = parity(1,1) = 0
Note Peven is the same as P20 (below).
Pni Parity of periodic bits The parity of every nth bit bj such that (j mod n) = (i mod n)
Example: P30 = the parity of every third bit including bit b0, i.e. P30 = XOR(b0, b3, b6, ...)
Note that, by this definition, P30 = P33 = P36 = P39 and so on.
R Radix Default is 2
S How many stages Number of stages in a transform, e.g. an eight-point transform has three stages, so S=3
s Stage number An eight-point transform has three stages s=0, s=1 and s=2.
sched(s) Schedule for stage s The FFT schedule for stage s of a transform. E.g. for an 8-point transform there are three stages
   sched(0) = (d0,d1,d2,d3,d4,d5,d6,d7)
   sched(1) = (d0,d2,d4,d6,d1,d3,d5,d7) and
   sched(2) = (d0,d4,d1,d5,d2,d6,d3,d7)
 sched(s)i  i th element of sched(s) The i th element of sched(s), e.g. in an 8-point transform, sched(1)1 = d2 (see above).
T How many toggle bits The number of toggle bits in each datapoint address. T == log2M
ti Toggle bit number i t0 is always the fastest-toggling (least significant) toggle bit. Example: a one-butterfly eight-point transform has two toggle bits. In Stage 0, t1t0 = d1d0; in Stage 1, t1t0 = d2d1; and in Stage 2, t1t0 = d0d2.

Terminology

Term Meaning Notes
LSB Least-significant bit The least-significant bit in the binary representation of a given number. All even integers have LSB = 0 and all odd integers have LSB = 1.
MSB Most-significant bit The most-significant bit of a number depends on the number of bits used to represent the binary form of that number. E.g. if 2010 is represented using five bits 101002, the MSB is 1, but if it is represented as eight bits 0001 01002 the MSB is 0.
oct? group of eight? aligned vs. unaligned?
quad? group of four? aligned vs. unaligned?
others?

Appendix: Perl program that implements the algorithm

The algorithm as implemented in the Perl test program (FIXME/TODO: NOTE BENE THIS IS NOT CORRECT, NEEDS SEVERE UPDATING)

&#35; Normal treatment for stages 0 through nstages&#45;2
if ($s &#33;&#61; ($nstages&#45;1)) &#123;
    $op1 &#61; smart_replace($op1, $s)&#59;
    $op1 &#61; smart_replace($op1, $s+1)&#59;

    $op2 &#61; smart_replace($op2, $s)&#59;
    $op2 &#61; smart_replace($op2, $s+1)&#59;
&#125;

&#35; Last stage nstages&#45;1 needs special treatment.
elsif ($s%2 &#61;&#61; 1) &#123;

    &#35; Even number of bits&#59; use zero instead of s+1 maybe.
    $op1 &#61; smart_replace($op1, $s)&#59;
    $op1 &#61; smart_replace($op1, 0)&#59;

    $op2 &#61; smart_replace($op2, $s)&#59;
    $op2 &#61; smart_replace($op2, 0)&#59;
&#125;
else &#123;
    &#35; Odd number of bits&#59; gotta do sumpm crazy maybe.
    $op1 &#61; weird_replace($op1, $s)&#59;
    $op2 &#61; weird_replace($op2, $s)&#59;
&#125;

&#35; Swizzle to assign correct bank &amp; row to $op1, $op2
my ($b1, $r1) &#61; swizzler&#58;&#58;swizzle($op1, $npoints, 4&#42;$nunits)&#59;
my ($b2, $r2) &#61; swizzler&#58;&#58;swizzle($op2, $npoints, 4&#42;$nunits)&#59;

&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;
sub smart_replace &#123;
    &#35; In the given number $n, replace bit $i with new bit $b
    &#35; If $i is  odd, new bit $b is the parity of the  odd bits of $n
    &#35; If $i is even, new bit $b is the parity of the even bits of $n
    &#35;
    my $n &#61; shift&#59;
    my $i &#61; shift&#59;

    my $newbit &#61; $i%2 ? parity_odd($n) &#58; parity_even($n)&#59;
    $n &#61; replace_bit($n, $i, $newbit)&#59;    &#35; Put $newbit into $n at position $i
    return $n&#59;
&#125;

sub weird_replace &#123;
    &#35; Given number $n and ultimate stage $s, where $s is even, do some weird $&#35;&#33;?
    my $n &#61; shift&#59;
    my $s &#61; shift&#59;

    &#35; This is supposed to work at least for when nbits(datapoints) is odd.  i think.
    my $b0 &#61; ($n &amp; 0x1)&#59;
    my $b1 &#61; ($n &amp; 0x2) &gt;&gt; 1&#59;

    my $new_top_bit &#61; parity_even($n) ^ $b1&#59;
    my $new_b1 &#61; parity_odd($n) ^ $b1 ^ $b0&#59;
    my $new_b0 &#61; $b1 ^ $b0&#59;  &#35; Complicated enough yet&#33;?

    $n &#61; replace_bit($n,$s,$new_top_bit)&#59; &#35; Put $new_top_bit into $n at position $s
    $n &#61; replace_bit($n, 1,$new_b1)&#59;      &#35; Put $new_b1 into $n at position 1
    $n &#61; replace_bit($n, 0,$new_b0)&#59;      &#35; Put $new_b0 into $n at position 0 (LSB right?)

    return $n&#59;
&#125;

Major caveat(s), to-do and notes

Caveat: The Stanford FFT generator, which I authored, may not yet actually use the algorithm that results directly from this proof. But I plan to do that eventually...!

Todo:

  • present to the group and invite comments including where/whether to submit to journal, conference, etc.
  • update the Perl program in the appendix.
  • change to latex format for publication (replace html page with a latex page).
  • make sure FFT generator (correctly) uses algorithm described in the formal proof!

Other Notes, Trash and Auxiliary Info

Projects: FFT Formal Proof v0 (old) - Old formal proof with separate boundary conditions for odd and even number of bits, etc.

Projects: FFT Formal Proof - Auxiliary Info

Projects: FFT_Formal_Proof - Auxiliary_Info#Trash

Original hand-written notes

Hand-written notes where the proof originally was developed:

https://vlsiweb.stanford.edu/mediawiki/index.php/File:Formal_proof_notes.pdf

⚠️ **GitHub.com Fallback** ⚠️