Projects: FFT Formal Proof - StanfordVLSI/Genesis2 GitHub Wiki
I mean, sure, you can read it if you want to, but it's pretty awful. All the information is there, but it's just not arranged very well (IMO as the author). I'm starting all over with a second attempt here:
Meanwhile, continue reading this paper if you like, but read on at your own risk...paper continues below...A detailed description of the problem we want to solve can be found in the 2013 VLSI-SoC paper
To summarize the problem here: take a standard schedule of operations for a Fast Fourier Transform processor (FFT) based on a radix-2 Cooley-Tukey algorithm [ref], e.g. for D=8 datapoints and one butterfly unit, as shown in Figure 1. (Note: for your convenience, symbols and conventions used in the ensuing discussion are elaborated in the Nomenclature section at the end of the paper). FIGURE 1That is, suppose we have eight datapoints d0 through d7 that we want to transform from the time domain to the frequency domain. In Stage 0 of operation, a radix-2 butterfly unit reads datapoints d0 and d1, producing two results that then overwrite the original locations of d0 and d1 in memory. Similar operations occur for datapoint pairs (d2,d3), (d4,d5) and (d6,d7) respectively. In Stage 1, the datapoints are processed with stride-two pairings e.g. (d0,d2), (d4,d6), (d1,d3) and (d5,d7). And finally Stage 2 completes the operation, re-walking the data by stride 4. Table 1 summarizes this access pattern.
TABLE 1 Stage 0: (d0,d1) (d2,d3) (d4,d5) (d6,d7) Stage 1: (d0,d2) (d4,d6) (d1,d3) (d5,d7) Stage 2: (d0,d4) (d1,d5) (d2,d6) (d3,d7)
If the butterfly unit is pipelined such that result writebacks occur at the same time as operand reads, then datapoints (d0,d1) get written to memory at the same time as (d2,d3) are being read from memory. For memory to be able to supply all four datapoints at once, it needs to be four-ported. Alternatively, we could try to use two separate banks of two-ported memory or even four banks of single-ported memory. Of these three choices, four banks of single-ported memory is most efficient in terms of area, so we'd like to make that work if possible.
Now: Within any stage, look at any sequential group of four operands starting on a mod-two boundary. We'll call such a group a quad. If the quad happens to start on a mod-four boundary, we'll call this an aligned quad. Stage 0 has three overlapping quads: (d0,d1,d2,d3), (d2,d3,d4,d5) and (d4,d5,d6,d7), the first and last of which are aligned quads. Our goal is to map each datapoint to one of four memory banks (m0,m1,m2,m3) such that every quad contains exactly one of each bank number m0, m1, m2 and m3. For example, we could map datapoints (d0,d1,d2,d3,d4,d5,d6,d7) to memory banks (m0,m1,m2,m3,m0,m1,m2,m3) respectively, and thus fulfill our goal for each of the three quads in Stage 0:
- (d0,d1,d2,d3) maps to (m0,m1,m2,m3)
- (d2,d3,d4,d5) maps to (m2,m3,m0,m1)
- (d4,d5,d6,d7) maps to (m0,m1,m2,m3)
We're going to fulfill our goal by enforcing an order to each aligned quad, such that every aligned quad within a stage has the same memory bank ordering e.g. (m0,m1,m2,m3). This means that every unaligned quad will also have a consistent ordering, e.g. (m2,m3,m1,m0), and thus we meet our goal of uniqueness for all quads. This property will hold whether the transform is operating on eight datapoints, with two aligned quads per stage, or eight thousand datapoints, with two thousand quads per stage.
In theory the four operations in a given stage all happen simultaneously. In practice, it's not always practical to have enough butterfly units for that. So for instance, if we only have one butterfly unit, it will have to be used four times in succession to complete the transformation for a given stage, using the following four operations:
- read, process and writeback datapoints d0 and d1;
- read, process and writeback datapoints d2 and d3;
- read, process and writeback datapoints d4 and d5;
- read, process and writeback datapoints d6 and d7.
Insight 1: We can ensure that each of the four memory banks m0, m1, m2 and m3 are represented in every aligned quad, in some order
Once again, here is our schedule for D=8 datapoints:
- Stage 0: (d0,d1) (d2,d3) (d4,d5) (d6,d7)
- Stage 1: (d0,d2) (d4,d6) (d1,d3) (d5,d7)
- Stage 2: (d0,d4) (d1,d5) (d2,d6) (d3,d7)
Given this information, we can assign banks to operands such that banknum = PodPev where Pod is the parity of ALL odd bits and Pev is the parity of ALL even bits. This means that any given sequential block of four operands will map to four separate banks.
In our example quad [(d1,d3),(d5,d7)], or [(00001,00011),(00101,00111)]
- d1: Pod(00001) = parity(0,0) = 0; Pev(00001) = parity(0,0,1) = 1; PodPev = 01; banknum m = 1
- d3: Pod(00011) = parity(0,1) = 1; Pev(00011) = parity(0,0,1) = 1; PodPev = 11; banknum m = 3
- d5: Pod(00101) = parity(0,0) = 0; Pev(00101) = parity(0,1,1) = 0; PodPev = 00; banknum m = 0
- d7: Pod(00111) = parity(0,1) = 1; Pev(00111) = parity(0,1,1) = 0; PodPev = 10; banknum m = 2
Insight 2: We can reorder the datapoints such that each quad contains exactly four memory banks m0, m1, m2 and m3 in a consistent order
When the stage number s is even [footnote:], then Pev within any group of four will be either (0,1,0,1) or (1,0,1,0) depending on the (constant 0 or 1) parity of the non-toggling bits. Why? Because we know that bit ds within any group of four always counts (0,1,0,1) while the remaining (non-toggling) even bits remain constant. Similarly, Pod will be either (0,0,1,1) or (1,1,0,0). Bank number PodPev for a quad can thus be one of the following four groups:
- (00,01,10,11);
- (01,00,11,10);
- (10,11,00,01); or
- (11,10,01,00).
Suppose, that, when stage number s is even and when Pev is about to count (1,0,1,0), we change ds (a component of Pev because s is even) such that it counts (1,0,1,0) instead of the usual (0,1,0,1). In other words, instead of d=(d1,d3,d5,d7), where d1 goes (0,1,0,1) we change the sequence to d'=(d3,d1,d7,d5), so that d'1 goes (1,0,1,0). Then P'ev for the new sequence d' will count in the desired order (0,1,0,1).
In general, is it okay to change the sequence like this when needed? Yes it is because 1) within any given operand pair, the order is unimportant, thus as far as the FFT is concerned, (d3,d1,d7,d5) is the same as (d1,d3,d5,d7); and 2) the net effect of changing the polarity of the least-significant toggle bit ds is simply to swap the order of operands in a given pair.
How do we know when Pev is going to count the wrong direction such that ds will need to change? Simple:
if ds counts and Pev counts then do this or in other words 0,1,0,1 0,1,0,1 ds' = ds ds' = Pev 0,1,0,1 1,0,1,0 ds' = ~ds ds' = Pev 1,0,1,0 0,1,0,1 ds' = ds ds' = Pev 1,0,1,0 1,0,1,0 ds' = ~ds ds' = Pev
In other words, we simply replace ds with Pev to form the new datapoint sequence d'. P'ev for the new sequence order will always be (0,1,0,1, 0,1,0,1, ...)
We also need to canonicalize Pod such that it always counts (0,0,1,1) and never (1,1,0,0). For this, we note when Pod is about to count (1,1,0,0), and then we change ds+1 (a component of Pod because s is even) such that it counts (1,1,0,0) instead of the usual (0,0,1,1). In other words, instead of (d1,d3,d5,d7), where d2 goes (0,0,1,1) we would change the sequence to (d5,d7,d3,d1), so that d2 goes (1,1,0,0).
In general, is it okay to change the sequence like this when needed? Yes it is because 1) within any given stage, the order of the operand pairs is unimportant; as far as the FFT is concerned, (d5,d7,d1,d3) is the same as (d1,d3,d5,d7); and 2) the net effect of changing the polarity of the toggle bit ds+1 is simply to swap the order of operand pairs in a given quad. Following logic similar to that used for Pev, we find that we can canonicalize the order of P'od by replacing ds+1 with Pod.
Using the transformation described above, then, when s is even we can change any data sequence d into an equally-valid data sequence d' such that we achieve a consistent bank ordering (m0,m1,m2,m3, m0,m1,m2,m3, ...) And it turns out that, using the same transformation when s is odd, we achieve a consistent bank ordering (m0,m2,m1,m3, m0,m2,m1,m3, ...)
Somewhat more formally, then, let's define Pi such that
- Pi = Pev when i is even, and
- Pi = Pod when i is odd.
In our example, (d1,d3,d5,d7) is part of an odd stage; Pod=(0,1,0,1) and Pev=(1,1,0,0). Because Pev counts incorrectly (1,1,0,0) instead of (0,0,1,1), we replace ds+1 with Pev to come up with a new ordering (d5,d7,d1,d3). Now Pod=(0,1,0,1) and Pev=(0,0,1,1) and final bank number assignments read (00,10,01,11), which is the correct canonical order for an odd stage.
This is the core of the proof. However, we still need to place it in a more formal context. Also, we must discuss:
- the special cases for the boundary condition when s and s+1 wrap around at the end, that is when s = S-1.
- how the proof extends to any power-of two values for B, D or R maybe;
- and, for advanced work, we should mention how the proof extends to any pipeline depth.
Note that our insight regarding quads changes slightly in the final stage of operation, when s=S-1. In all previous stages up to that point, the toggling bits are adjacent (see example below).
Stage 0 | Stage 1 | Stage 2 | Stage 3 | |
---|---|---|---|---|
|
|
|
|
|
Quad 0 | d00=0000 | d00=0000 | d00=0000 | d00=0000 |
d01=0001 | d02=0010 | d04=0100 | d08=1000 | |
d02=0010 | d04=0100 | d08=1000 | d01=0001 | |
d03=0011 | d06=0110 | d12=1100 | d09=1001 | |
|
|
|
|
|
Quad 1 | d04=0100 | d08=1000 | d01=0001 | d02=0010 |
d05=0101 | d10=1010 | d05=0101 | d10=1010 | |
d06=0110 | d12=1100 | d09=1001 | d03=0011 | |
d07=0111 | d14=1110 | d13=1101 | d11=1011 | |
|
|
|
|
|
... | ... | ... | ... | ... |
Quad 3 | d14=1110 | d13=1101 | d11=1011 | d07=0111 |
d15=1111 | d15=1111 | d15=1111 | d15=1111 |
At the final stage s=S-1 (Stage 3 in the example), however, the toggling bits wrap around, such that the toggling bits are the least- and most-significant bits (LSB and MSB), i.e. bits d0 and ds instead of the usual bits ds+1 and ds (because in this case ds+1 does not exist).
This wrap works fine if the number of bits is even, in which case d0 is naturally part of Pev and ds part of Pod, same as in the case where the toggling bits are adjacent. When the number of bits is odd, however, we run into trouble, because now both of the toggling bits in stage S-1 (bits d0 and ds) are even.
Is there an easy way to solve this problem?
Suppose we rewrite the Stage 3 addresses in the following way, rotating the lower three bits such that the toggling bits are once again adjacent, e.g. in our example we leave the most-significant bit (bit 3) alone and right-rotate the lower three bits, that is
- d3d2d1d0 β d3d0d2d1
...................................(Stages 0, 1, 2, same as before)............................. | (new/rotated) | |||
---|---|---|---|---|
Stage 0 | Stage 1 | Stage 2 | Stage 3 | |
|
|
|
|
|
Quad 0 | d00=0000 | d00=0000 | d00=0000 | d00=0000 |
d01=0001 | d02=0010 | d04=0100 | d08=1000 | |
d02=0010 | d04=0100 | d08=1000 | d04=0100 | |
d03=0011 | d06=0110 | d12=1100 | d12=1100 | |
|
|
|
|
|
Quad 1 | d04=0100 | d08=1000 | d01=0001 | d01=0001 |
d05=0101 | d10=1010 | d05=0101 | d09=1001 | |
d06=0110 | d12=1100 | d09=1001 | d05=0101 | |
d07=0111 | d14=1110 | d13=1101 | d13=1101 | |
|
|
|
|
|
... | ... | ... | ... | ... |
Quad 3 | d14=1110 | d13=1101 | d11=1011 | d07=0111 |
d15=1111 | d15=1111 | d15=1111 | d15=1111 |
Now we can be sure that the toggling bits are adjacent and thus they will belong to separate parity groups, one as part of parity-even and one as part of parity-odd. But is the rewrite legal?
For the FFT to get a correct result, the only requirement for Stage 3 is that the pairwise operands op1 and op2 be separated by 2s = 8, that is
- |op2 - op1| = 2s = 8
- The MSB toggles 0β1 or 1β0, and
- the remaining bits (all of which are toggling more slowly than once per pair) are constant, and thus
- the relationship between the two operands op1 and op2 in each pair is such that (|op2 - op1| == 8)
Below are two examples of the algorithm at work, calculating the schedules for Stages 0 and 3 of a sixteen-point transform using one radix-2 butterfly unit.
Datapoint d=d3d2d1d0 |
Reorder 1 (no change) d' = d3d2d1d0 |
Parity P'1,P'0 |
Reorder 2 (replace) d" = d'3d'2P'1P'0 |
Parity P"1,P"0 |
Memory bank m = P"1P"0 |
---|---|---|---|---|---|
d00= 0 0 0 0 d01= 0 0 0 1 d02= 0 0 1 0 d03= 0 0 1 1 |
0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 1 |
0,0 0,1 1,0 1,1 |
0 0 0 0 = d00 0 0 0 1 = d01 0 0 1 0 = d02 0 0 1 1 = d03 |
0,0 0,1 1,0 1,1 |
m0 m1 m2 m3 |
d04= 0 1 0 0 d05= 0 1 0 1 d06= 0 1 1 0 d07= 0 1 1 1 |
0 1 0 0 0 1 0 1 0 1 1 0 0 1 1 1 |
0,1 0,0 1,1 1,0 |
0 1 0 1 = d05 0 1 0 0 = d04 0 1 1 1 = d07 0 1 1 0 = d06 |
0,0 0,1 1,0 1,1 |
m0 m1 m2 m3 |
d08= 1 0 0 0 d09= 1 0 0 1 d10= 1 0 1 0 d11= 1 0 1 1 |
1 0 0 0 1 0 0 1 1 0 1 0 1 0 1 1 |
1,1 1,0 0,1 0,0 |
1 0 1 1 = d11 1 0 1 0 = d10 1 0 0 1 = d09 1 0 0 0 = d08 |
0,0 0,1 1,0 1,1 |
m0 m1 m2 m3 |
d12= 1 1 0 0 d13= 1 1 0 1 d14= 1 1 1 0 d15= 1 1 1 1 |
1 1 0 0 1 1 0 1 1 1 1 0 1 1 1 1 |
1,0 1,1 0,0 0,1 |
1 1 1 0 = d14 1 1 1 1 = d15 1 1 0 0 = d12 1 1 0 1 = d13 |
0,0 0,1 1,0 1,1 |
m0 m1 m2 m3 |
Example mapping for final stage s=S-1 (Stage 3) of a transform for D=16 datapoints and one butterfly (B=1)
Datapoint d=d3d2d1d0 |
Reorder 1 (rotate) d' = d3d0d2d1 |
Parity P'1,P'0 |
Reorder 2 (replace) d" = P'3P'2d'1d'0 |
Parity P"1,P"0 |
Memory bank m = P"1P"0 |
---|---|---|---|---|---|
d00= 0 0 0 0 d08= 1 0 0 0 d01= 0 0 0 1 d09= 1 0 0 1 |
0 0 0 0 1 0 0 0 0 1 0 0 1 1 0 0 |
0,0 1,0 0,1 1,1 |
0 0 0 0 =d00 1 0 0 0 =d08 0 1 0 0 =d04 1 1 0 0 =d12 |
0,0 1,0 0,1 1,1 |
m0 m1 m2 m3 |
d02= 0 0 1 0 d10= 1 0 1 0 d03= 0 0 1 1 d11= 1 0 1 1 |
0 0 0 1 1 0 0 1 0 1 0 1 1 1 0 1 |
0,1 1,1 0,0 1,0 |
0 1 0 1 =d05 1 1 0 1 =d13 0 0 0 1 =d01 1 0 0 1 =d09 |
0,0 1,0 0,1 1,1 |
m0 m1 m2 m3 |
d04= 0 1 0 0 d12= 1 1 0 0 d05= 0 1 0 1 d13= 1 1 0 1 |
0 0 1 0 1 0 1 0 0 1 1 0 1 1 1 0 |
1,0 0,0 1,1 0,1 |
1 0 1 0 =d10 0 0 1 0 =d02 1 1 1 0 =d14 0 1 1 0 =d06 |
0,0 1,0 0,1 1,1 |
m0 m1 m2 m3 |
d06= 0 1 1 0 d14= 1 1 1 0 d07= 0 1 1 1 d15= 1 1 1 1 |
0 0 1 1 1 0 1 1 0 1 1 1 1 1 1 1 |
1,1 0,1 1,0 0,0 |
1 1 1 1 =d15 0 1 1 1 =d07 1 0 1 1 =d11 0 0 1 1 =d03 |
0,0 1,0 0,1 1,1 |
m0 m1 m2 m3 |
Given a transform with D datapoints such that D is a power of two and D β₯ 4; [SR:] if the algorithm is to work correctly for one butterfly (B=1), we should have four memory banks (M=4) numbered 0 to 3, or 002 to 112 i.e. m1m0={002,012,102,112} i.e. m0=002, m1=012, m2=102 and m3=112.
The following equation(s) will show how to transform a given stage's datapoint sequence
- (dp(D-1) dp(D-2)...dp1,dp0)
into an equivalent sequence
- (dp'(D-1) dp'(D-2)...dp'1,dp'0)
- (mb'(D-1) mb'(D-2)...mb'0,mb'1)
- (m0,m1,m2,m3, m0,m1,m2,m3 ... m0,m1,m2,m3),
So:
Each datapoint dps,i in a given transform stage s β (0 .. S-1) is an S-bit number
- dps,i = dS-1dS-2...d1d0
If s = S-1 (final stage), we will replace each datapoint
- dps,i = dS-1dS-2...d1d0
- dp's,i = dS-1d0dS-2...d1
For all other stages dp' is the same as dp. In other words:
dp's,i = dps,i | β s β (0 .. S - 2) | and i β (0 .. D-1) | |
dp's,i = dS-1d0dS-2...d1 | β s = (S - 1) | and i β (0 .. D-1) |
Having made this first transformation
- dp's,i = Ζ(dps,i),
- dp"s,i = d"S-1d"S-2...d"1d"0
if s β (0 .. S-2) then | d"s = P's and d"(s+1) = P'(s+1) | and all other d"j = d'j , where j β (0..S-1) and j β {s,s+1} |
else (final stage s = S-1) | d"s = P's and d"s-1 = P's-1 | and all other d"j = d'j , where j β (1..S-2) |
and
- P'0 = P'even = Parity(d'S-2,d'S-4...d'2,d'0)
- P'1 = P'odd = Parity(d'S-1,d'S-3...d'3,d'1)
P'k = P'0 = P'even | if k is even and |
P'k = P'1 = P'odd | if k is odd |
OOPS maybe need to add step 3 where Memory bank m = P"1P"0
Having solved the problem for B=1 and M=4, the extension to larger values of B is fairly straightforward.
For our algorithm to work, there must always be four memory banks for each butterfly unit. For one butterfly (B=1) we had four memory banks (M=4) and thus two toggling bits. When there are two butterfly units (B=2) we have eight memory banks (M=8) and thus three toggling bits, and so on, adding one more toggling bit for each doubling of B.
Following the logic we used to develop the algorithm for B=1 above, we can continue to define d' = d for those stages where the toggling bits are adjacent and do not wrap around the high end of the address, as in e.g. stages 0 and 1 of a two-butterfly implementation of a 16-point transform with three toggling bits.
<caption>With three toggle bits, non-adjacency begins in Stage 2</caption>Stage 0 | Stage 1 | Stage 2 | Stage 3 | |
---|---|---|---|---|
|
|
|
|
|
Oct 0 | d00=0000 | d00=0000 | d00=0000 | d00=0000 |
d01=0001 | d02=0010 | d04=0100 | d08=1000 | |
d02=0010 | d04=0100 | d08=1000 | d01=0001 | |
d03=0011 | d06=0110 | d12=1100 | d09=1001 | |
d04=0100 | d08=1000 | d01=0001 | d02=0010 | |
d05=0101 | d10=1010 | d05=0101 | d10=1010 | |
d06=0110 | d12=1100 | d09=1001 | d03=0011 | |
d07=0111 | d14=1110 | d13=1101 | d11=1011 | |
|
|
|
|
|
... | ... | ... | ... | ... |
Oct 1 | d14=1110 | d13=1101 | d11=1011 | d07=0111 |
d15=1111 | d15=1111 | d15=1111 | d15=1111 |
Note, extending the introductory explanation to larger values of B, because we have eight memory banks we now have two "octs" (groups of eight) with three toggle bits each instead of four "quads" (groups of four) with two toggle bits.
Stages 2 and 3 now both have non-adjacent toggle bits, and in each of these stages we have to wrap the lower address bits such that the low toggling bits adjoin the high toggling bits, as in the example below.
<caption>With three toggle bits, non-adjacency must be corrected in Stages 2 and 3</caption>(Stages 0 and 1 same as before) | (New/rotated Stages 2 and 3) | |||
---|---|---|---|---|
Stage 0 | Stage 1 | Stage 2 | Stage 3 | |
|
|
|
|
|
Oct 0 | d00=0000 | d00=0000 | d00=0000 | d00=0000 |
d01=0001 | d02=0010 | d04=0100 | d08=1000 | |
d02=0010 | d04=0100 | d08=1000 | d02=0010 | |
d03=0011 | d06=0110 | d12=1100 | d10=1010 | |
d04=0100 | d08=0001 | d02=0010 | d04=0100 | |
d05=0101 | d10=0011 | d06=0110 | d12=1100 | |
d06=0110 | d12=0101 | d10=1010 | d06=0110 | |
d07=0111 | d14=0111 | d14=1110 | d14=1110 | |
|
|
|
|
|
... | ... | ... | ... | ... |
Oct 1 | d14=1110 | d13=1101 | d11=1011 | d07=0111 |
d15=1111 | d15=1111 | d15=1111 | d15=1111 |
Our original discussion for a one-butterfly FFT talked about quads mapping to four memory banks. For more than one butterfly we update this discussion with octs that map to eight memory banks, in the case of two butterfly units, or sixteens mapping to sixteen banks when B=4, etc. The algorithm works so long as there are four memory banks for every butterfly unit, i.e. M=4B.
Similarly, when B=2 (two butterflies) we have three toggle bits instead of two, and when B=4 we have four toggle bits, and so on. And instead of simply having odd and even parity, as in the original case where B=1, we need to define some new terms for parity, that is
- P30 = the parity of every third bit including bit b0, i.e. P30 = XOR(b0, b3, b6, ...)
- P31 = the parity of every third bit including bit b1, i.e. P31 = XOR(b1, b4, b7, ...)
- Pni = the parity of every nth bit bj such that (j mod n) = (i mod n)
- Peven = P20 = P22 = P24 = ...
- Podd = P21 = P23 = P25 = ...
- P30 = P33 = P36 = P39 = ...
- and so on
In the case where there are two butterfly units (B=2), there are now three toggle bits ti, ti+1 and ti+2; to canonicalize the bank order (eight banks now instead of four), we replace these toggle bits with parity bits P3i, P2i+1 and P2i+2
In general, then, when B=2n, there will be 4B memory banks and n+1 toggle bits titi+1ti+2..ti+n which we replace with parity bits PniPni+1Pni+2..Pni+n in order to remap the operand addresses such that each sequential group of M operands in a given stage have the exact same memory-bank sequence.
Below are two examples of the algorithm at work, calculating the schedules for Stages 0 and 3 of a sixteen-point transform using two radix-2 butterfly units.
<caption>Example mapping for stage s=0 of a transform for D=16 datapoints and two butterflies (B=2)</caption>
Datapoint d=d3d2d1d0 |
Reorder 1 (no change) d' = d3d2d1d0 |
Parity (P=P3) P'2,P'1,P'0 |
Reorder 2 (replace) d" = d'3P'2P'1P'0 |
Parity (P=P3) P"2,P"1,P"0 ! Memory bank m = P"2P"1P"0 |
|
---|---|---|---|---|---|
d00= 0 0 0 0 d01= 0 0 0 1 d02= 0 0 1 0 d03= 0 0 1 1 |
0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 1 |
0,0,0 0,0,1 0,1,0 0,1,1 |
0 0 0 0 = d00 0 0 0 1 = d01 0 0 1 0 = d02 0 0 1 1 = d03 |
0,0,0 0,0,1 0,1,0 0,1,1 |
m0 m1 m2 m3 |
d04= 0 1 0 0 d05= 0 1 0 1 d06= 0 1 1 0 d07= 0 1 1 1 |
0 1 0 0 0 1 0 1 0 1 1 0 0 1 1 1 |
1,0,0 1,0,1 1,1,0 1,1,1 |
0 1 0 0 = d04 0 1 0 1 = d05 0 1 1 0 = d06 0 1 1 1 = d07 |
1,0,0 1,0,1 1,1,0 1,1,1 |
m4 m5 m6 m7 |
d08= 1 0 0 0 d09= 1 0 0 1 d10= 1 0 1 0 d11= 1 0 1 1 |
1 0 0 0 1 0 0 1 1 0 1 0 1 0 1 1 |
0,0,1 0,0,0 0,1,1 0,1,0 |
1 0 0 1 = d09 1 0 0 0 = d08 1 0 1 1 = d11 1 0 1 0 = d10 |
0,0,0 0,0,1 0,1,0 0,1,1 |
m0 m1 m2 m3 |
d12= 1 1 0 0 d13= 1 1 0 1 d14= 1 1 1 0 d15= 1 1 1 1 |
1 1 0 0 1 1 0 1 1 1 1 0 1 1 1 1 |
1,0,1 1,0,0 1,1,1 1,1,0 |
1 1 0 1 = d13 1 1 0 0 = d12 1 1 1 1 = d15 1 1 1 0 = d14 |
1,0,0 1,0,1 1,1,0 1,1,1 |
m4 m5 m6 m7 |
Example mapping for final stage s=S-1 (Stage 3) of a transform for D=16 datapoints and two butterflies (B=2)
Datapoint d=d3d2d1d0 |
Reorder 1 (rotate) d' = d3d1d0d2 |
Parity (P=P3) P'2,P'1,P'0 |
Reorder 2 (replace) d" = P'3P'2P'1d'0 |
Parity (P=P3) P"2,P"1,P"0 ! Memory bank m = P"2P"1P"0 |
|
---|---|---|---|---|---|
d00= 0 0 0 0 d08= 1 0 0 0 d01= 0 0 0 1 d09= 1 0 0 1 |
0 0 0 0 1 0 0 0 0 0 1 0 1 0 1 0 |
0,0,0 0,0,1 0,1,0 0,1,1 |
0 0 0 0 =d00 1 0 0 0 =d08 0 0 1 0 =d02 1 0 1 0 =d10 |
0,0,0 0,0,1 0,1,0 0,1,1 |
m0 m1 m2 m3 |
d02= 0 0 1 0 d10= 1 0 1 0 d03= 0 0 1 1 d11= 1 0 1 1 |
0 1 0 0 1 1 0 0 0 1 1 0 1 1 1 0 |
1,0,0 1,0,1 1,1,0 1,1,1 |
0 1 0 0 =d04 1 1 0 0 =d12 0 1 1 0 =d06 1 1 1 0 =d14 |
1,0,0 1,0,1 1,1,0 1,1,1 |
m4 m5 m6 m7 |
d04= 0 1 0 0 d12= 1 1 0 0 d05= 0 1 0 1 d13= 1 1 0 1 |
0 0 0 1 1 0 0 1 0 0 1 1 1 0 1 1 |
0,0,1 0,0,0 0,1,1 0,1,0 |
1 0 0 1 =d09 0 0 0 1 =d01 1 0 1 1 =d11 0 0 1 1 =d03 |
0,0,0 0,0,1 0,1,0 0,1,1 |
m0 m1 m2 m3 |
d06= 0 1 1 0 d14= 1 1 1 0 d07= 0 1 1 1 d15= 1 1 1 1 |
0 1 0 1 1 1 0 1 0 1 1 1 1 1 1 1 |
1,0,1 1,0,0 1,1,1 1,1,0 |
1 1 0 1 =d13 0 1 0 1 =d05 1 1 1 1 =d15 0 1 1 1 =d07 |
1,0,0 1,0,1 1,1,0 1,1,1 |
m4 m5 m6 m7 |
Given a transform with D datapoints such that D is a power of two and D β₯ 4; [SR:] if the algorithm is to work correctly for any power-of-two number of butterfly units B=2n, we should have 4B memory banks (M=4B) numbered 0 to (4B-1), or 00...002 to 11...112 i.e.
Bank Name |
Bank Number m |
Bank Bits mn+1mnmn-1...m0 |
---|---|---|
m0 | (m=0) | 00...002 = 0 |
m1 | (m=1) | 00...012 = 1 |
β | β | β |
mn | (m=4B-2) | 11...102 = 4B-2 |
m(n+1) | (m=4B-1) | 11...112 = 4B-1 |
Each operand address dps,i in a given stage s has T toggle bits where T=log2M
The following equation(s) will show how to transform a given stage's datapoint sequence
- (dp(D-1) dp(D-2)...dp1,dp0)
into an equivalent sequence
- (dp'(D-1) dp'(D-2)...dp'1,dp'0)
- (mb'(D-1) mb'(D-2)...mb'0,mb'1)
- (
- (m0,m1,m2 ... m(4B-1)),
- (m0,m1,m2 ... m(4B-1)),
- (m0,m1,m2 ... m(4B-1)),
- ...)
So:
Each datapoint dps,i in a given transform stage s β (0 .. S-1) is an S-bit number
- dps,i = dS-1dS-2...d1d0
If s β (S-(T-1)..S-1) (i.e. final T-1 stages), we will replace each datapoint
- dps,i = (dS-1dS-2...ds)(ds-1ds-2...dT-1)(dT-2dT-3...d0)
- dp's,i = (dS-1dS-2...ds)(dT-2dT-3...d0)(ds-1ds-2...dT-1)
For all other stages dp' is the same as dp. In other words:
dp's,i = dps,i | β s β (0 .. S-T) | and i β (0 .. D-1) | |
dp's,i = (dS-1dS-2...ds)(dT-2dT-3...d0)(ds-1ds-2...dT-1) | β s β (S - (T-1) .. S-1) | and i β (0 .. D-1) |
Having made this first transformation
- dp's,i = Ζ(dps,i),
- dp"s,i = d"S-1d"S-2...d"1d"0
if s β (0 .. S-T) then | d"sd"s+1d"s+2..d"s+(T-1) = P'sP's+1P's+2..P's+(T-1) | and all other d"j = d'j |
else (final T-1 stages) | d"sd"s -1d"s -2..d"s -(T-1) = P'sP's -1P's -2..P's -(T-1) | and all other d"j = d'j |
and
- P'i = PTi = the parity of every T th bit d'j such that (j mod T) = (i mod T)
- P'0 = P30 = XOR(d'0, d'3, d'6, ...)
- P'1 = P31 = XOR(d'1, d'4, d'7, ...)
- P'2 = P32 = XOR(d'2, d'5, d'8, ...)
- P'3 = P30 = XOR(d'0, d'3, d'6, ...)
- etc.
I think the case for radix R>2 is exactly analogous to the case where B>1. That is, for e.g. a one-butterfly radix-four operation (B=1 and R=4), the operands are grouped by four in exactly the way that they would be grouped for a two-butterfly radix-two operation (B=2 and R=2). In other words, the grouping for [TODO/FIXME]
For radix-2 operation we needed M=4B memory banks for groups of 4B operands. In the general case, where R can be any power of 2 greater than or equal to 2, we need M=4B(R-2) memory banks. Note also that S and s must change appropriately; the remainder of the proof for radix R>2 is left as an exercise to the reader.
Oops still need to write something here maybe.
We will try to use symbols consistently throughout this writeup, as described by this alphabetically-ordered reference table.
Italicized symbols (B, D, d, M) represent numbers, non-italicized symbols (d0, m3) are names.
lower-case symbols (d, m) are variable, UPPER-CASE symbols (D, M) are constant (for a given FFT).
Symbol | Meaning | Notes |
---|---|---|
B | How many butterfly units | A power of two, e.g. 8, 16, 1024 or 8096. To complete a transform, an FFT with e.g. one butterfly unit (B=1) takes about four times as long as an FFT with four butterfly units (B=4). |
D | How many datapoints | Number of datapoints to transform. A power of two, e.g. 8, 16, 1024 or 8096. |
d | Datapoint | An eight-point transform has datapoints d=0 through d=7. |
di | A specific datapoint |
A specific datapoint i. For example, d4 is the datapoint d such that d == 4. An eight-point transform has datapoints d0 through d7 (see sched(s) below). |
di | Datapoint bit number i | d0 is always the least-significant bit (LSB). Example: datapoint d13 has four bits d3d2d1d0 = 11012 |
dps,i | Datapoint i in stage s |
Same as sched(s)i (see below). Each datapoint dps,i in a given transform stage s β (0 .. S-1) is an S-bit number dps,i = dS-1dS-2...d1d0, where i β (0..D-1). |
dpi | dps,i with implied s | Shorthand for dps,i where exact stage s is implied by surrounding context. |
i, j, k, n | Unknown integer | We usually start numbering from i=0 |
M | How many memory banks | For purposes of this proof M == 4B always. |
m | Memory bank | Generally need four memory banks per butterfly e.g. m=0 through m=7 for two butterflies (B=2). |
mi | A specific memory bank | Generally need four memory banks per butterfly e.g. m0 through m7 for two butterflies (B=2). |
mi | Memory bank bit number i | Generally need four memory banks per butterfly e.g. m2m1m0={000,001 ... 111} for two butterflies (B=2). |
Mbits | Bits per bank number | How many bits in each mem bank number m; Mbits = log2(M) |
mbs,i | Memory bank of dps,i | The memory bank corresponding to datapoint i of stage s. |
mbi | mbs,i with implied s |
Shorthand for mbs,i where exact stage s is implied by surrounding context. For a one-butterfly eight-point transform, our algorithm results in three stages Stage 0: (mb0,mb1,mb2,mb3,mb4,mb5,mb6,mb7) = (m0,m1,m2,m3, m0,m1,m2,m3) Stage 1: (mb0,mb1,mb2,mb3,mb4,mb5,mb6,mb7) = (m0,m2,m1,m3, m0,m2,m1,m3) Stage 2: (mb0,mb1,mb2,mb3,mb4,mb5,mb6,mb7) = (m0,m1,m2,m3, m0,m1,m2,m3) |
Pod, Podd |
Parity of odd bits |
Example: decimal 13 has four bits d3d2d1d0 = 1101; thus Pod is parity(d3,d1) = parity(1,0) = 1 Note Podd is the same as P21 (below). |
Pev, Peven |
Parity of even bits |
Example: decimal 13 has four bits d3d2d1d0 = 1101; thus Pev is parity(d2,d0) = parity(1,1) = 0 Note Peven is the same as P20 (below). |
Pni | Parity of periodic bits |
The parity of every nth bit bj such that (j mod n) = (i mod n) Example: P30 = the parity of every third bit including bit b0, i.e. P30 = XOR(b0, b3, b6, ...) Note that, by this definition, P30 = P33 = P36 = P39 and so on. |
R | Radix | Default is 2 |
S | How many stages | Number of stages in a transform, e.g. an eight-point transform has three stages, so S=3 |
s | Stage number | An eight-point transform has three stages s=0, s=1 and s=2. |
sched(s) | Schedule for stage s |
The FFT schedule for stage s of a transform. E.g. for an 8-point transform there are three stages sched(0) = (d0,d1,d2,d3,d4,d5,d6,d7) sched(1) = (d0,d2,d4,d6,d1,d3,d5,d7) and sched(2) = (d0,d4,d1,d5,d2,d6,d3,d7) |
sched(s)i | iβth element of sched(s) | The iβth element of sched(s), e.g. in an 8-point transform, sched(1)1 = d2 (see above). |
T | How many toggle bits | The number of toggle bits in each datapoint address. T == log2M |
ti | Toggle bit number i | t0 is always the fastest-toggling (least significant) toggle bit. Example: a one-butterfly eight-point transform has two toggle bits. In Stage 0, t1t0 = d1d0; in Stage 1, t1t0 = d2d1; and in Stage 2, t1t0 = d0d2. |
Term | Meaning | Notes |
---|---|---|
LSB | Least-significant bit | The least-significant bit in the binary representation of a given number. All even integers have LSB = 0 and all odd integers have LSB = 1. |
MSB | Most-significant bit | The most-significant bit of a number depends on the number of bits used to represent the binary form of that number. E.g. if 2010 is represented using five bits 101002, the MSB is 1, but if it is represented as eight bits 0001β01002 the MSB is 0. |
oct? | group of eight? aligned vs. unaligned? | |
quad? | group of four? aligned vs. unaligned? | |
others? |
The algorithm as implemented in the Perl test program (FIXME/TODO: NOTE BENE THIS IS NOT CORRECT, NEEDS SEVERE UPDATING)
# Normal treatment for stages 0 through nstages-2 if ($s != ($nstages-1)) { $op1 = smart_replace($op1, $s); $op1 = smart_replace($op1, $s+1); $op2 = smart_replace($op2, $s); $op2 = smart_replace($op2, $s+1); } # Last stage nstages-1 needs special treatment. elsif ($s%2 == 1) { # Even number of bits; use zero instead of s+1 maybe. $op1 = smart_replace($op1, $s); $op1 = smart_replace($op1, 0); $op2 = smart_replace($op2, $s); $op2 = smart_replace($op2, 0); } else { # Odd number of bits; gotta do sumpm crazy maybe. $op1 = weird_replace($op1, $s); $op2 = weird_replace($op2, $s); } # Swizzle to assign correct bank & row to $op1, $op2 my ($b1, $r1) = swizzler::swizzle($op1, $npoints, 4*$nunits); my ($b2, $r2) = swizzler::swizzle($op2, $npoints, 4*$nunits); ############################################################################## sub smart_replace { # In the given number $n, replace bit $i with new bit $b # If $i is odd, new bit $b is the parity of the odd bits of $n # If $i is even, new bit $b is the parity of the even bits of $n # my $n = shift; my $i = shift; my $newbit = $i%2 ? parity_odd($n) : parity_even($n); $n = replace_bit($n, $i, $newbit); # Put $newbit into $n at position $i return $n; } sub weird_replace { # Given number $n and ultimate stage $s, where $s is even, do some weird $#!? my $n = shift; my $s = shift; # This is supposed to work at least for when nbits(datapoints) is odd. i think. my $b0 = ($n & 0x1); my $b1 = ($n & 0x2) >> 1; my $new_top_bit = parity_even($n) ^ $b1; my $new_b1 = parity_odd($n) ^ $b1 ^ $b0; my $new_b0 = $b1 ^ $b0; # Complicated enough yet!? $n = replace_bit($n,$s,$new_top_bit); # Put $new_top_bit into $n at position $s $n = replace_bit($n, 1,$new_b1); # Put $new_b1 into $n at position 1 $n = replace_bit($n, 0,$new_b0); # Put $new_b0 into $n at position 0 (LSB right?) return $n; }
Caveat: The Stanford FFT generator, which I authored, may not yet actually use the algorithm that results directly from this proof. But I plan to do that eventually...!
Todo:
- present to the group and invite comments including where/whether to submit to journal, conference, etc.
- update the Perl program in the appendix.
- change to latex format for publication (replace html page with a latex page).
- make sure FFT generator (correctly) uses algorithm described in the formal proof!
Projects: FFT Formal Proof v0 (old) - Old formal proof with separate boundary conditions for odd and even number of bits, etc.
Projects: FFT Formal Proof - Auxiliary Info
Projects: FFT_Formal_Proof - Auxiliary_Info#Trash
Hand-written notes where the proof originally was developed:
https://vlsiweb.stanford.edu/mediawiki/index.php/File:Formal_proof_notes.pdf