Table of Contents DO NOT READ THIS PAPER!!! The Problem Developing the proof for one butterfly unit (B=1) Key Insights Insight 1: We can ensure that each of the four memory banks m0, m1, m2 and m3 are represented in every aligned quad, in some order Insight 2: We can reorder the datapoints such that each quad contains exactly four memory banks m0, m1, m2 and m3 in a consistent order Summary Boundary condition: non-adjacent toggle bits The final-stage problem of non-adjacent toggle bits Solving the final-stage problem of non-adjacent toggle bits Example for one butterfly unit (B=1) Example mapping for stage s=0 of a transform for D=16 datapoints and one butterfly (B=1) Example mapping for final stage s=S-1 (Stage 3) of a transform for D=16 datapoints and one butterfly (B=1) Algorithm 1: One radix-2 butterfly (R=2 and B=1) What we're gonna do. Step 1: dp → dp' Step 2: dp' → dp" Step 3?: m = P"₁P"_{0 ?} Extension of the proof to more than one butterfly unit (B>1) Example for more than one butterfly unit (B=2) Example mapping for stage s=0 of a transform for D=16 datapoints and two butterflies (B=2) Example mapping for final stage s=S-1 (Stage 3) of a transform for D=16 datapoints and two butterflies (B=2) Algorithm 2: More than one radix-2 butterfly (R=2 and B>1) What we're gonna do (B>1) Step 1: dp → dp' (B>1) Step 2: dp' → dp" (B>1) Extension to radix R>2, longer pipelines Extension to R>2 Extension to longer pipelines. Nomenclature and Terminology Nomenclature (Symbols) Terminology Appendix: Perl program that implements the algorithm Major caveat(s), to-do and notes Other Notes, Trash and Auxiliary Info Original hand-written notes

DO NOT READ THIS PAPER!!!

I mean, sure, you can read it if you want to, but it's pretty awful. All the information is there, but it's just not arranged very well (IMO as the author). I'm starting all over with a second attempt here:

https://vlsiweb.stanford.edu/mediawiki/index.php/LaTeX:FFT_Formal_Proof

Meanwhile, continue reading this paper if you like, but read on at your own risk...paper continues below...

The Problem

A detailed description of the problem we want to solve can be found in the 2013 VLSI-SoC paper

An area-efficient minimum-time FFT schedule using single-ported memory, Richardson, Shacham, Markovic and Horowitz. Proceedings of the 21st IFIP/IEEE International Conference on VLSI (VLSI-SoC), 2013.

To summarize the problem here: take a standard schedule of operations for a Fast Fourier Transform processor (FFT) based on a radix-2 Cooley-Tukey algorithm [ref], e.g. for D=8 datapoints and one butterfly unit, as shown in Figure 1. (Note: for your convenience, symbols and conventions used in the ensuing discussion are elaborated in the Nomenclature section at the end of the paper).

FIGURE 1

That is, suppose we have eight datapoints d0 through d7 that we want to transform from the time domain to the frequency domain. In Stage 0 of operation, a radix-2 butterfly unit reads datapoints d0 and d1, producing two results that then overwrite the original locations of d0 and d1 in memory. Similar operations occur for datapoint pairs (d2,d3), (d4,d5) and (d6,d7) respectively. In Stage 1, the datapoints are processed with stride-two pairings e.g. (d0,d2), (d4,d6), (d1,d3) and (d5,d7). And finally Stage 2 completes the operation, re-walking the data by stride 4. Table 1 summarizes this access pattern.

 TABLE 1
 Stage 0: (d0,d1) (d2,d3) (d4,d5) (d6,d7)
 Stage 1: (d0,d2) (d4,d6) (d1,d3) (d5,d7)
 Stage 2: (d0,d4) (d1,d5) (d2,d6) (d3,d7)

If the butterfly unit is pipelined such that result writebacks occur at the same time as operand reads, then datapoints (d0,d1) get written to memory at the same time as (d2,d3) are being read from memory. For memory to be able to supply all four datapoints at once, it needs to be four-ported. Alternatively, we could try to use two separate banks of two-ported memory or even four banks of single-ported memory. Of these three choices, four banks of single-ported memory is most efficient in terms of area, so we'd like to make that work if possible.

Now: Within any stage, look at any sequential group of four operands starting on a mod-two boundary. We'll call such a group a quad. If the quad happens to start on a mod-four boundary, we'll call this an aligned quad. Stage 0 has three overlapping quads: (d0,d1,d2,d3), (d2,d3,d4,d5) and (d4,d5,d6,d7), the first and last of which are aligned quads. Our goal is to map each datapoint to one of four memory banks (m0,m1,m2,m3) such that every quad contains exactly one of each bank number m0, m1, m2 and m3. For example, we could map datapoints (d0,d1,d2,d3,d4,d5,d6,d7) to memory banks (m0,m1,m2,m3,m0,m1,m2,m3) respectively, and thus fulfill our goal for each of the three quads in Stage 0:

(d0,d1,d2,d3) maps to (m0,m1,m2,m3)
(d2,d3,d4,d5) maps to (m2,m3,m0,m1)
(d4,d5,d6,d7) maps to (m0,m1,m2,m3)

The first quad in Stage 1, however, fails because its datapoints (d0,d2,d4,d6) would map to (m0,m2,m0,m2). Banks m1 and m3 are not represented in the quad, while m0 and m2 each appear twice.

We're going to fulfill our goal by enforcing an order to each aligned quad, such that every aligned quad within a stage has the same memory bank ordering e.g. (m0,m1,m2,m3). This means that every unaligned quad will also have a consistent ordering, e.g. (m2,m3,m1,m0), and thus we meet our goal of uniqueness for all quads. This property will hold whether the transform is operating on eight datapoints, with two aligned quads per stage, or eight thousand datapoints, with two thousand quads per stage.

In theory the four operations in a given stage all happen simultaneously. In practice, it's not always practical to have enough butterfly units for that. So for instance, if we only have one butterfly unit, it will have to be used four times in succession to complete the transformation for a given stage, using the following four operations:

read, process and writeback datapoints d0 and d1;
read, process and writeback datapoints d2 and d3;
read, process and writeback datapoints d4 and d5;
read, process and writeback datapoints d6 and d7.

In pipelined operation, the writeback of a given data pair occurs at the same time as the read of the next pair of datapoints. Thus for example datapoints (d0,d1,d2,d3) will all have to be accessible at the same time within the same cycle, as will the datapoints in each of the other quads. We want to use single-ported memory for the datapoints, and that is why we require that the datapoints in each quad must all live in separate memory banks (and assuming, of course, that each memory bank has its own separate path to the butterfly unit).

Developing the proof for one butterfly unit (B=1)

Key Insights

Insight 1: We can ensure that each of the four memory banks m0, m1, m2 and m3 are represented in every aligned quad, in some order

Once again, here is our schedule for D=8 datapoints:

Stage 0: (d0,d1) (d2,d3) (d4,d5) (d6,d7)
Stage 1: (d0,d2) (d4,d6) (d1,d3) (d5,d7)
Stage 2: (d0,d4) (d1,d5) (d2,d6) (d3,d7)

Looking carefully at the aligned quads, we see that they all share a unique property. For example, consider the quad represented by the four operands in the last two operations of Stage 1: (d1,d3),(d5,d7). Looking at the binary representation of the indices within any such group of four, only two bits change, the bits d_s+1 and d_s where s is the stage number and bit d₀ is the least significant bit, or LSB. [NB] We call these two bits the toggling bits, and they always count 00, 01, 10, 11 within an aligned quad. In our example, the binary indices for (d1,d3),(d5,d7) are (00001,00011),(00101,00111). The toggling bits d₂ and d₁ count (00,01),(10,11) while the non-toggling bits d₄d₃d₀ are always 001.

Given this information, we can assign banks to operands such that banknum = P_odP_ev where P_od is the parity of ALL odd bits and P_ev is the parity of ALL even bits. This means that any given sequential block of four operands will map to four separate banks.

In our example quad [(d1,d3),(d5,d7)], or [(00001,00011),(00101,00111)]

d1: P_od(00001) = parity(0,0) = 0; P_ev(00001) = parity(0,0,1) = 1; P_odP_ev = 01; banknum m = 1
d3: P_od(00011) = parity(0,1) = 1; P_ev(00011) = parity(0,0,1) = 1; P_odP_ev = 11; banknum m = 3
d5: P_od(00101) = parity(0,0) = 0; P_ev(00101) = parity(0,1,1) = 0; P_odP_ev = 00; banknum m = 0
d7: P_od(00111) = parity(0,1) = 1; P_ev(00111) = parity(0,1,1) = 0; P_odP_ev = 10; banknum m = 2

Insight 2: We can reorder the datapoints such that each quad contains exactly four memory banks m0, m1, m2 and m3 in a consistent order

When the stage number s is even [footnote:], then P_ev within any group of four will be either (0,1,0,1) or (1,0,1,0) depending on the (constant 0 or 1) parity of the non-toggling bits. Why? Because we know that bit d_s within any group of four always counts (0,1,0,1) while the remaining (non-toggling) even bits remain constant. Similarly, P_od will be either (0,0,1,1) or (1,1,0,0). Bank number P_odP_ev for a quad can thus be one of the following four groups:

(00,01,10,11);
(01,00,11,10);
(10,11,00,01); or
(11,10,01,00).

But we want to canonicalize the bank order such that bank numbers always follow the same sequence e.g. (00,01,10,11) for every aligned quad within the stage. To do this, we must conspire to e.g. make P_ev count only (0,1,0,1) within each aligned group of four and never (1,0,1,0). Similarly, P_od must always count e.g. (0,0,1,1) and never (1,1,0,0).

Suppose, that, when stage number s is even and when P_ev is about to count (1,0,1,0), we change d_s (a component of P_ev because s is even) such that it counts (1,0,1,0) instead of the usual (0,1,0,1). In other words, instead of d=(d1,d3,d5,d7), where d₁ goes (0,1,0,1) we change the sequence to d'=(d3,d1,d7,d5), so that d'₁ goes (1,0,1,0). Then P'_ev for the new sequence d' will count in the desired order (0,1,0,1).

In general, is it okay to change the sequence like this when needed? Yes it is because 1) within any given operand pair, the order is unimportant, thus as far as the FFT is concerned, (d3,d1,d7,d5) is the same as (d1,d3,d5,d7); and 2) the net effect of changing the polarity of the least-significant toggle bit d_s is simply to swap the order of operands in a given pair.

How do we know when P_ev is going to count the wrong direction such that d_s will need to change? Simple:

 if ds counts   and Pev counts    then do this    or in other words
    0,1,0,1        0,1,0,1          ds' =  ds         ds' = Pev
    0,1,0,1        1,0,1,0          ds' = ~ds         ds' = Pev
    1,0,1,0        0,1,0,1          ds' =  ds         ds' = Pev
    1,0,1,0        1,0,1,0          ds' = ~ds         ds' = Pev

In other words, we simply replace d_s with P_ev to form the new datapoint sequence d'. P'_ev for the new sequence order will always be (0,1,0,1, 0,1,0,1, ...)

We also need to canonicalize P_od such that it always counts (0,0,1,1) and never (1,1,0,0). For this, we note when P_od is about to count (1,1,0,0), and then we change d_s+1 (a component of P_od because s is even) such that it counts (1,1,0,0) instead of the usual (0,0,1,1). In other words, instead of (d1,d3,d5,d7), where d₂ goes (0,0,1,1) we would change the sequence to (d5,d7,d3,d1), so that d₂ goes (1,1,0,0).

In general, is it okay to change the sequence like this when needed? Yes it is because 1) within any given stage, the order of the operand pairs is unimportant; as far as the FFT is concerned, (d5,d7,d1,d3) is the same as (d1,d3,d5,d7); and 2) the net effect of changing the polarity of the toggle bit d_s+1 is simply to swap the order of operand pairs in a given quad. Following logic similar to that used for P_ev, we find that we can canonicalize the order of P'_od by replacing d_s+1 with P_od.

Using the transformation described above, then, when s is even we can change any data sequence d into an equally-valid data sequence d' such that we achieve a consistent bank ordering (m0,m1,m2,m3, m0,m1,m2,m3, ...) And it turns out that, using the same transformation when s is odd, we achieve a consistent bank ordering (m0,m2,m1,m3, m0,m2,m1,m3, ...)

Summary

Somewhat more formally, then, let's define P_i such that

P_i = P_ev when i is even, and
P_i = P_od when i is odd.

Now we see that WE CAN ALWAYS REPLACE d_s+1 AND d_s WITH P_s+1 AND P_s RESPECTIVELY TO CANONICALIZE THE BANK ORDER. In particular, given this transformation, banks in an aligned even-stage quad will always count (00,01,10,11) and the banks in an aligned odd-stage quad will always count (00,10,01,11).

In our example, (d1,d3,d5,d7) is part of an odd stage; P_od=(0,1,0,1) and P_ev=(1,1,0,0). Because P_ev counts incorrectly (1,1,0,0) instead of (0,0,1,1), we replace d_s+1 with P_ev to come up with a new ordering (d5,d7,d1,d3). Now P_od=(0,1,0,1) and P_ev=(0,0,1,1) and final bank number assignments read (00,10,01,11), which is the correct canonical order for an odd stage.

This is the core of the proof. However, we still need to place it in a more formal context. Also, we must discuss:

the special cases for the boundary condition when s and s+1 wrap around at the end, that is when s = S-1.
how the proof extends to any power-of two values for B, D or R maybe;
and, for advanced work, we should mention how the proof extends to any pipeline depth.

Boundary condition: non-adjacent toggle bits

The final-stage problem of non-adjacent toggle bits

Note that our insight regarding quads changes slightly in the final stage of operation, when s=S-1. In all previous stages up to that point, the toggling bits are adjacent (see example below).

**The Stage-3 problem of non-adjacent toggle bits**
	Stage 0	Stage 1	Stage 2	Stage 3

Quad 0	d00=0000	d00=0000	d00=0000	d00=0000
d01=0001	d02=0010	d04=0100	d08=1000
d02=0010	d04=0100	d08=1000	d01=0001
d03=0011	d06=0110	d12=1100	d09=1001

Quad 1	d04=0100	d08=1000	d01=0001	d02=0010
d05=0101	d10=1010	d05=0101	d10=1010
d06=0110	d12=1100	d09=1001	d03=0011
d07=0111	d14=1110	d13=1101	d11=1011

...	...	...	...	...
Quad 3	d14=1110	d13=1101	d11=1011	d07=0111
d15=1111	d15=1111	d15=1111	d15=1111

At the final stage s=S-1 (Stage 3 in the example), however, the toggling bits wrap around, such that the toggling bits are the least- and most-significant bits (LSB and MSB), i.e. bits d₀ and d_s instead of the usual bits d_s+1 and d_s (because in this case d_s+1 does not exist).

This wrap works fine if the number of bits is even, in which case d₀ is naturally part of P_ev and d_s part of P_od, same as in the case where the toggling bits are adjacent. When the number of bits is odd, however, we run into trouble, because now both of the toggling bits in stage S-1 (bits d₀ and d_s) are even.

Is there an easy way to solve this problem?

Solving the final-stage problem of non-adjacent toggle bits

Suppose we rewrite the Stage 3 addresses in the following way, rotating the lower three bits such that the toggling bits are once again adjacent, e.g. in our example we leave the most-significant bit (bit 3) alone and right-rotate the lower three bits, that is

d₃d₂d₁d₀ → d₃d₀d₂d₁

where the toggling bits are the MSB d₃ and the LSB d₀.

**Solving the Stage-3 problem of non-adjacent toggle bits.**
	...................................(Stages 0, 1, 2, same as before).............................			(new/rotated)
	Stage 0	Stage 1	Stage 2	Stage 3

Quad 0	d00=0000	d00=0000	d00=0000	d00=0000
d01=0001	d02=0010	d04=0100	d08=1000
d02=0010	d04=0100	d08=1000	d04=0100
d03=0011	d06=0110	d12=1100	d12=1100

Quad 1	d04=0100	d08=1000	d01=0001	d01=0001
d05=0101	d10=1010	d05=0101	d09=1001
d06=0110	d12=1100	d09=1001	d05=0101
d07=0111	d14=1110	d13=1101	d13=1101

...	...	...	...	...
Quad 3	d14=1110	d13=1101	d11=1011	d07=0111
d15=1111	d15=1111	d15=1111	d15=1111

Now we can be sure that the toggling bits are adjacent and thus they will belong to separate parity groups, one as part of parity-even and one as part of parity-odd. But is the rewrite legal?

For the FFT to get a correct result, the only requirement for Stage 3 is that the pairwise operands op1 and op2 be separated by 2^s = 8, that is

|op2 - op1| = 2^s = 8

By preserving the fastest-toggling bit d₃ as the MSB we guarantee that, within each operand pair,

The MSB toggles 0→1 or 1→0, and
the remaining bits (all of which are toggling more slowly than once per pair) are constant, and thus
the relationship between the two operands op1 and op2 in each pair is such that (|op2 - op1| == 8)

Example for one butterfly unit (B=1)

Below are two examples of the algorithm at work, calculating the schedules for Stages 0 and 3 of a sixteen-point transform using one radix-2 butterfly unit.

Example mapping for stage s=0 of a transform for D=16 datapoints and one butterfly (B=1)

**Example mapping for stage s=0 of a transform for D=16 datapoints and one butterfly (B=1)**
Datapoint d=d₃d₂d₁d₀	Reorder 1 (no change) d' = d₃d₂d₁d₀	Parity P'₁,P'₀	Reorder 2 (replace) d" = d'₃d'₂P'₁P'₀	Parity P"₁,P"₀	Memory bank m = P"₁P"₀
d00= 0 0 0 0 d01= 0 0 0 1 d02= 0 0 1 0 d03= 0 0 1 1	0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 1	0,0 0,1 1,0 1,1	0 0 0 0 = d00 0 0 0 1 = d01 0 0 1 0 = d02 0 0 1 1 = d03	0,0 0,1 1,0 1,1	m0 m1 m2 m3
d04= 0 1 0 0 d05= 0 1 0 1 d06= 0 1 1 0 d07= 0 1 1 1	0 1 0 0 0 1 0 1 0 1 1 0 0 1 1 1	0,1 0,0 1,1 1,0	0 1 0 1 = d05 0 1 0 0 = d04 0 1 1 1 = d07 0 1 1 0 = d06	0,0 0,1 1,0 1,1	m0 m1 m2 m3
d08= 1 0 0 0 d09= 1 0 0 1 d10= 1 0 1 0 d11= 1 0 1 1	1 0 0 0 1 0 0 1 1 0 1 0 1 0 1 1	1,1 1,0 0,1 0,0	1 0 1 1 = d11 1 0 1 0 = d10 1 0 0 1 = d09 1 0 0 0 = d08	0,0 0,1 1,0 1,1	m0 m1 m2 m3
d12= 1 1 0 0 d13= 1 1 0 1 d14= 1 1 1 0 d15= 1 1 1 1	1 1 0 0 1 1 0 1 1 1 1 0 1 1 1 1	1,0 1,1 0,0 0,1	1 1 1 0 = d14 1 1 1 1 = d15 1 1 0 0 = d12 1 1 0 1 = d13	0,0 0,1 1,0 1,1	m0 m1 m2 m3

Example mapping for final stage s=S-1 (Stage 3) of a transform for D=16 datapoints and one butterfly (B=1)

Example mapping for final stage s=S-1 (Stage 3) of a transform
for D=16 datapoints and one butterfly (B=1)
Datapoint d=d₃d₂d₁d₀	Reorder 1 (rotate) d' = d₃d₀d₂d₁	Parity P'₁,P'₀	Reorder 2 (replace) d" = P'₃P'₂d'₁d'₀	Parity P"₁,P"₀	Memory bank m = P"₁P"₀
d00= 0 0 0 0 d08= 1 0 0 0 d01= 0 0 0 1 d09= 1 0 0 1	0 0 0 0 1 0 0 0 0 1 0 0 1 1 0 0	0,0 1,0 0,1 1,1	0 0 0 0 =d00 1 0 0 0 =d08 0 1 0 0 =d04 1 1 0 0 =d12	0,0 1,0 0,1 1,1	m0 m1 m2 m3
d02= 0 0 1 0 d10= 1 0 1 0 d03= 0 0 1 1 d11= 1 0 1 1	0 0 0 1 1 0 0 1 0 1 0 1 1 1 0 1	0,1 1,1 0,0 1,0	0 1 0 1 =d05 1 1 0 1 =d13 0 0 0 1 =d01 1 0 0 1 =d09	0,0 1,0 0,1 1,1	m0 m1 m2 m3
d04= 0 1 0 0 d12= 1 1 0 0 d05= 0 1 0 1 d13= 1 1 0 1	0 0 1 0 1 0 1 0 0 1 1 0 1 1 1 0	1,0 0,0 1,1 0,1	1 0 1 0 =d10 0 0 1 0 =d02 1 1 1 0 =d14 0 1 1 0 =d06	0,0 1,0 0,1 1,1	m0 m1 m2 m3
d06= 0 1 1 0 d14= 1 1 1 0 d07= 0 1 1 1 d15= 1 1 1 1	0 0 1 1 1 0 1 1 0 1 1 1 1 1 1 1	1,1 0,1 1,0 0,0	1 1 1 1 =d15 0 1 1 1 =d07 1 0 1 1 =d11 0 0 1 1 =d03	0,0 1,0 0,1 1,1	m0 m1 m2 m3

Algorithm 1: One radix-2 butterfly (R=2 and B=1)

Given a transform with D datapoints such that D is a power of two and D ≥ 4; [SR:] if the algorithm is to work correctly for one butterfly (B=1), we should have four memory banks (M=4) numbered 0 to 3, or 00₂ to 11₂ i.e. m₁m₀={00₂,01₂,10₂,11₂} i.e. m0=00₂, m1=01₂, m2=10₂ and m3=11₂.

What we're gonna do.

The following equation(s) will show how to transform a given stage's datapoint sequence

(dp_(D-1) dp_(D-2)...dp₁,dp₀)

into an equivalent sequence

(dp'_(D-1) dp'_(D-2)...dp'₁,dp'₀)

such that each datapoint dp'_i maps to a memory bank mb_i where the corresponding memory bank sequence

(mb'_(D-1) mb'_(D-2)...mb'₀,mb'₁)

is the repeating series

(m0,m1,m2,m3, m0,m1,m2,m3 ... m0,m1,m2,m3),

and m0 = memory bank 00₂, m1 = memory bank 01₂, m2 = memory bank 10₂ and m3 = memory bank 11₂.

So:

Step 1: dp → dp'

Each datapoint dp_s,i in a given transform stage s ∈ (0 .. S-1) is an S-bit number

dp_s,i = d_S-1d_S-2...d₁d₀

where i ∈ (0..D-1).

If s = S-1 (final stage), we will replace each datapoint

dp_s,i = d_S-1d_S-2...d₁d₀

with a new datapoint

dp'_s,i = d_S-1d₀d_S-2...d₁

such that the lower S-2 bits of the original datapoint have been right-rotated by one bit.

For all other stages dp' is the same as dp. In other words:

	dp'_s,i = dp_s,i	∀ s ∈ (0 .. S - 2)	and i ∈ (0 .. D-1)
	dp'_s,i = d_S-1d₀d_S-2...d₁	∀ s = (S - 1)	and i ∈ (0 .. D-1)

Step 2: dp' → dp"

Having made this first transformation

dp'_s,i = ƒ(dp_s,i),

in Step 1 above, we can now redefine each dp'_s,i as a new datapoint dp"_s,i

dp"_s,i = d"_S-1d"_S-2...d"₁d"₀

where

if s ∈ (0 .. S-2) then	d"_s = P'_s and d"_(s+1) = P'_(s+1)	and all other d"_j = d'_j , where j ∈ (0..S-1) and j ∉ {s,s+1}
else (final stage s = S-1)	d"_s = P'_s and d"_s-1 = P'_s-1	and all other d"_j = d'_j , where j ∈ (1..S-2)

and

P'₀ = P'_even = Parity(d'_S-2,d'_S-4...d'₂,d'₀)

is the parity of the even bits of dp', that is all those bits d'_j such that j mod 2 = 0; and

P'₁ = P'_odd = Parity(d'_S-1,d'_S-3...d'₃,d'₁)

is the parity of the odd bits of dp', that is all those bits d'_j such that j mod 2 = 1; and

P'_k = P'₀ = P'_even	if k is even and
P'_k = P'₁ = P'_odd	if k is odd

Step 3?: m = P"₁P"0 ?

OOPS maybe need to add step 3 where Memory bank m = P"₁P"₀

Extension of the proof to more than one butterfly unit (B>1)

Having solved the problem for B=1 and M=4, the extension to larger values of B is fairly straightforward.

For our algorithm to work, there must always be four memory banks for each butterfly unit. For one butterfly (B=1) we had four memory banks (M=4) and thus two toggling bits. When there are two butterfly units (B=2) we have eight memory banks (M=8) and thus three toggling bits, and so on, adding one more toggling bit for each doubling of B.

Following the logic we used to develop the algorithm for B=1 above, we can continue to define d' = d for those stages where the toggling bits are adjacent and do not wrap around the high end of the address, as in e.g. stages 0 and 1 of a two-butterfly implementation of a 16-point transform with three toggling bits.

<caption>With three toggle bits, non-adjacency begins in Stage 2</caption>

	Stage 0	Stage 1	Stage 2	Stage 3

Oct 0	d00=0000	d00=0000	d00=0000	d00=0000
d01=0001	d02=0010	d04=0100	d08=1000
d02=0010	d04=0100	d08=1000	d01=0001
d03=0011	d06=0110	d12=1100	d09=1001

d04=0100	d08=1000	d01=0001	d02=0010
d05=0101	d10=1010	d05=0101	d10=1010
d06=0110	d12=1100	d09=1001	d03=0011
d07=0111	d14=1110	d13=1101	d11=1011

...	...	...	...	...
Oct 1	d14=1110	d13=1101	d11=1011	d07=0111
d15=1111	d15=1111	d15=1111	d15=1111

Note, extending the introductory explanation to larger values of B, because we have eight memory banks we now have two "octs" (groups of eight) with three toggle bits each instead of four "quads" (groups of four) with two toggle bits.

Stages 2 and 3 now both have non-adjacent toggle bits, and in each of these stages we have to wrap the lower address bits such that the low toggling bits adjoin the high toggling bits, as in the example below.

<caption>With three toggle bits, non-adjacency must be corrected in Stages 2 and 3</caption>

	(Stages 0 and 1 same as before)		(New/rotated Stages 2 and 3)
	Stage 0	Stage 1	Stage 2	Stage 3

Oct 0	d00=0000	d00=0000	d00=0000	d00=0000
d01=0001	d02=0010	d04=0100	d08=1000
d02=0010	d04=0100	d08=1000	d02=0010
d03=0011	d06=0110	d12=1100	d10=1010

d04=0100	d08=0001	d02=0010	d04=0100
d05=0101	d10=0011	d06=0110	d12=1100
d06=0110	d12=0101	d10=1010	d06=0110
d07=0111	d14=0111	d14=1110	d14=1110

...	...	...	...	...
Oct 1	d14=1110	d13=1101	d11=1011	d07=0111
d15=1111	d15=1111	d15=1111	d15=1111

Our original discussion for a one-butterfly FFT talked about quads mapping to four memory banks. For more than one butterfly we update this discussion with octs that map to eight memory banks, in the case of two butterfly units, or sixteens mapping to sixteen banks when B=4, etc. The algorithm works so long as there are four memory banks for every butterfly unit, i.e. M=4B.

Similarly, when B=2 (two butterflies) we have three toggle bits instead of two, and when B=4 we have four toggle bits, and so on. And instead of simply having odd and even parity, as in the original case where B=1, we need to define some new terms for parity, that is

P3₀ = the parity of every third bit including bit b₀, i.e. P3₀ = XOR(b₀, b₃, b₆, ...)
P3₁ = the parity of every third bit including bit b₁, i.e. P3₁ = XOR(b₁, b₄, b₇, ...)

or, more generally,

Pn_i = the parity of every nth bit b_j such that (j mod n) = (i mod n)

Note that, by this definition,

P_even = P2₀ = P2₂ = P2₄ = ...
P_odd = P2₁ = P2₃ = P2₅ = ...
P3₀ = P3₃ = P3₆ = P3₉ = ...
and so on

In the case where there was one butterfly unit (B=1), we could canonicalize the bank order by replacing the two toggle bits t_i and t_i+1 with parity bits P2_i and P2_i+1.

In the case where there are two butterfly units (B=2), there are now three toggle bits t_i, t_i+1 and t_i+2; to canonicalize the bank order (eight banks now instead of four), we replace these toggle bits with parity bits P3_i, P2_i+1 and P2_i+2

In general, then, when B=2ⁿ, there will be 4B memory banks and n+1 toggle bits t_it_i+1t_i+2..t_i+n which we replace with parity bits Pn_iPn_i+1Pn_i+2..Pn_i+n in order to remap the operand addresses such that each sequential group of M operands in a given stage have the exact same memory-bank sequence.

Example for more than one butterfly unit (B=2)

Below are two examples of the algorithm at work, calculating the schedules for Stages 0 and 3 of a sixteen-point transform using two radix-2 butterfly units.

Example mapping for stage s=0 of a transform for D=16 datapoints and two butterflies (B=2)

<caption>Example mapping for stage s=0 of a transform for D=16 datapoints and two butterflies (B=2)</caption>

Datapoint d=d₃d₂d₁d₀	Reorder 1 (no change) d' = d₃d₂d₁d₀	Parity (P=P3) P'₂,P'₁,P'₀	Reorder 2 (replace) d" = d'₃P'₂P'₁P'₀	Parity (P=P3) P"₂,P"₁,P"₀ ! Memory bank m = P"₂P"₁P"₀
d00= 0 0 0 0 d01= 0 0 0 1 d02= 0 0 1 0 d03= 0 0 1 1	0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 1	0,0,0 0,0,1 0,1,0 0,1,1	0 0 0 0 = d00 0 0 0 1 = d01 0 0 1 0 = d02 0 0 1 1 = d03	0,0,0 0,0,1 0,1,0 0,1,1	m0 m1 m2 m3
d04= 0 1 0 0 d05= 0 1 0 1 d06= 0 1 1 0 d07= 0 1 1 1	0 1 0 0 0 1 0 1 0 1 1 0 0 1 1 1	1,0,0 1,0,1 1,1,0 1,1,1	0 1 0 0 = d04 0 1 0 1 = d05 0 1 1 0 = d06 0 1 1 1 = d07	1,0,0 1,0,1 1,1,0 1,1,1	m4 m5 m6 m7
d08= 1 0 0 0 d09= 1 0 0 1 d10= 1 0 1 0 d11= 1 0 1 1	1 0 0 0 1 0 0 1 1 0 1 0 1 0 1 1	0,0,1 0,0,0 0,1,1 0,1,0	1 0 0 1 = d09 1 0 0 0 = d08 1 0 1 1 = d11 1 0 1 0 = d10	0,0,0 0,0,1 0,1,0 0,1,1	m0 m1 m2 m3
d12= 1 1 0 0 d13= 1 1 0 1 d14= 1 1 1 0 d15= 1 1 1 1	1 1 0 0 1 1 0 1 1 1 1 0 1 1 1 1	1,0,1 1,0,0 1,1,1 1,1,0	1 1 0 1 = d13 1 1 0 0 = d12 1 1 1 1 = d15 1 1 1 0 = d14	1,0,0 1,0,1 1,1,0 1,1,1	m4 m5 m6 m7

Example mapping for final stage s=S-1 (Stage 3) of a transform for D=16 datapoints and two butterflies (B=2)

<caption>Example mapping for final stage s=S-1 (Stage 3) of a transform<br /> for D=16 datapoints and two butterflies (B=2)</caption>

Datapoint d=d₃d₂d₁d₀	Reorder 1 (rotate) d' = d₃d₁d₀d₂	Parity (P=P3) P'₂,P'₁,P'₀	Reorder 2 (replace) d" = P'₃P'₂P'₁d'₀	Parity (P=P3) P"₂,P"₁,P"₀ ! Memory bank m = P"₂P"₁P"₀
d00= 0 0 0 0 d08= 1 0 0 0 d01= 0 0 0 1 d09= 1 0 0 1	0 0 0 0 1 0 0 0 0 0 1 0 1 0 1 0	0,0,0 0,0,1 0,1,0 0,1,1	0 0 0 0 =d00 1 0 0 0 =d08 0 0 1 0 =d02 1 0 1 0 =d10	0,0,0 0,0,1 0,1,0 0,1,1	m0 m1 m2 m3
d02= 0 0 1 0 d10= 1 0 1 0 d03= 0 0 1 1 d11= 1 0 1 1	0 1 0 0 1 1 0 0 0 1 1 0 1 1 1 0	1,0,0 1,0,1 1,1,0 1,1,1	0 1 0 0 =d04 1 1 0 0 =d12 0 1 1 0 =d06 1 1 1 0 =d14	1,0,0 1,0,1 1,1,0 1,1,1	m4 m5 m6 m7
d04= 0 1 0 0 d12= 1 1 0 0 d05= 0 1 0 1 d13= 1 1 0 1	0 0 0 1 1 0 0 1 0 0 1 1 1 0 1 1	0,0,1 0,0,0 0,1,1 0,1,0	1 0 0 1 =d09 0 0 0 1 =d01 1 0 1 1 =d11 0 0 1 1 =d03	0,0,0 0,0,1 0,1,0 0,1,1	m0 m1 m2 m3
d06= 0 1 1 0 d14= 1 1 1 0 d07= 0 1 1 1 d15= 1 1 1 1	0 1 0 1 1 1 0 1 0 1 1 1 1 1 1 1	1,0,1 1,0,0 1,1,1 1,1,0	1 1 0 1 =d13 0 1 0 1 =d05 1 1 1 1 =d15 0 1 1 1 =d07	1,0,0 1,0,1 1,1,0 1,1,1	m4 m5 m6 m7

Algorithm 2: More than one radix-2 butterfly (R=2 and B>1)

Given a transform with D datapoints such that D is a power of two and D ≥ 4; [SR:] if the algorithm is to work correctly for any power-of-two number of butterfly units B=2ⁿ, we should have 4B memory banks (M=4B) numbered 0 to (4B-1), or 00...00₂ to 11...11₂ i.e.

Bank Name	Bank Number m	Bank Bits m_n+1m_nm_n-1...m₀
m0	(m=0)	00...00₂ = 0
m1	(m=1)	00...01₂ = 1
↓	↓	↓
mn	(m=4B-2)	11...10₂ = 4B-2
m(n+1)	(m=4B-1)	11...11₂ = 4B-1

Each operand address dp_s,i in a given stage s has T toggle bits where T=log₂M

What we're gonna do (B>1)

The following equation(s) will show how to transform a given stage's datapoint sequence

(dp_(D-1) dp_(D-2)...dp₁,dp₀)

into an equivalent sequence

(dp'_(D-1) dp'_(D-2)...dp'₁,dp'₀)

such that each datapoint dp'_i maps to a memory bank mb_i where the corresponding memory bank sequence

(mb'_(D-1) mb'_(D-2)...mb'₀,mb'₁)

is the repeating series

(

(m0,m1,m2 ... m(4B-1)),

...)

and m0 = memory bank 00...00₂, m1 = memory bank 00...01₂, m(4B-2) = memory bank 11...10₂ and m(4B-1) = memory bank 11...11₂.

So:

Step 1: dp → dp' (B>1)

Each datapoint dp_s,i in a given transform stage s ∈ (0 .. S-1) is an S-bit number

dp_s,i = d_S-1d_S-2...d₁d₀

where i ∈ (0..D-1).

If s ∈ (S-(T-1)..S-1) (i.e. final T-1 stages), we will replace each datapoint

dp_s,i = (d_S-1d_S-2...d_s)(d_s-1d_s-2...d_T-1)(d_T-2d_T-3...d₀)

with a new datapoint

dp'_s,i = (d_S-1d_S-2...d_s)(d_T-2d_T-3...d₀)(d_s-1d_s-2...d_T-1)

such that the lower s-1 bits of the original datapoint have been right-rotated by (T + (s-S)) bits.

For all other stages dp' is the same as dp. In other words:

	dp'_s,i = dp_s,i	∀ s ∈ (0 .. S-T)	and i ∈ (0 .. D-1)
	dp'_s,i = (d_S-1d_S-2...d_s)(d_T-2d_T-3...d₀)(d_s-1d_s-2...d_T-1)	∀ s ∈ (S - (T-1) .. S-1)	and i ∈ (0 .. D-1)

Step 2: dp' → dp" (B>1)

Having made this first transformation

dp'_s,i = ƒ(dp_s,i),

in Step 1 above, we can now redefine each dp'_s,i as a new datapoint dp"_s,i

dp"_s,i = d"_S-1d"_S-2...d"₁d"₀

where

if s ∈ (0 .. S-T) then	d"_sd"_s+1d"_s+2..d"_s+(T-1) = P'_sP'_s+1P'_s+2..P'_s+(T-1)	and all other d"_j = d'_j
else (final T-1 stages)	d"_sd"_{s -1}d"_{s -2}..d"_{s -(T-1)} = P'_sP'_{s -1}P'_{s -2}..P'_{s -(T-1)}	and all other d"_j = d'_j

and

P'_i = PT_i = the parity of every T^th bit d'_j such that (j mod T) = (i mod T)

e.g. when T=3 then

P'₀ = P3₀ = XOR(d'₀, d'₃, d'₆, ...)

P'₁ = P3₁ = XOR(d'₁, d'₄, d'₇, ...)

P'₂ = P3₂ = XOR(d'₂, d'₅, d'₈, ...)

P'₃ = P3₀ = XOR(d'₀, d'₃, d'₆, ...)

etc.

Extension to radix R>2, longer pipelines

Extension to R>2

I think the case for radix R>2 is exactly analogous to the case where B>1. That is, for e.g. a one-butterfly radix-four operation (B=1 and R=4), the operands are grouped by four in exactly the way that they would be grouped for a two-butterfly radix-two operation (B=2 and R=2). In other words, the grouping for [TODO/FIXME]

For radix-2 operation we needed M=4B memory banks for groups of 4B operands. In the general case, where R can be any power of 2 greater than or equal to 2, we need M=4B(R-2) memory banks. Note also that S and s must change appropriately; the remainder of the proof for radix R>2 is left as an exercise to the reader.

Extension to longer pipelines.

Oops still need to write something here maybe.

Nomenclature and Terminology

Nomenclature (Symbols)

We will try to use symbols consistently throughout this writeup, as described by this alphabetically-ordered reference table.

Italicized symbols (B, D, d, M) represent numbers, non-italicized symbols (d0, m3) are names.

lower-case symbols (d, m) are variable, UPPER-CASE symbols (D, M) are constant (for a given FFT).

Symbol	Meaning	Notes
B	How many butterfly units	A power of two, e.g. 8, 16, 1024 or 8096. To complete a transform, an FFT with e.g. one butterfly unit (B=1) takes about four times as long as an FFT with four butterfly units (B=4).
D	How many datapoints	Number of datapoints to transform. A power of two, e.g. 8, 16, 1024 or 8096.
d	Datapoint	An eight-point transform has datapoints d=0 through d=7.
di	A specific datapoint	A specific datapoint i. For example, d4 is the datapoint d such that d == 4. An eight-point transform has datapoints d0 through d7 (see sched(s) below).
d_i	Datapoint bit number i	d₀ is always the least-significant bit (LSB). Example: datapoint d13 has four bits d₃d₂d₁d₀ = 1101₂
dp_s,i	Datapoint i in stage s	Same as sched(s)_i (see below). Each datapoint dp_s,i in a given transform stage s ∈ (0 .. S-1) is an S-bit number dp_s,i = d_S-1d_S-2...d₁d₀, where i ∈ (0..D-1).
dp_i	dp_s,i with implied s	Shorthand for dp_s,i where exact stage s is implied by surrounding context.
i, j, k, n	Unknown integer	We usually start numbering from i=0
M	How many memory banks	For purposes of this proof M == 4B always.
m	Memory bank	Generally need four memory banks per butterfly e.g. m=0 through m=7 for two butterflies (B=2).
mi	A specific memory bank	Generally need four memory banks per butterfly e.g. m0 through m7 for two butterflies (B=2).
m_i	Memory bank bit number i	Generally need four memory banks per butterfly e.g. m₂m₁m₀={000,001 ... 111} for two butterflies (B=2).
Mbits	Bits per bank number	How many bits in each mem bank number m; Mbits = log₂(M)
mb_s,i	Memory bank of dp_s,i	The memory bank corresponding to datapoint i of stage s.
mb_i	mb_s,i with implied s	Shorthand for mb_s,i where exact stage s is implied by surrounding context. For a one-butterfly eight-point transform, our algorithm results in three stages Stage 0: (mb₀,mb₁,mb₂,mb₃,mb₄,mb₅,mb₆,mb₇) = (m0,m1,m2,m3, m0,m1,m2,m3) Stage 1: (mb₀,mb₁,mb₂,mb₃,mb₄,mb₅,mb₆,mb₇) = (m0,m2,m1,m3, m0,m2,m1,m3) Stage 2: (mb₀,mb₁,mb₂,mb₃,mb₄,mb₅,mb₆,mb₇) = (m0,m1,m2,m3, m0,m1,m2,m3)
P_od, P_odd	Parity of odd bits	Example: decimal 13 has four bits d₃d₂d₁d₀ = 1101; thus P_od is parity(d₃,d₁) = parity(1,0) = 1 Note P_odd is the same as P2₁ (below).
P_ev, P_even	Parity of even bits	Example: decimal 13 has four bits d₃d₂d₁d₀ = 1101; thus P_ev is parity(d₂,d₀) = parity(1,1) = 0 Note P_even is the same as P2₀ (below).
Pn_i	Parity of periodic bits	The parity of every nth bit b_j such that (j mod n) = (i mod n) Example: P3₀ = the parity of every third bit including bit b₀, i.e. P3₀ = XOR(b₀, b₃, b₆, ...) Note that, by this definition, P3₀ = P3₃ = P3₆ = P3₉ and so on.
R	Radix	Default is 2
S	How many stages	Number of stages in a transform, e.g. an eight-point transform has three stages, so S=3
s	Stage number	An eight-point transform has three stages s=0, s=1 and s=2.
sched(s)	Schedule for stage s	The FFT schedule for stage s of a transform. E.g. for an 8-point transform there are three stages sched(0) = (d0,d1,d2,d3,d4,d5,d6,d7) sched(1) = (d0,d2,d4,d6,d1,d3,d5,d7) and sched(2) = (d0,d4,d1,d5,d2,d6,d3,d7)
sched(s)_i	i^th element of sched(s)	The i^th element of sched(s), e.g. in an 8-point transform, sched(1)₁ = d2 (see above).
T	How many toggle bits	The number of toggle bits in each datapoint address. T == log₂M
t_i	Toggle bit number i	t₀ is always the fastest-toggling (least significant) toggle bit. Example: a one-butterfly eight-point transform has two toggle bits. In Stage 0, t₁t₀ = d₁d₀; in Stage 1, t₁t₀ = d₂d₁; and in Stage 2, t₁t₀ = d₀d₂.

Terminology

Term	Meaning	Notes
LSB	Least-significant bit	The least-significant bit in the binary representation of a given number. All even integers have LSB = 0 and all odd integers have LSB = 1.
MSB	Most-significant bit	The most-significant bit of a number depends on the number of bits used to represent the binary form of that number. E.g. if 20₁₀ is represented using five bits 10100₂, the MSB is 1, but if it is represented as eight bits 0001 0100₂ the MSB is 0.
oct?	group of eight? aligned vs. unaligned?
quad?	group of four? aligned vs. unaligned?
others?

Appendix: Perl program that implements the algorithm

The algorithm as implemented in the Perl test program (FIXME/TODO: NOTE BENE THIS IS NOT CORRECT, NEEDS SEVERE UPDATING)

&#35; Normal treatment for stages 0 through nstages&#45;2
if ($s &#33;&#61; ($nstages&#45;1)) &#123;
    $op1 &#61; smart_replace($op1, $s)&#59;
    $op1 &#61; smart_replace($op1, $s+1)&#59;

    $op2 &#61; smart_replace($op2, $s)&#59;
    $op2 &#61; smart_replace($op2, $s+1)&#59;
&#125;

&#35; Last stage nstages&#45;1 needs special treatment.
elsif ($s%2 &#61;&#61; 1) &#123;

    &#35; Even number of bits&#59; use zero instead of s+1 maybe.
    $op1 &#61; smart_replace($op1, $s)&#59;
    $op1 &#61; smart_replace($op1, 0)&#59;

    $op2 &#61; smart_replace($op2, $s)&#59;
    $op2 &#61; smart_replace($op2, 0)&#59;
&#125;
else &#123;
    &#35; Odd number of bits&#59; gotta do sumpm crazy maybe.
    $op1 &#61; weird_replace($op1, $s)&#59;
    $op2 &#61; weird_replace($op2, $s)&#59;
&#125;

&#35; Swizzle to assign correct bank &amp; row to $op1, $op2
my ($b1, $r1) &#61; swizzler&#58;&#58;swizzle($op1, $npoints, 4&#42;$nunits)&#59;
my ($b2, $r2) &#61; swizzler&#58;&#58;swizzle($op2, $npoints, 4&#42;$nunits)&#59;

&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;&#35;
sub smart_replace &#123;
    &#35; In the given number $n, replace bit $i with new bit $b
    &#35; If $i is  odd, new bit $b is the parity of the  odd bits of $n
    &#35; If $i is even, new bit $b is the parity of the even bits of $n
    &#35;
    my $n &#61; shift&#59;
    my $i &#61; shift&#59;

    my $newbit &#61; $i%2 ? parity_odd($n) &#58; parity_even($n)&#59;
    $n &#61; replace_bit($n, $i, $newbit)&#59;    &#35; Put $newbit into $n at position $i
    return $n&#59;
&#125;

sub weird_replace &#123;
    &#35; Given number $n and ultimate stage $s, where $s is even, do some weird $&#35;&#33;?
    my $n &#61; shift&#59;
    my $s &#61; shift&#59;

    &#35; This is supposed to work at least for when nbits(datapoints) is odd.  i think.
    my $b0 &#61; ($n &amp; 0x1)&#59;
    my $b1 &#61; ($n &amp; 0x2) &gt;&gt; 1&#59;

    my $new_top_bit &#61; parity_even($n) ^ $b1&#59;
    my $new_b1 &#61; parity_odd($n) ^ $b1 ^ $b0&#59;
    my $new_b0 &#61; $b1 ^ $b0&#59;  &#35; Complicated enough yet&#33;?

    $n &#61; replace_bit($n,$s,$new_top_bit)&#59; &#35; Put $new_top_bit into $n at position $s
    $n &#61; replace_bit($n, 1,$new_b1)&#59;      &#35; Put $new_b1 into $n at position 1
    $n &#61; replace_bit($n, 0,$new_b0)&#59;      &#35; Put $new_b0 into $n at position 0 (LSB right?)

    return $n&#59;
&#125;

Major caveat(s), to-do and notes

Caveat: The Stanford FFT generator, which I authored, may not yet actually use the algorithm that results directly from this proof. But I plan to do that eventually...!

Todo:

present to the group and invite comments including where/whether to submit to journal, conference, etc.
update the Perl program in the appendix.
change to latex format for publication (replace html page with a latex page).
make sure FFT generator (correctly) uses algorithm described in the formal proof!

Other Notes, Trash and Auxiliary Info

Projects: FFT Formal Proof v0 (old) - Old formal proof with separate boundary conditions for odd and even number of bits, etc.

Projects: FFT Formal Proof - Auxiliary Info

Projects: FFT_Formal_Proof - Auxiliary_Info#Trash

Original hand-written notes

Hand-written notes where the proof originally was developed:

https://vlsiweb.stanford.edu/mediawiki/index.php/File:Formal_proof_notes.pdf

Projects: FFT Formal Proof - StanfordVLSI/Genesis2 GitHub Wiki

Table of Contents

DO NOT READ THIS PAPER!!!

The Problem

Developing the proof for one butterfly unit (B=1)

Key Insights

Insight 1: We can ensure that each of the four memory banks m0, m1, m2 and m3 are represented in every aligned quad, in some order

Insight 2: We can reorder the datapoints such that each quad contains exactly four memory banks m0, m1, m2 and m3 in a consistent order

Summary

Boundary condition: non-adjacent toggle bits

The final-stage problem of non-adjacent toggle bits

Solving the final-stage problem of non-adjacent toggle bits

Example for one butterfly unit (B=1)

Example mapping for stage s=0 of a transform for D=16 datapoints and one butterfly (B=1)

Example mapping for final stage s=S-1 (Stage 3) of a transform for D=16 datapoints and one butterfly (B=1)

Algorithm 1: One radix-2 butterfly (R=2 and B=1)

What we're gonna do.

Step 1: dp → dp'

Step 2: dp' → dp"

Step 3?: m = P"₁P"0 ?

Extension of the proof to more than one butterfly unit (B>1)

Example for more than one butterfly unit (B=2)

Example mapping for stage s=0 of a transform for D=16 datapoints and two butterflies (B=2)

Example mapping for final stage s=S-1 (Stage 3) of a transform for D=16 datapoints and two butterflies (B=2)

Algorithm 2: More than one radix-2 butterfly (R=2 and B>1)

What we're gonna do (B>1)

Step 1: dp → dp' (B>1)

Step 2: dp' → dp" (B>1)

Extension to radix R>2, longer pipelines

Extension to R>2

Extension to longer pipelines.

Nomenclature and Terminology

Nomenclature (Symbols)

Terminology

Appendix: Perl program that implements the algorithm

Major caveat(s), to-do and notes

Other Notes, Trash and Auxiliary Info

Original hand-written notes

⚠️ GitHub.com Fallback ⚠️

Projects: FFT Formal Proof - StanfordVLSI/Genesis2 GitHub Wiki

Table of Contents

DO NOT READ THIS PAPER!!!

The Problem

Developing the proof for one butterfly unit (B=1)

Key Insights

Insight 1: We can ensure that each of the four memory banks m0, m1, m2 and m3 are represented in every aligned quad, in some order

Insight 2: We can reorder the datapoints such that each quad contains exactly four memory banks m0, m1, m2 and m3 in a consistent order

Summary

Boundary condition: non-adjacent toggle bits

The final-stage problem of non-adjacent toggle bits

Solving the final-stage problem of non-adjacent toggle bits

Example for one butterfly unit (B=1)

Example mapping for stage s=0 of a transform for D=16 datapoints and one butterfly (B=1)

Example mapping for final stage s=S-1 (Stage 3) of a transform for D=16 datapoints and one butterfly (B=1)

Algorithm 1: One radix-2 butterfly (R=2 and B=1)

What we're gonna do.

Step 1: dp → dp'

Step 2: dp' → dp"

Step 3?: m = P"1P"0 ?

Extension of the proof to more than one butterfly unit (B>1)

Example for more than one butterfly unit (B=2)

Example mapping for stage s=0 of a transform for D=16 datapoints and two butterflies (B=2)

Example mapping for final stage s=S-1 (Stage 3) of a transform for D=16 datapoints and two butterflies (B=2)

Algorithm 2: More than one radix-2 butterfly (R=2 and B>1)

What we're gonna do (B>1)

Step 1: dp → dp' (B>1)

Step 2: dp' → dp" (B>1)

Extension to radix R>2, longer pipelines

Extension to R>2

Extension to longer pipelines.

Nomenclature and Terminology

Nomenclature (Symbols)

Terminology

Appendix: Perl program that implements the algorithm

Major caveat(s), to-do and notes

Other Notes, Trash and Auxiliary Info

Original hand-written notes

⚠️ **GitHub.com Fallback** ⚠️

Step 3?: m = P"₁P"0 ?

⚠️ GitHub.com Fallback ⚠️