Table of Contents Major caveat(s) The Problem Key Insights Insight 1: We can ensure that each of the four memory banks m0, m1, m2 and m3 are represented in every quad, in some order Insight 2: We can reorder the quads such that each one contains the four memory banks m0, m1, m2 and m3 in a consistent order Boundary condition, even number of bits Algorithm 1: One radix-2 butterfly (R==2 and B==1) and even number of address bits SCRUBBED TO HERE Example mapping for stage s=0 of a transform for N=16 datapoints TODO/NEXT/CONTINUE FROM HERE ------------------------------------------- Boundary condition, odd number of bits Final equation for when nbits is odd and R==2 and B==1 Algorithm 2: Final equation(s) for R=2 and B=1 The ... REST ... of the story? Extension to more than one butterfly unit (B > 1). Generalization of Algorithm 1, extended for any B=2ⁱ and any number of address bits S: mod_MbitsS=0 everything from here down is wrong, i think; should probably instead copy and modify finished algorithm 1 from up above Extension to radix R > 2. Extension to longer pipelines. Nomenclature (Symbols) Notes/TODO Housekeeping/Schedule (or: how/when I'm gonna be done with this) Housekeeping/TODO list TODO: LaTeX TODO: Notification list TODO: Nomenclature TODO: FFT Generator Notes Other Notes, Trash and Auxiliary Info Developing the basic equation for one radix-2 butterfly (R==2 and B==1) The equation so far Example another trash block another trash block

Major caveat(s)

Caveat 1: The proof is obviously unfinished. I believe there is enough information in what's been written so far to properly construct the proof, but clearly I haven't yet made the time to do that myself. But I'm actively working on it! Come back later if you want to see the complete final proof. Or send me mail and I can notify you when it's done.

Caveat 2: The Stanford FFT generator, which I authored, may not yet actually use the algorithm that results directly from this proof. But I plan to do that eventually...!

The Problem

A detailed description of the problem can be found in the 2013 VLSI-SoC paper

An area-efficient minimum-time FFT schedule using single-ported memory, Richardson, Shacham, Markovic and Horowitz. (To appear in) Proceedings of the 21st IFIP/IEEE International Conference on VLSI (VLSI-SoC), 2013.

To summarize the problem here: take a standard schedule of operations for a Fast Fourier Transform processor (FFT) based on a radix-2 Cooley-Tukey algorithm, e.g. for N=8 datapoints and one butterfly unit, as shown in Figure 1.

FIGURE 1

That is, suppose we have eight datapoints d0 through d7 that we want to transform from the time domain to the frequency domain. In Stage 0 of operation, a radix-2 butterfly unit reads datapoints d0 and d1, producing two results that then overwrite the original locations of d0 and d1 in memory. Similar operations occur for datapoint pairs (d2,d3), (d4,d5) and (d6,d7) respectively. In Stage 1, the datapoints are processed with stride-two pairings e.g. (d0,d2), (d4,d6), (d1,d3) and (d5,d7). And finally Stage 2 completes the operation, re-walking the data by stride 4. Table 1 summarizes this access pattern.

 TABLE 1
 Stage 0: (d0,d1) (d2,d3) (d4,d5) (d6,d7)
 Stage 1: (d0,d2) (d4,d6) (d1,d3) (d5,d7)
 Stage 2: (d0,d4) (d1,d5) (d2,d6) (d3,d7)

If the butterfly unit is pipelined such that result writebacks occur at the same time as operand reads, then datapoints (d0,d1) get written to memory at the same time as (d2,d3) are being read from memory. For memory to be able to supply all four datapoints at once, it needs to be four-ported. Alternatively, we could try to use two separate banks of two-ported memory or even four banks of single-ported memory. Of these three choices, four banks of single-ported memory is most efficient in terms of area, so we'd like to make that work if possible.

Now: Within any stage, look at any sequential group of four operands starting on a mod-two boundary. We'll call such a group a quad. If the quad happens to start on a mod-four boundary, we'll call this an aligned quad. Stage 0 has three overlapping quads: (d0,d1,d2,d3), (d2,d3,d4,d5) and (d4,d5,d6,d7), the outside two of which are aligned quads. Our goal is to map each datapoint to one of four memory banks (m0,m1,m2,m3) such that every quad contains exactly one of each bank number. For example, we could map datapoints (d0,d1,d2,d3,d4,d5,d6,d7) to memory banks (m0,m1,m2,m3,m0,m1,m2,m3) respectively, and thus fulfill our goal for each of the three quads in Stage 0:

(d0,d1,d2,d3) maps to (m0,m1,m2,m3)
(d2,d3,d4,d5) maps to (m2,m3,m0,m1)
(d4,d5,d6,d7) maps to (m0,m1,m2,m3)

The first quad in Stage 1, however, fails because its datapoints (d0,d2,d4,d6) would map to (m0,m2,m0,m2). Banks m1 and m3 are not represented in the quad, while m0 and m2 each appear twice.

We're going to fulfill our goal by enforcing an order to each aligned quad, such that every aligned quad within a stage has the same memory bank ordering e.g. (m0,m1,m2,m3). This means that every unaligned quad will also have a consistent ordering, e.g. (m2,m3,m1,m0), and thus we meet our goal of uniqueness for all quads. This property will hold whether the transform is operating on eight datapoints, with two aligned quads per stage, or eight thousand datapoints, with two thousand quads per stage.

In theory the four operations in a given stage all happen simultaneously. In practice, it's not always practical to have enough butterfly units for that. So for instance, if we only have one butterfly unit, it will have to be used four times in succession to complete the transformation for a given stage, using the following four operations:

read, process and writeback datapoints d0 and d1;
read, process and writeback datapoints d2 and d3;
read, process and writeback datapoints d4 and d5;
read, process and writeback datapoints d6 and d7.

In pipelined operation, the writeback of each previous data pair occurs at the same time as the read of each new pair of datapoints. Thus, for example datapoints (d0,d1,d2,d3) will all have to be accessible at the same time within the same cycle, as will the datapoints in every quad. We want to use single-ported memory for the datapoints, thus our requirement that the datapoints in each quad must all live in separate memory banks (and assuming, of course, that each memory bank has its own separate path to the butterfly unit).

Key Insights

Insight 1: We can ensure that each of the four memory banks m0, m1, m2 and m3 are represented in every quad, in some order

Once again, here is our schedule for N=8 datapoints:

Stage 0: (d0,d1) (d2,d3) (d4,d5) (d6,d7)
Stage 1: (d0,d2) (d4,d6) (d1,d3) (d5,d7)
Stage 2: (d0,d4) (d1,d5) (d2,d6) (d3,d7)

Looking carefully at the quads, we see that they all share a unique property. For example, consider the quad represented by the four operands in the last two operations of Stage 1: (d1,d3),(d5,d7). Looking at the binary representation of the indices within any such group of four, only two bits change, the bits d_s+1 and d_s where s is the stage number and bit d₀ is the least significant bit, or LSB. We call these two bits the toggling bits, and they always count 00, 01, 10, 11 within an aligned quad. In our example, the binary indices for (d1,d3),(d5,d7) are (00001,00011),(00101,00111). The toggling bits d₂ and d₁ count (00,01),(10,11) while the non-toggling bits d₄d₃d₀ are always 001.

Given this information, we can assign banks to operands such that banknum = PoPe where Po is the parity of ALL odd bits and Pe is the parity of ALL even bits. This means that any given sequential block of four operands will map to four separate banks.

In our example quad [(d1,d3),(d5,d7)], or [(00001,00011),(00101,00111)]

d1: Po(00001) = parity(0,0) = 0; Pe(00001) = parity(0,0,1) = 1; PoPe = 01; banknum m = 1
d3: Po(00011) = parity(0,1) = 1; Pe(00011) = parity(0,0,1) = 1; PoPe = 11; banknum m = 3
d5: Po(00101) = parity(0,0) = 0; Pe(00101) = parity(0,1,1) = 0; PoPe = 00; banknum m = 0
d7: Po(00111) = parity(0,1) = 1; Pe(00111) = parity(0,1,1) = 0; PoPe = 10; banknum m = 2

Insight 2: We can reorder the quads such that each one contains the four memory banks m0, m1, m2 and m3 in a consistent order

When the stage number s is even, then Pe within any group of four will be either (0,1,0,1) or (1,0,1,0) depending on the (constant 0 or 1) parity of the non-toggling bits. Why? Because we know that bit d_s within any group of four always counts (0,1,0,1) while the remaining (non-toggling) even bits remain constant.

Similarly, Po will be either (0,0,1,1) or (1,1,0,0). Bank number PoPe for a quad can thus be one of the following four groups: (00,01,10,11); (01,00,11,10); (10,11,00,01); or (11,10,01,00).

When the stage number is odd, we know that b(s) is an odd bit and thus it is Po that will be either (0,1,0,1) or (1,0,1,0) while Pe is (0,0,1,1) or (1,1,0,0). Bank number PoPe for a quad will thus be one of: (00,10,01,11); (01,11,00,10); (10,00,11,01); or (11,01,10,00).

Suppose, that, when stage number s is even and when Pe is going to count (1,0,1,0), we change d_s (a component of Pe because s is even) such that it counts (1,0,1,0) instead of (0,1,0,1). In other words, instead of (d1,d3,d5,d7), where d₁ goes (0,1,0,1) we change the sequence to (d5,d7,d1,d3), so that d₁ goes (1,0,1,0). Is it okay to change the sequence like this? Yes it is because this is a valid reordering, as discussed above. And now Pe for the new ordering counts (0,1,0,1) instead of the original (1,0,1,0).

Further suppose that, when stage number s is even and when Po is about to count (1,1,0,0), we change d_s+1 (a component of Po, because we know s+1 is odd) such that it counts (1,1,0,0) instead of (0,0,1,1). (In our example sequence (d1,d3,d5,d7) we would not need to make this change, because Po already counts the way we want it to.) In general, is it okay to change the sequence like this when needed? Yes it is because within any given group of four, the order of operations is unimportant; also the order of pairs within a given operation is unimportant. In other words, looking again at the last four operands in Stage 1 of our example above, the given order is (d1,d3)(d5,d7) but all EIGHT of the following possible sequences are equally valid:

 * (d1,d3)(d5,d7);  (d1,d3)(d7,d5);  (d3,d1)(d5,d7);  (d3,d1)(d7,d5); (d5,d7)(d1,d3);  (d5,d7)(d3,d1);  (d7,d5)(d1,d3);  (d7,d5)(d3,d1).

Let's define P_i such that

P_i = Pe when i is even, and
P_i = Po when i is odd

Now we see that WE CAN ALWAYS REPLACE d_s+1 AND d_s WITH P_s+1 AND P_s RESPECTIVELY TO CANONICALIZE THE BANK ORDER. In particular, given this transformation, banks in an aligned even-stage quad will always count (00,01,10,11) and the banks in an aligned odd-stage quad will always count (00,10,01,11).

In our example, (d1,d3,d5,d7) is part of an odd stage; Po=(0,1,0,1) and Pe=(1,1,0,0). Because Pe counts incorrectly (1,1,0,0) instead of (0,0,1,1), we replace b(s+1) with Pe to come up with a new ordering (d5,d7,d1,d3). Now Po=(0,1,0,1) and Pe=(0,0,1,1) and final bank number assignments read (00,10,01,11), which is the correct canonical order for an odd stage.

Does this make sense yet?

This is the core of the proof. However, we still need to place it in a formal context. Also, we must discuss:

the special cases for the boundary condition when s and s+1 wrap around at the end, that is when s == S-1.
how the proof extends to any power-of two values for B, N or r maybe;
and, for advanced work, we should mention how the proof extends to any pipeline depth.

Boundary condition, even number of bits

Note that our insight regarding quads changes slightly in the final stage of operation, when s = S-1. In all previous stages up to that point, the toggling bits are adjacent (see example below).

Stage 0	Stage 1	Stage 2	Stage 3

d0=0000	d0=0000	d0=0000	d0=0000
d1=0001	d2=0010	d2=0100	d4=1000
d2=0010	d4=0100	d4=1000	d1=0001
d3=0011	d6=0110	d6=1100	d5=1001

d4=0100	d1=0001	d1=0001	d2=0010
d5=0101	d3=0011	d3=0101	d6=1010
d6=0110	d5=0101	d5=1001	d3=0011
d7=0111	d7=0111	d7=1101	d7=1011

...	...	...	...
d7=1110	d7=1101	d7=1011	d7=0111
d7=1111	d7=1111	d7=1111	d7=1111

At the final stage s=S-1, however, the toggling bits wrap around, such that the toggling bits are the LSB and the MSB respectively, i.e. bits d₀ and d_s instead of the usual bits d_s+1 and d_s (because in this case d_s+1 does not exist).

This wrap works fine if the number of bits is even, in which case d₀ is naturally part of P_e and d_s part of P_o, same as in the case where the toggling bits are adjacent. Later we will discuss the case where then number of bits is odd.

Algorithm 1: One radix-2 butterfly (R==2 and B==1) and even number of address bits

Hmm. The derivation below is WRONG. It should go more like this:

Given

a transform with N datapoints such that N is a power of two and N ≥ 16; and
an even number of stages S, that is, S=log₂(N) is an even number where N is the number of datapoints to be transformed (note that the number of address bits is also equal to S).

For the algorithm to work correctly for one butterfly (B=1), we should have four memory banks (M=4) numbered 00₂ to 11₂ i.e. m₁m₀={00₂,01₂,10₂,11₂} i.e. m0=00₂, m1=01₂, m2=10₂ and m3=11₂.

The following equation(s) transform a given stage's datapoint sequence

(dp_(N-1) dp_(N-2)...dp₁,dp₀)

into an equivalent sequence

(dp'_(N-1) dp'_(N-2)...dp'₁,dp'₀)

such that each datapoint dp'_i maps to a memory bank mb_i where the corresponding memory bank sequence

(mb'_(N-1) mb'_(N-2)...mb'₀,mb'₁)

is the repeating series

(m0,m1,m2,m3, m0,m1,m2,m3 ... m0,m1,m2,m3),

where m0 = memory bank 00₂, m1 = memory bank 01₂, m2 = memory bank 10₂ and m3 = memory bank 11₂.

So:

For each S-bit datapoint dp_s,i in a given transform stage s ∈ (0..S-1),

dp_s,i = d_S-1d_S-2...d₁d₀

and i ∈ (0..N-1), redefine it as a new datapoint

dp'_s,i = d'_S-1d'_S-2...d'₁d'₀

where

if (s==S-1) then	d'₀ = P₀ and d'_s = P₁	and all other d'_j = d_j , where j ∈ (1..S-2)
else	d'_s = P_s and d'_(s+1) = P_(s+1)	and all other d'_j = d_j , where j ∈ (0..S-1) and j ∉ {s,s+1}

and

P_k = P₀ = P_even	if k is even and
P_k = P₁ = P_odd	if k is odd

and

P₀ = P_even = Parity(d_S-2,d_S-4...d₂,d₀)

is the parity of the datapoint's even bits, that is all those bits d_j such that j mod 2 = 0; and

P₁ = P_odd = Parity(d_S-1,d_S-3...d₃,d₁)

is the parity of the datapoint's odd bits, that is all those bits d_j such that j mod 2 = 1.

The equations work for

a transform with N datapoints such that N is a power of two and N ≥ 16; and
an even number of stages S, that is, S=log₂(N) is an even number where N is the number of datapoints to be transformed (note that the number of address bits is also equal to S).

That is, they work for number of datapoints N=16,64,256... and thus log(N)=4,6,8... (even) but not for N=8,32,128...where log(N)=3,5,7... is odd.

SCRUBBED TO HERE

Example mapping for stage s=0 of a transform for N=16 datapoints

**Example mapping for stage s=0 of a transform for N=16 datapoints**
Datapoint d (decimal)	Address (binary) = d₃d₂d₁d₀	P₁,P₀	Datapoint d' = Replace d_s+1d_s (d₁d₀) with P₁P₀ (reorder)	P'₁,P'₀	bank = P'₁P'₀
d00 d01 d02 d03	0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 1	0,0 0,1 1,0 1,1	0 0 0 0 = d00 0 0 0 1 = d01 0 0 1 0 = d02 0 0 1 1 = d03	0,0 0,1 1,0 1,1	m0 m1 m2 m3
d04 d05 d06 d07	0 1 0 0 0 1 0 1 0 1 1 0 0 1 1 1	0,1 0,0 1,1 1,0	0 1 0 1 = d05 0 1 0 0 = d04 0 1 1 1 = d07 0 1 1 0 = d06	0,0 0,1 1,0 1,1	m0 m1 m2 m3
d08 d09 d10 d11	1 0 0 0 1 0 0 1 1 0 1 0 1 0 1 1	1,1 1,0 0,1 0,0	1 0 1 1 = d11 1 0 1 0 = d10 1 0 0 1 = d09 1 0 0 0 = d08	0,0 0,1 1,0 1,1	m0 m1 m2 m3
d12 d13 d14 d15	1 1 0 0 1 1 0 1 1 1 1 0 1 1 1 1	1,0 1,1 0,0 0,1	1 1 1 0 = d14 1 1 1 1 = d15 1 1 0 0 = d12 1 1 0 1 = d13	0,0 0,1 1,0 1,1	m0 m1 m2 m3

TODO/NEXT/CONTINUE FROM HERE -------------------------------------------

Boundary condition, odd number of bits

If the number of bits is odd, we have a slight problem. Everything is fine as before from stage 0 up through the penultimate stage S-2. But then things change for the final stage S-1, where the toggling bits are the two outside bits d_S-1 (fast toggle 0,1,0,1) and d₀ (slow toggle 0,0,1,1). In the previous case, where there were an even number of address bits, we knew that d_S-1 was odd and d₀ was even, and thus P_odd would toggle 0,1,0,1 (or 1,0,1,0, depending on the value of the non-toggling bits) while P_even would toggle 0,0,1,1 (or 1,1,0,0).

In the case where the number of bits is odd, however, the two outside (toggling) bits d_S-1 and d₀ are both even. Thus, as d₀d_S-1 count (00,01,10,11), P_odd does not change at all (because the toggling bits are both even) while P_even counts either (0,1,1,0) (or (1,0,0,1), depending on the value of the non-toggling bits) instead of its usual 0-0-1-1 (or 1-1-0-0). Thus, if base the memory bank number on P_oddP_even we only get two memory banks repeated twice, e.g. 00-01-00-01, instead of the desired mapping to four separate memory banks 00-01-10-11

~~Talk about what happens in final stage when s==nbits and nbits is odd.~~

Stage 0	Stage 1	Stage 2
d0=000	d0=000	d0=000
d1=001	d2=010	d4=100
d2=010	d4=100	d1=001
d3=011	d6=110	d5=101
d4=100	d1=001	d2=010
d5=101	d3=011	d6=110
d6=110	d5=101	d3=011
d7=111	d7=111	d7=111

https://skydrive.live.com/view.aspx?cid=8595D345F1A90105&resid=8595D345F1A90105%213371&app=WordPdf&wdo=1

Try this: for stage s=S-1, instead of

d_s = d_(S-1) = P_s = P_1 (because S is even and thus S-1 is odd)

and

d_(s+1) = d_0 = P_(s+1) = P_0,

do this maybe

d_s = d_(S-1) = xor(P_s,d_0) = xor(P_(S-1),d_0) = xor(P_0,d_0) (because S is odd and thus (S-1) is even) (also note xor(P_0,d_0) is same as P_0) (right?)

and

d_(s+1) = d_S = d_1 (why? because d_(S-1)==d_0 sorta?) (because

and

d_1 = xor(P_(s+1),d_0) = xor(P_1,d_0)

Final equation for when nbits is odd and R==2 and B==1

Algorithm 2: Final equation(s) for R=2 and B=1

Equation for mapping addresses to banks for one butterly (B==1) at radix 2 (r==2)

The ... REST ... of the story?

Projects: FFT_Formal_Proof_-_Auxiliary_Info#Sketching_out_the_remainder_of_the_proof

Extension to more than one butterfly unit (B > 1).

Generalization of Algorithm 1, extended for any B=2ⁱ and any number of address bits S: mod_MbitsS=0

Eventually we will find and show that, by extension, if we have

an FFT with B butterflies (and thus M=4B memory banks);
a transform with S=log₂(N) stages such that mod_Mbits(S)=0, where Mbits=log₂ (M)
(e.g. B=1⇒M=4⇒Mbits=2 and mod_Mbits(S)=mod₂(S)=0⇒number of stages S is even);

everything from here down is wrong, i think; should probably instead copy and modify finished algorithm 1 from up above

The mapping of any given datapoint

d =d_S-1d_S-2...d₁d₀

to its proper memory bank

m=m_Mbits-1m_Mbits-2...m₁m₀

is as follows:

m_Mbits-1 = P_{(Mbits-1)/Mbits}
m_Mbits-2 = P_{(Mbits-2)/Mbits}
m₁ = P_1/Mbits
m₀ = P_0/Mbits

where

P_{Mbits-1/Mbits} = Parity(d_S-1,d_S-Mbits-1...d_2Mbits-1,d_Mbits-1)

is the parity of all those bits d_i such that i mod Mbits = (Mbits-1); and

P_{Mbits-2/Mbits} = Parity(d_S-2,d_S-Mbits-2...d_2Mbits-2,d_Mbits-2)

is the parity of all those bits d_i such that i mod Mbits = (Mbits-2); and

P_1/Mbits = Parity(d_S-(Mbits-1),d_S-(2Mbits-1)...d_Mbits+1,d₁)

is the parity of all those bits d_i such that i mod Mbits = 1; and

P_0/Mbits = Parity(d_S-(Mbits),d_S-(2Mbits)...d_Mbits,d₀)

is the parity of all those bits d_i such that i mod Mbits = 0; and The equations thus work for number of datapoints N=16,64,256... and thus log(N)=4,6,8... (even) but not for N=8,32,128...where log(N)=3,5,7... is odd.

Extension to radix R > 2.

Extension to longer pipelines.

Nomenclature (Symbols)

We will try to use symbols consistently throughout this writeup, as described by this alphabetically-ordered reference table.

Italicized symbols are numbers, non-italicized symbols are names.

lower-case symbols are variables, UPPER-CASE symbols are constant (for a given FFT).

Symbol	Meaning	Notes
B	How many butterfly units	To complete a transform for a given number of datapoints N>=8, an FFT with one butterfly unit (B=1) generally takes four times as long as an FFT with four butterfly units (B=4).
d_i	Datapoint bit number	d₀ is always the least-significant bit (LSB). Example: datapoint decimal d13 has four bits d₃d₂d₁d₀ = 1101
di	A specific datapoint	An eight-point transform has datapoints d0 through d7
i	Unknown integer	We usually start numbering from i=0
M	How many memory banks	For purposes of this proof M=4B
m	Memory bank	Generally need four memory banks per butterfly e.g. m=0 through m=7 for two butterflies (B=2).
mi	A specific memory bank	Generally need four memory banks per butterfly e.g. m0 through m7 for two butterflies (B=2).
m_i	Memory bank bit number	Generally need four memory banks per butterfly e.g. m₂m₁m₀={000,001 ... 111} for two butterflies (B=2).
Mbits	How many bits in each mem bank number m	Mbits = log₂(M)
N	Number of datapoints to transform	A power of two, e.g. 8, 16, 1024 or 8096
P_o or Po	Parity of odd bits	Example: decimal 13 has four bits d₃d₂d₁d₀ = 1101; thus P_o is parity(d₃,d₁) = parity(1,0) = 1
P_e or Pe	Parity of even bits	Example: decimal 13 has four bits d₃d₂d₁d₀ = 1101; thus P_o is parity(d₂,d₀) = parity(1,1) = 0
R	Radix	Default is 2
S	Number of stages in a transform	An eight-point transform has three stages, so S=3
s	Stage number	An eight-point transform has three stages s=0, s=1 and s=2.

~~For technical reasons (laziness actually), italics and subscripts might occasionally be dropped, e.g. you'll see d0 instead of d₀ etc. I'll hope to fix this later...~~

Notes/TODO

Nomenclature section should be referenced way up front at the beginning.

OLD: Establish meaning of symbols s (stage number), n_s (how many stages), b_i (bit number i, where b_0 is LSB and b_S is MSB maybe), n_b, r, n_B etc.

Housekeeping/Schedule (or: how/when I'm gonna be done with this)

Wanna be done by FRIIIIIDAYYYYYY! Not gonna happen.

What will be done by Friday:

checkoff list for what will be done when (see below)

Checkoff list:

   Target     Actual
   Finish     Finish    Task
  ---------  ---------  ----------------------------------------------
  Tue 11/19  Tue 11/19  Move the old stuff to an auxiliary page. DONE!
  ...
  Tue 11/19  Tue 11/19  Clean up the outline, figure out where to put
                        the "nomenclature" section (see below)
  ...
  Tue 11/19  xxx xx/xx  Build a timeline for what gets done when, 
                        working like twelve hours a week or whatever.
  ...
  Tue 11/19  Tue 11/19  Start on nomenclature section: what is N, n, N_B etc. 
  ...
  Thu 11/19  Tue 11/19  Finish nomenclature section: what is N, n, N_B etc. 
  ...
  Thu 11/19  xxx xx/xx  Continue with the proof, in order of sections not finished
  ...
  TBD List of sections, including when each one will be done
  ...
  xxx xx/xx  xxx xx/xx  Section 5
  xxx xx/xx  xxx xx/xx    5.1 The equation so far
  xxx xx/xx  xxx xx/xx    5.2 Boundary condition, odd number of bits
  xxx xx/xx  xxx xx/xx    5.3 Boundary condition, even number of bits
  xxx xx/xx  xxx xx/xx    5.4 Final equation for r==2 and B==1
  xxx xx/xx  xxx xx/xx    Final polish of section 5
  ...
  xxx xx/xx  xxx xx/xx  Section 6
  xxx xx/xx  xxx xx/xx    6.1 Extension to more than one butterfly unit (B > 1).
  xxx xx/xx  xxx xx/xx    6.2 Extension to radix r > 2.
  xxx xx/xx  xxx xx/xx    6.3 Extension to longer pipelines.

Housekeeping/TODO list

TODO: LaTeX

Later: replace html page with a latex page for publication.

TODO: Notification list

How to sign up for notification of when the proof is finished.

Send me mail, I'll put you on a list and notify you when the proof is done (link to profile page on vlsiwebsite). Or maybe look at the schedule (link to schedule section above/below).

TODO: Nomenclature

Go back and recheck entire proof, see if it is consistent with nomenclature section 9.

TODO: FFT Generator

Make sure FFT generator (correctly) uses algorithm described in the formal proof!

Notes

Hand-written notes where the proof originally was developed:

https://vlsiweb.stanford.edu/mediawiki/index.php/File:Formal_proof_notes.pdf

That algorithm as implemented in the Perl test program:

############################################################################
# E.g. for 8 datapoints
#   stage 0 ($op1,$op2) will be {(0,1),(2,3),(4,5),(6,7)} maybe
#   stage 1 ($op1,$op2) will be {(0,2),(4,6),(1,3),(5,7)} maybe
#   stage 2 ($op1,$op2) will be {(0,4),(1,5),(2,6),(3,7)} maybe

# Normal treatment for stages 0 through nstages-2
if ($s != ($nstages-1)) {
    $op1 = smart_replace($op1, $s);
    $op1 = smart_replace($op1, $s+1);

    $op2 = smart_replace($op2, $s);
    $op2 = smart_replace($op2, $s+1);
}

# Last stage nstages-1 needs special treatment.
elsif ($s%2 == 1) {

    # Even number of bits; use zero instead of s+1 maybe.
    $op1 = smart_replace($op1, $s);
    $op1 = smart_replace($op1, 0);

    $op2 = smart_replace($op2, $s);
    $op2 = smart_replace($op2, 0);
}
else {
    # Odd number of bits; gotta do sumpm crazy maybe.
    $op1 = weird_replace($op1, $s);
    $op2 = weird_replace($op2, $s);
}

# Swizzle to assign correct bank & row to $op1, $op2
my ($b1, $r1) = swizzler::swizzle($op1, $npoints, 4*$nunits);
my ($b2, $r2) = swizzler::swizzle($op2, $npoints, 4*$nunits);

##############################################################################
sub smart_replace {
    # In the given number $n, replace bit $i with new bit $b
    # If $i is  odd, new bit $b is the parity of the  odd bits of $n
    # If $i is even, new bit $b is the parity of the even bits of $n
    #
    my $n = shift;
    my $i = shift;

    my $newbit = $i%2 ? parity_odd($n) : parity_even($n);
    $n = replace_bit($n, $i, $newbit);    # Put $newbit into $n at position $i
    return $n;
}

sub weird_replace {
    # Given number $n and ultimate stage $s, where $s is even, do some weird $#!?
    my $n = shift;
    my $s = shift;

    # This is supposed to work at least for when nbits(datapoints) is odd.  i think.
    my $b0 = ($n & 0x1);
    my $b1 = ($n & 0x2) >> 1;

    my $new_top_bit = parity_even($n) ^ $b1;
    my $new_b1 = parity_odd($n) ^ $b1 ^ $b0;
    my $new_b0 = $b1 ^ $b0;  # Complicated enough yet!?

    $n = replace_bit($n,$s,$new_top_bit); # Put $new_top_bit into $n at position $s
    $n = replace_bit($n, 1,$new_b1);      # Put $new_b1 into $n at position 1
    $n = replace_bit($n, 0,$new_b0);      # Put $new_b0 into $n at position 0 (LSB right?)

    return $n;
}

Other Notes, Trash and Auxiliary Info

Projects: FFT Formal Proof v0 (old)

Projects: FFT Formal Proof - Auxiliary Info

Projects: FFT_Formal_Proof_-_Auxiliary_Info#Trash

Developing the basic equation for one radix-2 butterfly (R==2 and B==1)

The equation so far

[Summarize]

NEXT: ADD LINK TO TAVALA PAPER HERE FOR REFERENCE

http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1205957

http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1205957&tag=1

http://genesis2.stanford.edu/mediawiki/index.php/File:Takala_equations.jpg

Example

**An example table**
Datapoint	Address (binary)	P₁,P₀	bank
d00 d01 d02 d03	0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 1	0,0 0,1 1,0 1,1	m0 m1 m2 m3
d04 d05 d06 d07	0 1 0 0 0 1 0 1 0 1 1 0 0 1 1 1	0,1 0,0 1,1 1,0	m1 m0 m3 m2
d08 d09 d10 d11	1 0 0 0 1 0 0 1 1 0 1 0 1 0 1 1	1,1 1,0 0,1 0,0	m3 m2 m1 m0
d12 d13 d14 d15	1 1 0 0 1 1 0 1 1 1 1 0 1 1 1 1	1,0 1,1 0,0 0,1	m2 m3 m0 m1

another trash block

where

if (s==S-1) then d'_0 = P_0 and d'_s = P_1 and all other d'_j = d_j

else d'_s = P_s and d'_(s+1) = P_(s+1) and all other d'_j = d_j

another trash block

Given a transform with an even number of stages S, that is, S=log₂(N) is an even number where N is the number of datapoints to be transformed (note that the number of address bits is also equal to S); for one butterfly (B=1) we should have four memory banks (M=4) numbered 00 to 11 i.e. m₁m₀={00,01,10,00}. The mapping of any given datapoint

~~d=d_S-1d_S-2...d₁d₀~~

to its proper memory bank
m=m₁m₀
is as follows:

m₁ = P₁ = P_odd

m₀ = P₀ = P_even

~~The equations thus work for number of datapoints N=16,64,256... and thus log(N)=4,6,8... (even) but not for N=8,32,128...where log(N)=3,5,7... is odd.~~

need to define P_i for the range i=0 to S and number of bits N
then we can say

~~where~~

if (j==s) then d'_j = P_j

else d'_j = d_j
~~and P_j is defined as~~
if (j==S-1) then P_j = P_S-1
d_i = P_s = P₁ = P_odd and d₀ = P_s = P₀ = P_even
dp'_i = d'_S-1d'_S-2...d'₁d'₀
where

if s==i and s==S-1 then i is odd and d_i = P_s = P₁ = P_odd and d₀ = P_s = P₀ = P_even

else if s==i and s!=S-1) then d_i = P_s and d_i+1 = P's+1

Projects: FFT Formal Proof v0 (old) - StanfordVLSI/Genesis2 GitHub Wiki

Table of Contents

Major caveat(s)

The Problem

Key Insights

Insight 1: We can ensure that each of the four memory banks m0, m1, m2 and m3 are represented in every quad, in some order

Insight 2: We can reorder the quads such that each one contains the four memory banks m0, m1, m2 and m3 in a consistent order

Boundary condition, even number of bits

Algorithm 1: One radix-2 butterfly (R==2 and B==1) and even number of address bits

SCRUBBED TO HERE

Example mapping for stage s=0 of a transform for N=16 datapoints

TODO/NEXT/CONTINUE FROM HERE -------------------------------------------

Boundary condition, odd number of bits

Final equation for when nbits is odd and R==2 and B==1

Algorithm 2: Final equation(s) for R=2 and B=1

The ... REST ... of the story?

Extension to more than one butterfly unit (B > 1).

Generalization of Algorithm 1, extended for any B=2ⁱ and any number of address bits S: mod_MbitsS=0

everything from here down is wrong, i think; should probably instead copy and modify finished algorithm 1 from up above

Extension to radix R > 2.

Extension to longer pipelines.

Nomenclature (Symbols)

Notes/TODO

Housekeeping/Schedule (or: how/when I'm gonna be done with this)

Housekeeping/TODO list

TODO: LaTeX

TODO: Notification list

TODO: Nomenclature

TODO: FFT Generator

Notes

Other Notes, Trash and Auxiliary Info

Developing the basic equation for one radix-2 butterfly (R==2 and B==1)

The equation so far

Example

another trash block

another trash block

⚠️ GitHub.com Fallback ⚠️

Projects: FFT Formal Proof v0 (old) - StanfordVLSI/Genesis2 GitHub Wiki

Table of Contents

Major caveat(s)

The Problem

Key Insights

Insight 1: We can ensure that each of the four memory banks m0, m1, m2 and m3 are represented in every quad, in some order

Insight 2: We can reorder the quads such that each one contains the four memory banks m0, m1, m2 and m3 in a consistent order

Boundary condition, even number of bits

Algorithm 1: One radix-2 butterfly (R==2 and B==1) and even number of address bits

SCRUBBED TO HERE

Example mapping for stage s=0 of a transform for N=16 datapoints

TODO/NEXT/CONTINUE FROM HERE -------------------------------------------

Boundary condition, odd number of bits

Final equation for when nbits is odd and R==2 and B==1

Algorithm 2: Final equation(s) for R=2 and B=1

The ... REST ... of the story?

Extension to more than one butterfly unit (B > 1).

Generalization of Algorithm 1, extended for any B=2i and any number of address bits S: modMbitsS=0

everything from here down is wrong, i think; should probably instead copy and modify finished algorithm 1 from up above

Extension to radix R > 2.

Extension to longer pipelines.

Nomenclature (Symbols)

Notes/TODO

Housekeeping/Schedule (or: how/when I'm gonna be done with this)

Housekeeping/TODO list

TODO: LaTeX

TODO: Notification list

TODO: Nomenclature

TODO: FFT Generator

Notes

Other Notes, Trash and Auxiliary Info

Developing the basic equation for one radix-2 butterfly (R==2 and B==1)

The equation so far

Example

another trash block

another trash block

⚠️ **GitHub.com Fallback** ⚠️

Generalization of Algorithm 1, extended for any B=2ⁱ and any number of address bits S: mod_MbitsS=0

⚠️ GitHub.com Fallback ⚠️