Projects: FFT Formal Proof v0 (old) - StanfordVLSI/Genesis2 GitHub Wiki

Table of Contents

Major caveat(s)

Caveat 1: The proof is obviously unfinished. I believe there is enough information in what's been written so far to properly construct the proof, but clearly I haven't yet made the time to do that myself. But I'm actively working on it! Come back later if you want to see the complete final proof. Or send me mail and I can notify you when it's done.

Caveat 2: The Stanford FFT generator, which I authored, may not yet actually use the algorithm that results directly from this proof. But I plan to do that eventually...!

The Problem

A detailed description of the problem can be found in the 2013 VLSI-SoC paper

  • An area-efficient minimum-time FFT schedule using single-ported memory, Richardson, Shacham, Markovic and Horowitz. (To appear in) Proceedings of the 21st IFIP/IEEE International Conference on VLSI (VLSI-SoC), 2013.
To summarize the problem here: take a standard schedule of operations for a Fast Fourier Transform processor (FFT) based on a radix-2 Cooley-Tukey algorithm, e.g. for N=8 datapoints and one butterfly unit, as shown in Figure 1. FIGURE 1

That is, suppose we have eight datapoints d0 through d7 that we want to transform from the time domain to the frequency domain. In Stage 0 of operation, a radix-2 butterfly unit reads datapoints d0 and d1, producing two results that then overwrite the original locations of d0 and d1 in memory. Similar operations occur for datapoint pairs (d2,d3), (d4,d5) and (d6,d7) respectively. In Stage 1, the datapoints are processed with stride-two pairings e.g. (d0,d2), (d4,d6), (d1,d3) and (d5,d7). And finally Stage 2 completes the operation, re-walking the data by stride 4. Table 1 summarizes this access pattern.

 TABLE 1
 Stage 0: (d0,d1) (d2,d3) (d4,d5) (d6,d7)
 Stage 1: (d0,d2) (d4,d6) (d1,d3) (d5,d7)
 Stage 2: (d0,d4) (d1,d5) (d2,d6) (d3,d7)

If the butterfly unit is pipelined such that result writebacks occur at the same time as operand reads, then datapoints (d0,d1) get written to memory at the same time as (d2,d3) are being read from memory. For memory to be able to supply all four datapoints at once, it needs to be four-ported. Alternatively, we could try to use two separate banks of two-ported memory or even four banks of single-ported memory. Of these three choices, four banks of single-ported memory is most efficient in terms of area, so we'd like to make that work if possible.

Now: Within any stage, look at any sequential group of four operands starting on a mod-two boundary. We'll call such a group a quad. If the quad happens to start on a mod-four boundary, we'll call this an aligned quad. Stage 0 has three overlapping quads: (d0,d1,d2,d3), (d2,d3,d4,d5) and (d4,d5,d6,d7), the outside two of which are aligned quads. Our goal is to map each datapoint to one of four memory banks (m0,m1,m2,m3) such that every quad contains exactly one of each bank number. For example, we could map datapoints (d0,d1,d2,d3,d4,d5,d6,d7) to memory banks (m0,m1,m2,m3,m0,m1,m2,m3) respectively, and thus fulfill our goal for each of the three quads in Stage 0:

  • (d0,d1,d2,d3) maps to (m0,m1,m2,m3)
  • (d2,d3,d4,d5) maps to (m2,m3,m0,m1)
  • (d4,d5,d6,d7) maps to (m0,m1,m2,m3)
The first quad in Stage 1, however, fails because its datapoints (d0,d2,d4,d6) would map to (m0,m2,m0,m2). Banks m1 and m3 are not represented in the quad, while m0 and m2 each appear twice.

We're going to fulfill our goal by enforcing an order to each aligned quad, such that every aligned quad within a stage has the same memory bank ordering e.g. (m0,m1,m2,m3). This means that every unaligned quad will also have a consistent ordering, e.g. (m2,m3,m1,m0), and thus we meet our goal of uniqueness for all quads. This property will hold whether the transform is operating on eight datapoints, with two aligned quads per stage, or eight thousand datapoints, with two thousand quads per stage.

In theory the four operations in a given stage all happen simultaneously. In practice, it's not always practical to have enough butterfly units for that. So for instance, if we only have one butterfly unit, it will have to be used four times in succession to complete the transformation for a given stage, using the following four operations:

  1. read, process and writeback datapoints d0 and d1;
  2. read, process and writeback datapoints d2 and d3;
  3. read, process and writeback datapoints d4 and d5;
  4. read, process and writeback datapoints d6 and d7.
In pipelined operation, the writeback of each previous data pair occurs at the same time as the read of each new pair of datapoints. Thus, for example datapoints (d0,d1,d2,d3) will all have to be accessible at the same time within the same cycle, as will the datapoints in every quad. We want to use single-ported memory for the datapoints, thus our requirement that the datapoints in each quad must all live in separate memory banks (and assuming, of course, that each memory bank has its own separate path to the butterfly unit).

Key Insights

Insight 1: We can ensure that each of the four memory banks m0, m1, m2 and m3 are represented in every quad, in some order

Once again, here is our schedule for N=8 datapoints:

  • Stage 0:  (d0,d1)  (d2,d3)  (d4,d5)  (d6,d7)
  • Stage 1:  (d0,d2)  (d4,d6)  (d1,d3)  (d5,d7)
  • Stage 2:  (d0,d4)  (d1,d5)  (d2,d6)  (d3,d7)
Looking carefully at the quads, we see that they all share a unique property. For example, consider the quad represented by the four operands in the last two operations of Stage 1: (d1,d3),(d5,d7). Looking at the binary representation of the indices within any such group of four, only two bits change, the bits ds+1 and ds where s is the stage number and bit d0 is the least significant bit, or LSB. We call these two bits the toggling bits, and they always count 00, 01, 10, 11 within an aligned quad. In our example, the binary indices for (d1,d3),(d5,d7) are (00001,00011),(00101,00111). The toggling bits d2 and d1 count (00,01),(10,11) while the non-toggling bits d4d3d0 are always 001.

Given this information, we can assign banks to operands such that banknum = PoPe where Po is the parity of ALL odd bits and Pe is the parity of ALL even bits. This means that any given sequential block of four operands will map to four separate banks.

In our example quad [(d1,d3),(d5,d7)], or [(00001,00011),(00101,00111)]

  • d1: Po(00001) = parity(0,0) = 0; Pe(00001) = parity(0,0,1) = 1; PoPe = 01; banknum m = 1
  • d3: Po(00011) = parity(0,1) = 1; Pe(00011) = parity(0,0,1) = 1; PoPe = 11; banknum m = 3
  • d5: Po(00101) = parity(0,0) = 0; Pe(00101) = parity(0,1,1) = 0; PoPe = 00; banknum m = 0
  • d7: Po(00111) = parity(0,1) = 1; Pe(00111) = parity(0,1,1) = 0; PoPe = 10; banknum m = 2

Insight 2: We can reorder the quads such that each one contains the four memory banks m0, m1, m2 and m3 in a consistent order

When the stage number s is even, then Pe within any group of four will be either (0,1,0,1) or (1,0,1,0) depending on the (constant 0 or 1) parity of the non-toggling bits. Why? Because we know that bit ds within any group of four always counts (0,1,0,1) while the remaining (non-toggling) even bits remain constant.

Similarly, Po will be either (0,0,1,1) or (1,1,0,0). Bank number PoPe for a quad can thus be one of the following four groups: (00,01,10,11); (01,00,11,10); (10,11,00,01); or (11,10,01,00).

When the stage number is odd, we know that b(s) is an odd bit and thus it is Po that will be either (0,1,0,1) or (1,0,1,0) while Pe is (0,0,1,1) or (1,1,0,0). Bank number PoPe for a quad will thus be one of: (00,10,01,11); (01,11,00,10); (10,00,11,01); or (11,01,10,00).

Suppose, that, when stage number s is even and when Pe is going to count (1,0,1,0), we change ds (a component of Pe because s is even) such that it counts (1,0,1,0) instead of (0,1,0,1). In other words, instead of (d1,d3,d5,d7), where d1 goes (0,1,0,1) we change the sequence to (d5,d7,d1,d3), so that d1 goes (1,0,1,0). Is it okay to change the sequence like this? Yes it is because this is a valid reordering, as discussed above. And now Pe for the new ordering counts (0,1,0,1) instead of the original (1,0,1,0).

Further suppose that, when stage number s is even and when Po is about to count (1,1,0,0), we change ds+1 (a component of Po, because we know s+1 is odd) such that it counts (1,1,0,0) instead of (0,0,1,1). (In our example sequence (d1,d3,d5,d7) we would not need to make this change, because Po already counts the way we want it to.) In general, is it okay to change the sequence like this when needed? Yes it is because within any given group of four, the order of operations is unimportant; also the order of pairs within a given operation is unimportant. In other words, looking again at the last four operands in Stage 1 of our example above, the given order is (d1,d3)(d5,d7) but all EIGHT of the following possible sequences are equally valid:

 * (d1,d3)(d5,d7);  (d1,d3)(d7,d5);  (d3,d1)(d5,d7);  (d3,d1)(d7,d5); (d5,d7)(d1,d3);  (d5,d7)(d3,d1);  (d7,d5)(d1,d3);  (d7,d5)(d3,d1).

Let's define Pi such that

  • Pi = Pe when i is even, and
  • Pi = Po when i is odd
Now we see that WE CAN ALWAYS REPLACE ds+1 AND ds WITH Ps+1 AND Ps RESPECTIVELY TO CANONICALIZE THE BANK ORDER. In particular, given this transformation, banks in an aligned even-stage quad will always count (00,01,10,11) and the banks in an aligned odd-stage quad will always count (00,10,01,11).

In our example, (d1,d3,d5,d7) is part of an odd stage; Po=(0,1,0,1) and Pe=(1,1,0,0). Because Pe counts incorrectly (1,1,0,0) instead of (0,0,1,1), we replace b(s+1) with Pe to come up with a new ordering (d5,d7,d1,d3). Now Po=(0,1,0,1) and Pe=(0,0,1,1) and final bank number assignments read (00,10,01,11), which is the correct canonical order for an odd stage.

Does this make sense yet?

This is the core of the proof. However, we still need to place it in a formal context. Also, we must discuss:

  • the special cases for the boundary condition when s and s+1 wrap around at the end, that is when s == S-1.
  • how the proof extends to any power-of two values for B, N or r maybe;
  • and, for advanced work, we should mention how the proof extends to any pipeline depth.

Boundary condition, even number of bits

Note that our insight regarding quads changes slightly in the final stage of operation, when s = S-1. In all previous stages up to that point, the toggling bits are adjacent (see example below).

Stage 0 Stage 1 Stage 2 Stage 3




d0=0000 d0=0000 d0=0000 d0=0000
d1=0001 d2=0010 d2=0100 d4=1000
d2=0010 d4=0100 d4=1000 d1=0001
d3=0011 d6=0110 d6=1100 d5=1001




d4=0100 d1=0001 d1=0001 d2=0010
d5=0101 d3=0011 d3=0101 d6=1010
d6=0110 d5=0101 d5=1001 d3=0011
d7=0111 d7=0111 d7=1101 d7=1011




... ... ... ...
d7=1110 d7=1101 d7=1011 d7=0111
d7=1111 d7=1111 d7=1111 d7=1111

At the final stage s=S-1, however, the toggling bits wrap around, such that the toggling bits are the LSB and the MSB respectively, i.e. bits d0 and ds instead of the usual bits ds+1 and ds (because in this case ds+1 does not exist).

This wrap works fine if the number of bits is even, in which case d0 is naturally part of Pe and ds part of Po, same as in the case where the toggling bits are adjacent. Later we will discuss the case where then number of bits is odd.

Algorithm 1: One radix-2 butterfly (R==2 and B==1) and even number of address bits

Hmm. The derivation below is WRONG. It should go more like this:

Given

  • a transform with N datapoints such that N is a power of two and N β‰₯ 16; and
  • an even number of stages S, that is, S=log2(N) is an even number where N is the number of datapoints to be transformed (note that the number of address bits is also equal to S).
For the algorithm to work correctly for one butterfly (B=1), we should have four memory banks (M=4) numbered 002 to 112 i.e. m1m0={002,012,102,112} i.e. m0=002, m1=012, m2=102 and m3=112.

The following equation(s) transform a given stage's datapoint sequence

(dp(N-1) dp(N-2)...dp1,dp0)
into an equivalent sequence
(dp'(N-1) dp'(N-2)...dp'1,dp'0)
such that each datapoint dp'i maps to a memory bank mbi where the corresponding memory bank sequence
(mb'(N-1) mb'(N-2)...mb'0,mb'1)
is the repeating series
(m0,m1,m2,m3, m0,m1,m2,m3 ... m0,m1,m2,m3),
where m0 = memory bank 002, m1 = memory bank 012, m2 = memory bank 102 and m3 = memory bank 112.

So:

For each S-bit datapoint dps,i in a given transform stage s βˆˆ (0..S-1),

dps,i = dS-1dS-2...d1d0
and i βˆˆ (0..N-1), redefine it as a new datapoint
dp's,i = d'S-1d'S-2...d'1d'0
where
    if (s==S-1) then      d'0 = P0    and d's = P1    and all other d'j = dj , where j ∈ (1..S-2)
    else      d's = Ps    and d'(s+1) = P(s+1)    and all other d'j = dj , where j ∈ (0..S-1) and j βˆ‰ {s,s+1}

and

    Pk = P0 = Peven    if k is even and
    Pk = P1 = Podd    if k is odd

and

  • P0 = Peven = Parity(dS-2,dS-4...d2,d0)
is the parity of the datapoint's even bits, that is all those bits dj such that j mod 2 = 0; and
  • P1 = Podd = Parity(dS-1,dS-3...d3,d1)
is the parity of the datapoint's odd bits, that is all those bits dj such that j mod 2 = 1.

The equations work for

  • a transform with N datapoints such that N is a power of two and N β‰₯ 16; and
  • an even number of stages S, that is, S=log2(N) is an even number where N is the number of datapoints to be transformed (note that the number of address bits is also equal to S).
That is, they work for number of datapoints N=16,64,256... and thus log(N)=4,6,8... (even) but not for N=8,32,128...where log(N)=3,5,7... is odd.

SCRUBBED TO HERE


Example mapping for stage s=0 of a transform for N=16 datapoints

Example mapping for stage s=0 of a transform for N=16 datapoints
   Datapoint d   
(decimal)
   Address (binary)
= d3d2d1d0   
   P1,P0       Datapoint d' =
Replace ds+1ds (d1d0)
with P1P0 (reorder)
   P'1,P'0       bank = P'1P'0   
d00
d01
d02
d03
0 0 0 0
0 0 0 1
0 0 1 0
0 0 1 1
0,0
0,1
1,0
1,1
0 0 0 0 = d00
0 0 0 1 = d01
0 0 1 0 = d02
0 0 1 1 = d03
0,0
0,1
1,0
1,1
m0
m1
m2
m3
d04
d05
d06
d07
0 1 0 0
0 1 0 1
0 1 1 0
0 1 1 1
0,1
0,0
1,1
1,0
0 1 0 1 = d05
0 1 0 0 = d04
0 1 1 1 = d07
0 1 1 0 = d06
0,0
0,1
1,0
1,1
m0
m1
m2
m3
d08
d09
d10
d11
1 0 0 0
1 0 0 1
1 0 1 0
1 0 1 1
1,1
1,0
0,1
0,0
1 0 1 1 = d11
1 0 1 0 = d10
1 0 0 1 = d09
1 0 0 0 = d08
0,0
0,1
1,0
1,1
m0
m1
m2
m3
d12
d13
d14
d15
1 1 0 0
1 1 0 1
1 1 1 0
1 1 1 1
1,0
1,1
0,0
0,1
1 1 1 0 = d14
1 1 1 1 = d15
1 1 0 0 = d12
1 1 0 1 = d13
0,0
0,1
1,0
1,1
m0
m1
m2
m3

TODO/NEXT/CONTINUE FROM HERE -------------------------------------------

Boundary condition, odd number of bits

If the number of bits is odd, we have a slight problem. Everything is fine as before from stage 0 up through the penultimate stage S-2. But then things change for the final stage S-1, where the toggling bits are the two outside bits dS-1 (fast toggle 0,1,0,1) and d0 (slow toggle 0,0,1,1). In the previous case, where there were an even number of address bits, we knew that dS-1 was odd and d0 was even, and thus Podd would toggle 0,1,0,1 (or 1,0,1,0, depending on the value of the non-toggling bits) while Peven would toggle 0,0,1,1 (or 1,1,0,0).

In the case where the number of bits is odd, however, the two outside (toggling) bits dS-1 and d0 are both even. Thus, as d0dS-1 count (00,01,10,11), Podd does not change at all (because the toggling bits are both even) while Peven counts either (0,1,1,0) (or (1,0,0,1), depending on the value of the non-toggling bits) instead of its usual 0-0-1-1 (or 1-1-0-0). Thus, if base the memory bank number on PoddPeven we only get two memory banks repeated twice, e.g. 00-01-00-01, instead of the desired mapping to four separate memory banks 00-01-10-11

Talk about what happens in final stage when s==nbits and nbits is odd.

Stage 0      Stage 1      Stage 2     
d0=000 d0=000 d0=000
d1=001 d2=010 d4=100
d2=010 d4=100 d1=001
d3=011 d6=110 d5=101
d4=100 d1=001 d2=010
d5=101 d3=011 d6=110
d6=110 d5=101 d3=011
d7=111 d7=111 d7=111

https://skydrive.live.com/view.aspx?cid=8595D345F1A90105&resid=8595D345F1A90105%213371&app=WordPdf&wdo=1

Try this: for stage s=S-1, instead of

  • d_s = d_(S-1) = P_s = P_1 (because S is even and thus S-1 is odd)
and
  • d_(s+1) = d_0 = P_(s+1) = P_0,

do this maybe
  • d_s = d_(S-1) = xor(P_s,d_0) = xor(P_(S-1),d_0) = xor(P_0,d_0) (because S is odd and thus (S-1) is even) (also note xor(P_0,d_0) is same as P_0) (right?)
and
  • d_(s+1) = d_S = d_1 (why? because d_(S-1)==d_0 sorta?) (because
and
  • d_1 = xor(P_(s+1),d_0) = xor(P_1,d_0)

Final equation for when nbits is odd and R==2 and B==1

Algorithm 2: Final equation(s) for R=2 and B=1

Equation for mapping addresses to banks for one butterly (B==1) at radix 2 (r==2)

The ... REST ... of the story?

Projects: FFT_Formal_Proof_-_Auxiliary_Info#Sketching_out_the_remainder_of_the_proof

Extension to more than one butterfly unit (B > 1).

Generalization of Algorithm 1, extended for any B=2i and any number of address bits S: modMbitsS=0

Eventually we will find and show that, by extension, if we have

  • an FFT with B butterflies (and thus M=4B memory banks);
  • a transform with S=log2(N) stages such that modMbits(S)=0, where Mbits=log2 (M)
  • (e.g. B=1β‡’M=4β‡’Mbits=2 and modMbits(S)=mod2(S)=0β‡’number of stages S is even);

everything from here down is wrong, i think; should probably instead copy and modify finished algorithm 1 from up above

The mapping of any given datapoint

d =dS-1dS-2...d1d0
to its proper memory bank
m=mMbits-1mMbits-2...m1m0
is as follows:
  • mMbits-1 = P(Mbits-1)/Mbits
  • mMbits-2 = P(Mbits-2)/Mbits
  • m1 = P1/Mbits
  • m0 = P0/Mbits
where
  • PMbits-1/Mbits = Parity(dS-1,dS-Mbits-1...d2Mbits-1,dMbits-1)
is the parity of all those bits di such that i mod Mbits = (Mbits-1); and
  • PMbits-2/Mbits = Parity(dS-2,dS-Mbits-2...d2Mbits-2,dMbits-2)
is the parity of all those bits di such that i mod Mbits = (Mbits-2); and
  • P1/Mbits = Parity(dS-(Mbits-1),dS-(2Mbits-1)...dMbits+1,d1)
is the parity of all those bits di such that i mod Mbits = 1; and
  • P0/Mbits = Parity(dS-(Mbits),dS-(2Mbits)...dMbits,d0)
is the parity of all those bits di such that i mod Mbits = 0; and The equations thus work for number of datapoints N=16,64,256... and thus log(N)=4,6,8... (even) but not for N=8,32,128...where log(N)=3,5,7... is odd.

Extension to radix R > 2.

Extension to longer pipelines.

Nomenclature (Symbols)

We will try to use symbols consistently throughout this writeup, as described by this alphabetically-ordered reference table.

Italicized symbols are numbers, non-italicized symbols are names.

lower-case symbols are variables, UPPER-CASE symbols are constant (for a given FFT).

Symbol Meaning Notes
B How many butterfly units To complete a transform for a given number of datapoints N>=8, an FFT with one butterfly unit
(B=1) generally takes four times as long as an FFT with four butterfly units (B=4).
di Datapoint bit number d0 is always the least-significant bit (LSB). Example: datapoint decimal d13 has four bits d3d2d1d0 = 1101   
di A specific datapoint An eight-point transform has datapoints d0 through d7
i Unknown integer We usually start numbering from i=0
M How many memory banks For purposes of this proof M=4B
m Memory bank Generally need four memory banks per butterfly e.g. m=0 through m=7 for two butterflies (B=2).
mi A specific memory bank Generally need four memory banks per butterfly e.g. m0 through m7 for two butterflies (B=2).
mi Memory bank bit number Generally need four memory banks per butterfly e.g. m2m1m0={000,001 ... 111} for two butterflies (B=2).
Mbits How many bits in each mem bank number m Mbits = log2(M)
N Number of datapoints to transform A power of two, e.g. 8, 16, 1024 or 8096
Po or Po Parity of odd bits Example: decimal 13 has four bits d3d2d1d0 = 1101; thus Po is parity(d3,d1) = parity(1,0) = 1
Pe or Pe Parity of even bits Example: decimal 13 has four bits d3d2d1d0 = 1101; thus Po is parity(d2,d0) = parity(1,1) = 0
R Radix Default is 2
S Number of stages in a transform An eight-point transform has three stages, so S=3
s Stage number An eight-point transform has three stages s=0, s=1 and s=2.

For technical reasons (laziness actually), italics and subscripts might occasionally be dropped, e.g. you'll see d0 instead of d0 etc. I'll hope to fix this later...

Notes/TODO

Nomenclature section should be referenced way up front at the beginning.

OLD: Establish meaning of symbols s (stage number), n_s (how many stages), b_i (bit number i, where b_0 is LSB and b_S is MSB maybe), n_b, r, n_B etc.

Housekeeping/Schedule (or: how/when I'm gonna be done with this)

Wanna be done by FRIIIIIDAYYYYYY! Not gonna happen.

What will be done by Friday:

  • checkoff list for what will be done when (see below)
Checkoff list:
   Target     Actual
   Finish     Finish    Task
  ---------  ---------  ----------------------------------------------
  Tue 11/19  Tue 11/19  Move the old stuff to an auxiliary page. DONE!
  ...
  Tue 11/19  Tue 11/19  Clean up the outline, figure out where to put
                        the "nomenclature" section (see below)
  ...
  Tue 11/19  xxx xx/xx  Build a timeline for what gets done when, 
                        working like twelve hours a week or whatever.
  ...
  Tue 11/19  Tue 11/19  Start on nomenclature section: what is N, n, N_B etc. 
  ...
  Thu 11/19  Tue 11/19  Finish nomenclature section: what is N, n, N_B etc. 
  ...
  Thu 11/19  xxx xx/xx  Continue with the proof, in order of sections not finished
  ...
  TBD List of sections, including when each one will be done
  ...
  xxx xx/xx  xxx xx/xx  Section 5
  xxx xx/xx  xxx xx/xx    5.1 The equation so far
  xxx xx/xx  xxx xx/xx    5.2 Boundary condition, odd number of bits
  xxx xx/xx  xxx xx/xx    5.3 Boundary condition, even number of bits
  xxx xx/xx  xxx xx/xx    5.4 Final equation for r==2 and B==1
  xxx xx/xx  xxx xx/xx    Final polish of section 5
  ...
  xxx xx/xx  xxx xx/xx  Section 6
  xxx xx/xx  xxx xx/xx    6.1 Extension to more than one butterfly unit (B > 1).
  xxx xx/xx  xxx xx/xx    6.2 Extension to radix r > 2.
  xxx xx/xx  xxx xx/xx    6.3 Extension to longer pipelines.

Housekeeping/TODO list

TODO: LaTeX

Later: replace html page with a latex page for publication.

TODO: Notification list

How to sign up for notification of when the proof is finished.

Send me mail, I'll put you on a list and notify you when the proof is done (link to profile page on vlsiwebsite). Or maybe look at the schedule (link to schedule section above/below).

TODO: Nomenclature

Go back and recheck entire proof, see if it is consistent with nomenclature section 9.

TODO: FFT Generator

Make sure FFT generator (correctly) uses algorithm described in the formal proof!

Notes

Hand-written notes where the proof originally was developed:

https://vlsiweb.stanford.edu/mediawiki/index.php/File:Formal_proof_notes.pdf

That algorithm as implemented in the Perl test program:

############################################################################
# E.g. for 8 datapoints
#   stage 0 ($op1,$op2) will be {(0,1),(2,3),(4,5),(6,7)} maybe
#   stage 1 ($op1,$op2) will be {(0,2),(4,6),(1,3),(5,7)} maybe
#   stage 2 ($op1,$op2) will be {(0,4),(1,5),(2,6),(3,7)} maybe

# Normal treatment for stages 0 through nstages-2
if ($s != ($nstages-1)) {
    $op1 = smart_replace($op1, $s);
    $op1 = smart_replace($op1, $s+1);

    $op2 = smart_replace($op2, $s);
    $op2 = smart_replace($op2, $s+1);
}

# Last stage nstages-1 needs special treatment.
elsif ($s%2 == 1) {

    # Even number of bits; use zero instead of s+1 maybe.
    $op1 = smart_replace($op1, $s);
    $op1 = smart_replace($op1, 0);

    $op2 = smart_replace($op2, $s);
    $op2 = smart_replace($op2, 0);
}
else {
    # Odd number of bits; gotta do sumpm crazy maybe.
    $op1 = weird_replace($op1, $s);
    $op2 = weird_replace($op2, $s);
}

# Swizzle to assign correct bank & row to $op1, $op2
my ($b1, $r1) = swizzler::swizzle($op1, $npoints, 4*$nunits);
my ($b2, $r2) = swizzler::swizzle($op2, $npoints, 4*$nunits);

##############################################################################
sub smart_replace {
    # In the given number $n, replace bit $i with new bit $b
    # If $i is  odd, new bit $b is the parity of the  odd bits of $n
    # If $i is even, new bit $b is the parity of the even bits of $n
    #
    my $n = shift;
    my $i = shift;

    my $newbit = $i%2 ? parity_odd($n) : parity_even($n);
    $n = replace_bit($n, $i, $newbit);    # Put $newbit into $n at position $i
    return $n;
}

sub weird_replace {
    # Given number $n and ultimate stage $s, where $s is even, do some weird $#!?
    my $n = shift;
    my $s = shift;

    # This is supposed to work at least for when nbits(datapoints) is odd.  i think.
    my $b0 = ($n & 0x1);
    my $b1 = ($n & 0x2) >> 1;

    my $new_top_bit = parity_even($n) ^ $b1;
    my $new_b1 = parity_odd($n) ^ $b1 ^ $b0;
    my $new_b0 = $b1 ^ $b0;  # Complicated enough yet!?

    $n = replace_bit($n,$s,$new_top_bit); # Put $new_top_bit into $n at position $s
    $n = replace_bit($n, 1,$new_b1);      # Put $new_b1 into $n at position 1
    $n = replace_bit($n, 0,$new_b0);      # Put $new_b0 into $n at position 0 (LSB right?)

    return $n;
}

Other Notes, Trash and Auxiliary Info

Projects: FFT Formal Proof v0 (old)

Projects: FFT Formal Proof - Auxiliary Info

Projects: FFT_Formal_Proof_-_Auxiliary_Info#Trash

Developing the basic equation for one radix-2 butterfly (R==2 and B==1)

The equation so far

[Summarize]

NEXT: ADD LINK TO TAVALA PAPER HERE FOR REFERENCE

http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1205957

http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1205957&tag=1

http://genesis2.stanford.edu/mediawiki/index.php/File:Takala_equations.jpg

Example

An example table
   Datapoint       Address (binary)       P1,P0       bank   
d00
d01
d02
d03
0 0 0 0
0 0 0 1
0 0 1 0
0 0 1 1
0,0
0,1
1,0
1,1
m0
m1
m2
m3
d04
d05
d06
d07
0 1 0 0
0 1 0 1
0 1 1 0
0 1 1 1
0,1
0,0
1,1
1,0
m1
m0
m3
m2
d08
d09
d10
d11
1 0 0 0
1 0 0 1
1 0 1 0
1 0 1 1
1,1
1,0
0,1
0,0
m3
m2
m1
m0
d12
d13
d14
d15
1 1 0 0
1 1 0 1
1 1 1 0
1 1 1 1
1,0
1,1
0,0
0,1
m2
m3
m0
m1

another trash block

where

if (s==S-1) then d'_0 = P_0 and d'_s = P_1 and all other d'_j = d_j
else d'_s = P_s and d'_(s+1) = P_(s+1) and all other d'_j = d_j

another trash block

Given a transform with an even number of stages S, that is, S=log2(N) is an even number where N is the number of datapoints to be transformed (note that the number of address bits is also equal to S); for one butterfly (B=1) we should have four memory banks (M=4) numbered 00 to 11 i.e. m1m0={00,01,10,00}. The mapping of any given datapoint

d=dS-1dS-2...d1d0
to its proper memory bank
m=m1m0
is as follows:
  • m1 = P1 = Podd
  • m0 = P0 = Peven

The equations thus work for number of datapoints N=16,64,256... and thus log(N)=4,6,8... (even) but not for N=8,32,128...where log(N)=3,5,7... is odd.

need to define Pi for the range i=0 to S and number of bits N
then we can say

where

if (j==s) then d'j = Pj
else d'j = dj
and Pj is defined as
if (j==S-1) then Pj = PS-1
di = Ps = P1 = Podd and d0 = Ps = P0 = Peven
dp'i = d'S-1d'S-2...d'1d'0
where
if s==i and s==S-1 then i is odd and di = Ps = P1 = Podd and d0 = Ps = P0 = Peven
else if s==i and s!=S-1) then di = Ps and di+1 = P's+1
⚠️ **GitHub.com Fallback** ⚠️