Projects: FFT Formal Proof Auxiliary Info - StanfordVLSI/Genesis2 GitHub Wiki
Checkoff list:
Target Actual Finish Finish Task --------- --------- ---------------------------------------------- Tue 11/19 Tue 11/19 Move the old stuff to an auxiliary page. DONE! ... Tue 11/19 Tue 11/19 Clean up the outline, figure out where to put the "nomenclature" section (see below) ... Tue 11/19 xxx xx/xx Build a timeline for what gets done when, working like twelve hours a week or whatever. ... Tue 11/19 Tue 11/19 Start on nomenclature section: what is N, n, N_B etc. ... Thu 11/19 Tue 11/19 Finish nomenclature section: what is N, n, N_B etc. ... Thu 11/19 xxx xx/xx Continue with the proof, in order of sections not finished ... TBD List of sections, including when each one will be done ... xxx xx/xx xxx xx/xx Section 5 xxx xx/xx xxx xx/xx 5.1 The equation so far xxx xx/xx xxx xx/xx 5.2 Boundary condition, odd number of bits xxx xx/xx xxx xx/xx 5.3 Boundary condition, even number of bits xxx xx/xx xxx xx/xx 5.4 Final equation for r==2 and B==1 xxx xx/xx xxx xx/xx Final polish of section 5 ... xxx xx/xx xxx xx/xx Section 6 xxx xx/xx xxx xx/xx 6.1 Extension to more than one butterfly unit (B > 1). xxx xx/xx xxx xx/xx 6.2 Extension to radix r > 2. xxx xx/xx xxx xx/xx 6.3 Extension to longer pipelines.
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1205957
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1205957&tag=1
http://genesis2.stanford.edu/mediawiki/index.php/File:Takala_equations.jpg
Bank num m | Bits mn+1mnmn-1...m0 |
---|---|
m0 | mn+1mnmn-1...m0 = 00...002 = 0 |
m1 | mn+1mnmn-1...m1 = 00...012 = 1 |
β | β |
mn | mn+1mnmn-1...m0 = 11...102 = 4B-2 |
m(n+1) | mn+1mnmn-1...m0 = 11...112 = 4B-1 |
- bank m0: mn+1mnmn-1...m0 = 00...002
- bank m1: mn+1mnmn-1...m1 = 00...012
- ...
- bank mn: mn+1mnmn-1...m0 = 11...102
- bank m(n+1): mn+1mnmn-1...m0 = 11...112
- mn+1mnmn-1...m0={(00...002),(00...012), ... (11...102),(11...112)} i.e.
- m0=00..002, m1=00..012, mn=11..102 and m(n+1)=11..112.
Note this part of the proof is old and possibly unnecessary; all the key information is in the section currently labeled "The Problem," above.
Here is an algorithm for an FFT schedule free of interstage memory access conflicts along with an embedded formal proof of correctness. The proof is developed for an FFT implemented as a single time-sliced and pipelined radix-2 butterfly unit, e.g. one radix-two butterfly transforming eight datapoints. The proof will later be extended to cover any power-of-2 number of butterfly units at any power-of-2 radix, e.g. four radix-four butterflies transforming 1024 datapoints.
The initial proof posits an FFT based on the Cooley-Tukey [verify] algorithm, implemented as a single radix-2 butterfly. The butterfly unit repeatedly reads two input operands, calculates two results, then writes the results back to memory, repeating with new input operands until the FFT operation is complete. The butterfly unit is pipelined such that in continuous operation, the read of two new operands happens at the same time as the writeback of the two previous results. [maybe] This obviously requires a minimum of four memory accesses per cycle, using say a single four-ported memory bank, two dual-ported memory banks, or four independent single-ported memory banks. Our schedule targets the most efficient of these choices, four independent banks of single-ported memory.
Given the constraints outlined above, our schedule must place operands in memory and schedule the operand accesses such that any given group of four operands (two reads plus two writes) reside in four separate memory banks. As an example, consider the FFT schedule in Fig. 1 below, for N=8 datapoints. When implemented as a single time-sliced butterfly unit, the memory accesses occur as shown in Fig 2. For simplicity we assume that datapoint dp0 resides at memory location m0, dp1 at m1 and so on.
Fig. 1 goes here: std butterfly network for eight datapoints
Fig. 2: stretched-out version of Fig. 1 I guess
In cycle 0 of Fig. 1, the butterfly unit accesses datapoints from memory locations m0 and m1. In cycle 1, the butterfly unit writes its results back to m0 and m1 and reads new datapoints from m2 and m3. For this to happen without a stall, m0, m1, m2 and m3 must all reside in separate memory banks. Working through the entire diagram, we find the following nine groups of memory locations such that, during some cycle of execution, each memory location within the group gets accessed at the same time as each of the other three members of the group, and thus all four must live in separate banks for conflict-free operation.
{m0,m1,m2,m3} {m2,m3,m4,m5} {m4,m5,m6,m7} {m0,m2,m4,m6} {m4,m6,m1,m5} {m1,m5,m3,m7} {m0,m4,m1,m5} {m1,m5,m2,m6} {m2,m6,m3,m7}
So, for example, locations m0, m1, m2 and m3 can live in banks b0, b1, b2 and b3 respectively, thus satisfying the separateness requirements for group 1. In group 2, locations m2 and m3 have already been assigned to b2 and b3, and thus m4 and m5 must be assigned to b0 and b1. If we follow this through to group 3, we find that m6 and m7 have to reside in banks b2 and b3. By the time we get to group 4, we've already assigned m0, m2, m4 and m6 to b0, b2, b0 and b2, which is a problem because this causes collisions in both banks b0 and b2.
Finding the correct assignment, such that no collisions occur in any of the nine groups, is a very hard problem that may not be solvable for all values of N, much less for our little example where N=8. However, we can change the rules slightly. Note that the operations within a given FFT stage can occur in any order and still yield correct results. Thus in place of the hard-to-schedule network of Fig. 2 we could substitute a different ordering, such as that given by Fig. 3.
Fig. 3: tractable reordered schedule
Now the groups that must reside in separate banks are as follows:
m0 m1 m2 m3 m2 m3 m5 m4 m5 m4 m7 m6 m0 m2 m4 m6 m4 m6 m5 m1 m5 m1 m7 m3 m0 m4 m7 m3 m7 m3 m5 m1 m5 m1 m2 m6
Given this schedule, we can achieve our goal by placing datapoints m0 and m5 in bank b0, m1 and m4 in bank b1, m2 and m7 in bank b2, and m3 and m6 in bank b3:
[Below,]
A B C D C D A B A B C D m0 m1 m2 m3 m2 m3 m5 m4 m5 m4 m7 m6 A C B D B D A C A C B D m0 m2 m4 m6 m4 m6 m5 m1 m5 m1 m7 m3 A B C D C D A B A B C D m0 m4 m7 m3 m7 m3 m5 m1 m5 m1 m2 m6
[maybe]
The remainder of this proof will show how to deterministically achieve a schedule and a mapping for similar conflict-free operation given any number of datapoints N=2^i, where i is any integer greater than 2 (cases for i <= 2 are trivial and obvious).
The Cooley-Tukey algorithm [check] for sixteen datapoints yields the following schedule:
stage 0: 0 1 2 3 4 5 6 7 8 9 a b c d e f stage 1: 0 2 4 6 8 a c e 1 3 5 7 9 b d f stage 2: 0 4 8 c 1 5 9 d 2 6 a e 3 7 b f stage 3: 0 8 1 9 2 a 3 b 4 c 5 d 6 e 7 f
If we write these operands in binary form it looks something like this:
stage 0 stage 1 stage 2 stage 3 ------- ------- ------- ------- 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 1 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 1 1 0 1 1 0 1 1 0 0 1 0 0 1 0 1 0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 1 0 1 1 0 1 0 0 1 0 1 1 0 1 0 0 1 1 0 1 1 0 0 1 0 0 1 0 0 1 1 0 1 1 1 1 1 1 0 1 1 0 1 1 0 1 1 1 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 1 0 1 1 0 1 1 0 0 1 0 1 0 0 1 0 1 1 0 1 0 0 1 0 1 1 0 1 1 0 1 1 1 1 1 1 0 1 1 0 1 1 1 0 0 1 0 0 1 0 0 1 1 0 1 1 0 1 1 0 1 1 0 1 1 0 1 1 1 1 1 1 0 1 1 1 0 1 1 0 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Note that, in any given group of four within any given stage, only two bits toggle while the remaining bits stay the same. For the purposes of this discussion, we'll number the bits 0 to 3, right to left, where bit 0 is the least significant bit (LSB) and bit 3 is the most significant bit (MSB).
In stage 0, bits 0 and 1 toggle while bits 2 and 3 stay at 00 (in the first group of four), then 01 (in the second group), then 10 and then 11.
In stage 1 it is bits 1 and 2 that toggle while bits 3 and 0 remain constant within each group of four.
In general, for stage s, bits s+1 and s within any group of four will count 00,01,10,11 while the remaining bits s-1, s-2, ... increment only at group-of-four boundaries, where s-i wraps back to the MSB if s-i is less than 0. More precisely, given S stages [0..(S-1]], then within each group of four in stage s, bits
b[(s+1)] bs
count (00,01,10,11) while the remaining bits
b[(s+S)] b[(s+(S-1))] b[(s+(S-2))] ... b[(s+2)]
increment (0,1,10,11,100,101...) at each group-of-four boundary.
Our example illustrates this for N=16 datapoints, but the principle holds for any arbitrarily large number of datapoints. Bits s+1 and s count 00 to 11 while the remaining bits increment at the end of each group of four, counting up e.g. if N=1024 then the second group-of-four in stage 2 would be 0000001(00)0, 0000001(01)0, 0000001(10)0, 0000001(11)0.
Note that things get a little "funny" in the final stage of execution, when we run out of bits. In our example, at stage 3, it is not bits 3 and 4 that count up from 00 to 11 in each group of four (bit 4 doesn't exist); it is bits 3 and 0. That is, the toggling pair has wrapped back around to the beginning of the word. Things get even "funnier" when there is an odd number of bits, we'll talk about this later.
We're going to use the property we've just discovered to produce a conflict-free schedule. In particular, we're going to make sure that 1) within each group of four, the four operands will map to the four separate banks b0, b1, b2 and b3 in some order. And then 2) we're going to reorder the operations within each group of four such that bank accesses are strictly ordered within the group, namely b0 then b1 then b2 then b3 for even-numbered stages and b0 then b2 then b1 then b3 for odd-numbered stages(!)
Note that in any group of EIGHT operands, bits (s, s+1, s+2) count 000,001,010,011,100,101,110,111 while the remaining bits stay the same. Similarly for groups of 16, 32, 64, etc. etc. This allows us to extend our earlier proof in the following way...
Note that the formulation for a radix-four schedule is similar to that for two radix-2 butterflies operating in parallel. This is the basis for an obvious extension to any power-of-two radix greater than two etc. etc.
Yes it is because within any given group of four, the order of operations is unimportant; also the order of pairs within a given operation is unimportant. In other words, looking again at the last four operands in Stage 1 of our example above, the given order is (d1,d3)(d5,d7) but all EIGHT of the following possible sequences are equally valid:
- (d1,d3)(d5,d7); (d1,d3)(d7,d5); (d3,d1)(d5,d7); (d3,d1)(d7,d5);
- (d5,d7)(d1,d3); (d5,d7)(d3,d1); (d7,d5)(d1,d3); (d7,d5)(d3,d1).
When the stage number is odd, we know that ds is an odd bit and thus it is Pod that will be either (0,1,0,1) or (1,0,1,0) while Pev is (0,0,1,1) or (1,1,0,0). Bank number PodPev for a quad will thus be one of: (00,10,01,11); (01,11,00,10); (10,00,11,01); or (11,01,10,00).
Now let's conspire to make the memory banks count consistently (m0,m1,m2,m3, m0,m1,m2,m3,...) for even-numbered stages. For this to happen, Pev will need to consistently count (0,1,0,1, 0,1,0,1, ...) and Pod will need to consistently count (0,0,1,1, 0,0,1,1, ...) such that memory bank number PodPev will count (00,01,10,11, 00,01,10,11, ...)
If, however, we replace b(s) with Pev then we know that Pev will always count (0,1,0,1) (why?)
in particular if po/pe==00 then the bank order will be (00,01,10,11); if pope==01 the order will be (01,00,11,10); if pope==10 the order will be (10,11,00,01); and if pope==11 the order will be (11,10,01,00).
IF WE REPLACE BS+1/BS WITH POPE WE ALWAYS GET THE ORDER (00,01,10,11) (WHY?)
Note that IN ALL CASES within any group of four, the operands map one each to the four separate banks 00, 01, 10 and 11 in some order.
A B C D m0 m1 m2 m3 m5 m4 m7 m6 m0 m2 m4 m6 m5 m1 m7 m3 m0 m4 m7 m3 m5 m1 m2 m6
INSIGHT NUMBER TWO: Within any given group of four, the order of operations is unimportant; also the order of pairs within a given operation is unimportant. In other words, looking again at the last four operands in stage 1 or our example, the given order is (d1,d3)(d5,d7) but all EIGHT of the following possible sequences are equally valid:
* (d1,d3)(d5,d7) (d1,d3)(d7,d5) (d3,d1)(d5,d7) (d3,d1)(d7,d5) * (d5,d7)(d1,d3) (d5,d7)(d3,d1) (d7,d5)(d1,d3) (d7,d5)(d3,d1)
When we change the order of operands, we change the order of bank numbers being accessed; in particular, because stage 1 is odd, our eight reorderings correspond to the following eight possible bank sequences:
* (01,11)(00,10) (01,11)(10,00) (11,01)(00,10) (11,01)(10,00) * (00,10)(01,11) (00,10)(11,01) (10,00)(01,11) (10,00)(11,01)
IN PARTICULAR IF WE REPLACE THE TOGGLING BITS B(S+1),B(S) WITH THEIR PARITAL EQUIVALENTS P(S+1)P(S) WE ALWAYS GET A GROUP-OF-FOUR BANK ORDER (00,01,10,11) FOR THE EVEN STAGES AND (00,10,01,11) FOR THE ODD STAGES (IS THIS TRUE!!??)
In the general case, there are only four cases possible for the non-toggling bits, based on the parity of the even-numbered bits p(e) vs. the odd-numbered bits p(o). The four cases are p(o)p(e) = 00, 01, 10 and 11 respectively. We will assign banks to operands such that banknum = f(Po,Pe) where Po is the parity of ALL odd bits and Pe is the parity of ALL even bits. This means that any given sequential block of four operands will map to four separate banks. In our example,
- p(o) = parity(b3) = parity(0) = 0 and
- p(e) = parity(b4b0 = parity(01) = 1
(po,pe)(b(s+1)b(s)) = 00(00,01,10,11) 01(00,01,10,11) 10(00,01,10,11) and 11(00,01,10,11)
respectively, such that in each case the banknum counts (00,01,10,11) when s is even and consequently will count (00,10,01,11) when s is odd.
In our example (d1,d3,d5,d7) it is bits 1 and 2 that count while the remaining bit b0 is always 1. For this group, then, po is 0 (the empty set, actually) and pe is 1. Solving for group (d1,d3,d5,d7) solves for all groups such that s is odd, po is 0 and p1 is 1, such as
(d1,d3,d5,d7) (d25,d28,d30,d32) (41,d43,d45,d47) (d1,d9,d17,d25) (d2,d10,d18,d26) (d7,d15,d23,d31) etc.
[maybe]
Datapoint | Address (binary) | P1,P0 | bank |
---|---|---|---|
d00 d01 d02 d03 |
0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 1 |
0,0 0,1 1,0 1,1 |
m0 m1 m2 m3 |
d04 d05 d06 d07 |
0 1 0 0 0 1 0 1 0 1 1 0 0 1 1 1 |
0,1 0,0 1,1 1,0 |
m1 m0 m3 m2 |
d08 d09 d10 d11 |
1 0 0 0 1 0 0 1 1 0 1 0 1 0 1 1 |
1,1 1,0 0,1 0,0 |
m3 m2 m1 m0 |
d12 d13 d14 d15 |
1 1 0 0 1 1 0 1 1 1 1 0 1 1 1 1 |
1,0 1,1 0,0 0,1 |
m2 m3 m0 m1 |
<=== (Stages 0, 1, 2, same as before) ===> | (new) | ||
---|---|---|---|
Stage 0 | Stage 1 | Stage 2 | Stage 3 |
|
|
|
|
d0=0000 | d0=0000 | d0=0000 | d0=0000 |
d1=0001 | d2=0010 | d2=0100 | d4=1000 |
d2=0010 | d4=0100 | d4=1000 | d1=0001 |
d3=0011 | d6=0110 | d6=1100 | d5=1001 |
|
|
|
|
d4=0100 | d1=0001 | d1=0001 | d2=0010 |
d5=0101 | d3=0011 | d3=0101 | d6=1010 |
d6=0110 | d5=0101 | d5=1001 | d3=0011 |
d7=0111 | d7=0111 | d7=1101 | d7=1011 |
|
|
|
|
... | ... | ... | ... |
d7=1110 | d7=1101 | d7=1011 | d7=0111 |
d7=1111 | d7=1111 | d7=1111 | d7=1111 |
Given
- a transform with N datapoints such that N is a power of two and N β₯ 16; and
- an even number of stages S, that is, S=log2(N) is an even number where N is the number of datapoints to be transformed (note that the number of address bits is also equal to S).
The following equation(s) transform a given stage's datapoint sequence
- (dp(N-1) dp(N-2)...dp1,dp0)
- (dp'(N-1) dp'(N-2)...dp'1,dp'0)
- (mb'(N-1) mb'(N-2)...mb'0,mb'1)
- (m0,m1,m2,m3, m0,m1,m2,m3 ... m0,m1,m2,m3),
So:
For each S-bit datapoint dps,i in a given transform stage s β (0..S-1),
- dps,i = dS-1dS-2...d1d0
- dp's,i = d'S-1d'S-2...d'1d'0
if (s==S-1) then | d'0 = P0 and d's = P1 | and all other d'j = dj , where j β (1..S-2) |
else | d's = Ps and d'(s+1) = P(s+1) | and all other d'j = dj , where j β (0..S-1) and j β {s,s+1} |
and
Pk = P0 = Peven | if k is even and |
Pk = P1 = Podd | if k is odd |
and
- P0 = Peven = Parity(dS-2,dS-4...d2,d0)
- P1 = Podd = Parity(dS-1,dS-3...d3,d1)
The equations work for
- a transform with N datapoints such that N is a power of two and N β₯ 16; and
- an even number of stages S, that is, S=log2(N) is an even number where N is the number of datapoints to be transformed (note that the number of address bits is also equal to S).
Datapoint d (decimal) |
Address (binary) = d3d2d1d0 |
P1,P0 |
Datapoint d' = Replace ds+1ds (d1d0) with P1P0 (reorder) |
P'1,P'0 | bank = P'1P'0 |
---|---|---|---|---|---|
d00 d01 d02 d03 |
0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 1 |
0,0 0,1 1,0 1,1 |
0 0 0 0 = d00 0 0 0 1 = d01 0 0 1 0 = d02 0 0 1 1 = d03 |
0,0 0,1 1,0 1,1 |
m0 m1 m2 m3 |
d04 d05 d06 d07 |
0 1 0 0 0 1 0 1 0 1 1 0 0 1 1 1 |
0,1 0,0 1,1 1,0 |
0 1 0 1 = d05 0 1 0 0 = d04 0 1 1 1 = d07 0 1 1 0 = d06 |
0,0 0,1 1,0 1,1 |
m0 m1 m2 m3 |
d08 d09 d10 d11 |
1 0 0 0 1 0 0 1 1 0 1 0 1 0 1 1 |
1,1 1,0 0,1 0,0 |
1 0 1 1 = d11 1 0 1 0 = d10 1 0 0 1 = d09 1 0 0 0 = d08 |
0,0 0,1 1,0 1,1 |
m0 m1 m2 m3 |
d12 d13 d14 d15 |
1 1 0 0 1 1 0 1 1 1 1 0 1 1 1 1 |
1,0 1,1 0,0 0,1 |
1 1 1 0 = d14 1 1 1 1 = d15 1 1 0 0 = d12 1 1 0 1 = d13 |
0,0 0,1 1,0 1,1 |
m0 m1 m2 m3 |
Datapoint d=d' (decimal) |
Address (binary) = d3d2d1d0 = d'3d'2d'1d'0 |
P'1,P'0 |
Reorder: d" = d'3d'2P'1P'0 |
P"1,P"0 | bank = P"1P"0 |
---|---|---|---|---|---|
d00 d01 d02 d03 |
0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 1 |
0,0 0,1 1,0 1,1 |
0 0 0 0 = d00 0 0 0 1 = d01 0 0 1 0 = d02 0 0 1 1 = d03 |
0,0 0,1 1,0 1,1 |
m0 m1 m2 m3 |
d04 d05 d06 d07 |
0 1 0 0 0 1 0 1 0 1 1 0 0 1 1 1 |
0,1 0,0 1,1 1,0 |
0 1 0 1 = d05 0 1 0 0 = d04 0 1 1 1 = d07 0 1 1 0 = d06 |
0,0 0,1 1,0 1,1 |
m0 m1 m2 m3 |
d08 d09 d10 d11 |
1 0 0 0 1 0 0 1 1 0 1 0 1 0 1 1 |
1,1 1,0 0,1 0,0 |
1 0 1 1 = d11 1 0 1 0 = d10 1 0 0 1 = d09 1 0 0 0 = d08 |
0,0 0,1 1,0 1,1 |
m0 m1 m2 m3 |
d12 d13 d14 d15 |
1 1 0 0 1 1 0 1 1 1 1 0 1 1 1 1 |
1,0 1,1 0,0 0,1 |
1 1 1 0 = d14 1 1 1 1 = d15 1 1 0 0 = d12 1 1 0 1 = d13 |
0,0 0,1 1,0 1,1 |
m0 m1 m2 m3 |
Datapoint d (decimal) |
Address (binary) = d3d2d1d0 |
P1,P0 |
Datapoint d' = Replace ds+1ds (d1d0) with P1P0 (reorder) |
P'1,P'0 | bank = P'1P'0 |
---|---|---|---|---|---|
d00 d01 d02 d03 |
0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 1 |
0,0 0,1 1,0 1,1 |
0 0 0 0 = d00 0 0 0 1 = d01 0 0 1 0 = d02 0 0 1 1 = d03 |
0,0 0,1 1,0 1,1 |
m0 m1 m2 m3 |
d04 d05 d06 d07 |
0 1 0 0 0 1 0 1 0 1 1 0 0 1 1 1 |
0,1 0,0 1,1 1,0 |
0 1 0 1 = d05 0 1 0 0 = d04 0 1 1 1 = d07 0 1 1 0 = d06 |
0,0 0,1 1,0 1,1 |
m0 m1 m2 m3 |
d08 d09 d10 d11 |
1 0 0 0 1 0 0 1 1 0 1 0 1 0 1 1 |
1,1 1,0 0,1 0,0 |
1 0 1 1 = d11 1 0 1 0 = d10 1 0 0 1 = d09 1 0 0 0 = d08 |
0,0 0,1 1,0 1,1 |
m0 m1 m2 m3 |
d12 d13 d14 d15 |
1 1 0 0 1 1 0 1 1 1 1 0 1 1 1 1 |
1,0 1,1 0,0 0,1 |
1 1 1 0 = d14 1 1 1 1 = d15 1 1 0 0 = d12 1 1 0 1 = d13 |
0,0 0,1 1,0 1,1 |
m0 m1 m2 m3 |
If the number of bits is odd, we have a slight problem. Everything is fine as before from stage 0 up through the penultimate stage S-2. But then things change for the final stage S-1, where the toggling bits are the two outside bits dS-1 (fast toggle 0,1,0,1) and d0 (slow toggle 0,0,1,1). In the previous case, where there were an even number of address bits, we knew that dS-1 was odd and d0 was even, and thus Podd would toggle 0,1,0,1 (or 1,0,1,0, depending on the value of the non-toggling bits) while Peven would toggle 0,0,1,1 (or 1,1,0,0).
In the case where the number of bits is odd, however, the two outside (toggling) bits dS-1 and d0 are both even. Thus, as d0dS-1 count (00,01,10,11), Podd does not change at all (because the toggling bits are both even) while Peven counts either (0,1,1,0) (or (1,0,0,1), depending on the value of the non-toggling bits) instead of its usual 0-0-1-1 (or 1-1-0-0). Thus, if base the memory bank number on PoddPeven we only get two memory banks repeated twice, e.g. 00-01-00-01, instead of the desired mapping to four separate memory banks 00-01-10-11
Talk about what happens in final stage when s==nbits and nbits is odd.
Stage 0 | Stage 1 | Stage 2 |
---|---|---|
d0=000 | d0=000 | d0=000 |
d1=001 | d2=010 | d4=100 |
d2=010 | d4=100 | d1=001 |
d3=011 | d6=110 | d5=101 |
d4=100 | d1=001 | d2=010 |
d5=101 | d3=011 | d6=110 |
d6=110 | d5=101 | d3=011 |
d7=111 | d7=111 | d7=111 |
Try this: for stage s=S-1, instead of
- d_s = d_(S-1) = P_s = P_1 (because S is even and thus S-1 is odd)
- d_(s+1) = d_0 = P_(s+1) = P_0,
do this maybe
- d_s = d_(S-1) = xor(P_s,d_0) = xor(P_(S-1),d_0) = xor(P_0,d_0) (because S is odd and thus (S-1) is even) (also note xor(P_0,d_0) is same as P_0) (right?)
- d_(s+1) = d_S = d_1 (why? because d_(S-1)==d_0 sorta?) (because
- d_1 = xor(P_(s+1),d_0) = xor(P_1,d_0)
Equation for mapping addresses to banks for one butterly (B==1) at radix 2 (r==2)
Projects: FFT_Formal_Proof_-_Auxiliary_Info#Sketching_out_the_remainder_of_the_proof
Eventually we will find and show that, by extension, if we have
- an FFT with B butterflies (and thus M=4B memory banks);
- a transform with S=log2(N) stages such that modMbits(S)=0, where Mbits=log2 (M)
- (e.g. B=1βM=4βMbits=2 and modMbits(S)=mod2(S)=0βnumber of stages S is even);
everything from here down is wrong, i think; should probably instead copy and modify finished algorithm 1 from up above
The mapping of any given datapoint
- d =dS-1dS-2...d1d0
- m=mMbits-1mMbits-2...m1m0
- mMbits-1 = P(Mbits-1)/Mbits
- mMbits-2 = P(Mbits-2)/Mbits
- m1 = P1/Mbits
- m0 = P0/Mbits
- PMbits-1/Mbits = Parity(dS-1,dS-Mbits-1...d2Mbits-1,dMbits-1)
- PMbits-2/Mbits = Parity(dS-2,dS-Mbits-2...d2Mbits-2,dMbits-2)
- P1/Mbits = Parity(dS-(Mbits-1),dS-(2Mbits-1)...dMbits+1,d1)
- P0/Mbits = Parity(dS-(Mbits),dS-(2Mbits)...dMbits,d0)