Table of Contents Housekeeping/Schedule (or: how/when I'm gonna be done with this) Takala equation and paper TODO/NEXT/CONTINUE FROM HERE ---------------------------- SCRUBBED TO HERE MAYBE Notes - bank num encoding table The Proof (old) Sketching out the remainder of the proof extending to nbutts > 1 extending to radix > 2 Trash trash 1/27/2014 trash Trash: clean example table Trash: non-adjacent toggle bits table Trash: Algorithm 1: One radix-2 butterfly (R==2 and B==1) and even number of address bits Example mapping for stage s=0 of a transform for N=16 datapoints and one butterfly (B=1) Example mapping for stage s=0 of a transform for N=16 datapoints and one butterfly (B=1) (new) Example mapping for stage s=0 of a transform for N=16 datapoints and one butterfly (B=1) (old) TRASH/OLD Boundary condition, odd number of bits Final equation for when nbits is odd and R==2 and B==1 TRASH/OLD Algorithm 2: Final equation(s) for R=2 and B=1 TRASH/OLD The ... REST ... of the story? TRASH/OLD Extension to more than one butterfly unit (B > 1). Generalization of Algorithm 1, extended for any B=2ⁱ and any number of address bits S: mod_MbitsS=0 everything from here down is wrong, i think; should probably instead copy and modify finished algorithm 1 from up above

Housekeeping/Schedule (or: how/when I'm gonna be done with this)

Checkoff list:

   Target     Actual
   Finish     Finish    Task
  ---------  ---------  ----------------------------------------------
  Tue 11/19  Tue 11/19  Move the old stuff to an auxiliary page. DONE!
  ...
  Tue 11/19  Tue 11/19  Clean up the outline, figure out where to put
                        the "nomenclature" section (see below)
  ...
  Tue 11/19  xxx xx/xx  Build a timeline for what gets done when, 
                        working like twelve hours a week or whatever.
  ...
  Tue 11/19  Tue 11/19  Start on nomenclature section: what is N, n, N_B etc. 
  ...
  Thu 11/19  Tue 11/19  Finish nomenclature section: what is N, n, N_B etc. 
  ...
  Thu 11/19  xxx xx/xx  Continue with the proof, in order of sections not finished
  ...
  TBD List of sections, including when each one will be done
  ...
  xxx xx/xx  xxx xx/xx  Section 5
  xxx xx/xx  xxx xx/xx    5.1 The equation so far
  xxx xx/xx  xxx xx/xx    5.2 Boundary condition, odd number of bits
  xxx xx/xx  xxx xx/xx    5.3 Boundary condition, even number of bits
  xxx xx/xx  xxx xx/xx    5.4 Final equation for r==2 and B==1
  xxx xx/xx  xxx xx/xx    Final polish of section 5
  ...
  xxx xx/xx  xxx xx/xx  Section 6
  xxx xx/xx  xxx xx/xx    6.1 Extension to more than one butterfly unit (B > 1).
  xxx xx/xx  xxx xx/xx    6.2 Extension to radix r > 2.
  xxx xx/xx  xxx xx/xx    6.3 Extension to longer pipelines.

Takala equation and paper

http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1205957

http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1205957&tag=1

http://genesis2.stanford.edu/mediawiki/index.php/File:Takala_equations.jpg

TODO/NEXT/CONTINUE FROM HERE ----------------------------

SCRUBBED TO HERE MAYBE

Notes - bank num encoding table

Bank num m	Bits m_n+1m_nm_n-1...m₀
m0	m_n+1m_nm_n-1...m₀ = 00...00₂ = 0
m1	m_n+1m_nm_n-1...m₁ = 00...01₂ = 1
↓	↓
mn	m_n+1m_nm_n-1...m₀ = 11...10₂ = 4B-2
m(n+1)	m_n+1m_nm_n-1...m₀ = 11...11₂ = 4B-1

bank m0: m_n+1m_nm_n-1...m₀ = 00...00₂
bank m1: m_n+1m_nm_n-1...m₁ = 00...01₂
...
bank mn: m_n+1m_nm_n-1...m₀ = 11...10₂
bank m(n+1): m_n+1m_nm_n-1...m₀ = 11...11₂

OLD

m_n+1m_nm_n-1...m₀={(00...00₂),(00...01₂), ... (11...10₂),(11...11₂)} i.e.
m0=00..00₂, m1=00..01₂, mn=11..10₂ and m(n+1)=11..11₂.

The Proof (old)

Note this part of the proof is old and possibly unnecessary; all the key information is in the section currently labeled "The Problem," above.

Here is an algorithm for an FFT schedule free of interstage memory access conflicts along with an embedded formal proof of correctness. The proof is developed for an FFT implemented as a single time-sliced and pipelined radix-2 butterfly unit, e.g. one radix-two butterfly transforming eight datapoints. The proof will later be extended to cover any power-of-2 number of butterfly units at any power-of-2 radix, e.g. four radix-four butterflies transforming 1024 datapoints.

The initial proof posits an FFT based on the Cooley-Tukey [verify] algorithm, implemented as a single radix-2 butterfly. The butterfly unit repeatedly reads two input operands, calculates two results, then writes the results back to memory, repeating with new input operands until the FFT operation is complete. The butterfly unit is pipelined such that in continuous operation, the read of two new operands happens at the same time as the writeback of the two previous results. [maybe] This obviously requires a minimum of four memory accesses per cycle, using say a single four-ported memory bank, two dual-ported memory banks, or four independent single-ported memory banks. Our schedule targets the most efficient of these choices, four independent banks of single-ported memory.

Given the constraints outlined above, our schedule must place operands in memory and schedule the operand accesses such that any given group of four operands (two reads plus two writes) reside in four separate memory banks. As an example, consider the FFT schedule in Fig. 1 below, for N=8 datapoints. When implemented as a single time-sliced butterfly unit, the memory accesses occur as shown in Fig 2. For simplicity we assume that datapoint dp0 resides at memory location m0, dp1 at m1 and so on.

Fig. 1 goes here: std butterfly network for eight datapoints

Fig. 2: stretched-out version of Fig. 1 I guess

In cycle 0 of Fig. 1, the butterfly unit accesses datapoints from memory locations m0 and m1. In cycle 1, the butterfly unit writes its results back to m0 and m1 and reads new datapoints from m2 and m3. For this to happen without a stall, m0, m1, m2 and m3 must all reside in separate memory banks. Working through the entire diagram, we find the following nine groups of memory locations such that, during some cycle of execution, each memory location within the group gets accessed at the same time as each of the other three members of the group, and thus all four must live in separate banks for conflict-free operation.

 {m0,m1,m2,m3} {m2,m3,m4,m5} {m4,m5,m6,m7}
 {m0,m2,m4,m6} {m4,m6,m1,m5} {m1,m5,m3,m7}
 {m0,m4,m1,m5} {m1,m5,m2,m6} {m2,m6,m3,m7}

So, for example, locations m0, m1, m2 and m3 can live in banks b0, b1, b2 and b3 respectively, thus satisfying the separateness requirements for group 1. In group 2, locations m2 and m3 have already been assigned to b2 and b3, and thus m4 and m5 must be assigned to b0 and b1. If we follow this through to group 3, we find that m6 and m7 have to reside in banks b2 and b3. By the time we get to group 4, we've already assigned m0, m2, m4 and m6 to b0, b2, b0 and b2, which is a problem because this causes collisions in both banks b0 and b2.

Finding the correct assignment, such that no collisions occur in any of the nine groups, is a very hard problem that may not be solvable for all values of N, much less for our little example where N=8. However, we can change the rules slightly. Note that the operations within a given FFT stage can occur in any order and still yield correct results. Thus in place of the hard-to-schedule network of Fig. 2 we could substitute a different ordering, such as that given by Fig. 3.

Fig. 3: tractable reordered schedule

Now the groups that must reside in separate banks are as follows:

 m0 m1 m2 m3    m2 m3 m5 m4    m5 m4 m7 m6
 m0 m2 m4 m6    m4 m6 m5 m1    m5 m1 m7 m3
 m0 m4 m7 m3    m7 m3 m5 m1    m5 m1 m2 m6

Given this schedule, we can achieve our goal by placing datapoints m0 and m5 in bank b0, m1 and m4 in bank b1, m2 and m7 in bank b2, and m3 and m6 in bank b3:

[Below,]

  A  B  C  D     C  D  A  B     A  B  C  D
 m0 m1 m2 m3    m2 m3 m5 m4    m5 m4 m7 m6

  A  C  B  D     B  D  A  C     A  C  B  D
 m0 m2 m4 m6    m4 m6 m5 m1    m5 m1 m7 m3

  A  B  C  D     C  D  A  B     A  B  C  D
 m0 m4 m7 m3    m7 m3 m5 m1    m5 m1 m2 m6

[maybe]

The remainder of this proof will show how to deterministically achieve a schedule and a mapping for similar conflict-free operation given any number of datapoints N=2^i, where i is any integer greater than 2 (cases for i <= 2 are trivial and obvious).

Sketching out the remainder of the proof

The Cooley-Tukey algorithm [check] for sixteen datapoints yields the following schedule:

  stage 0:  0 1 2 3  4 5 6 7  8 9 a b  c d e f
  stage 1:  0 2 4 6  8 a c e  1 3 5 7  9 b d f
  stage 2:  0 4 8 c  1 5 9 d  2 6 a e  3 7 b f
  stage 3:  0 8 1 9  2 a 3 b  4 c 5 d  6 e 7 f

If we write these operands in binary form it looks something like this:

  stage 0    stage 1    stage 2    stage 3
  -------    -------    -------    -------    
  0 0 0 0    0 0 0 0    0 0 0 0    0 0 0 0
  0 0 0 1    0 0 1 0    0 1 0 0    1 0 0 0
  0 0 1 0    0 1 0 0    1 0 0 0    0 0 0 1
  0 0 1 1    0 1 1 0    1 1 0 0    1 0 0 1

  0 1 0 0    1 0 0 0    0 0 0 1    0 0 1 0
  0 1 0 1    1 0 1 0    0 1 0 1    1 0 1 0
  0 1 1 0    1 1 0 0    1 0 0 1    0 0 1 1
  0 1 1 1    1 1 1 0    1 1 0 1    1 0 1 1

  1 0 0 0    0 0 0 1    0 0 1 0    0 1 0 0
  1 0 0 1    0 0 1 1    0 1 1 0    1 1 0 0
  1 0 1 0    0 1 0 1    1 0 1 0    0 1 0 1
  1 0 1 1    0 1 1 1    1 1 1 0    1 1 0 1

  1 1 0 0    1 0 0 1    0 0 1 1    0 1 1 0
  1 1 0 1    1 0 1 1    0 1 1 1    1 1 1 0
  1 1 1 0    1 1 0 1    1 0 1 1    0 1 1 1
  1 1 1 1    1 1 1 1    1 1 1 1    1 1 1 1

Note that, in any given group of four within any given stage, only two bits toggle while the remaining bits stay the same. For the purposes of this discussion, we'll number the bits 0 to 3, right to left, where bit 0 is the least significant bit (LSB) and bit 3 is the most significant bit (MSB).

In stage 0, bits 0 and 1 toggle while bits 2 and 3 stay at 00 (in the first group of four), then 01 (in the second group), then 10 and then 11.

In stage 1 it is bits 1 and 2 that toggle while bits 3 and 0 remain constant within each group of four.

In general, for stage s, bits s+1 and s within any group of four will count 00,01,10,11 while the remaining bits s-1, s-2, ... increment only at group-of-four boundaries, where s-i wraps back to the MSB if s-i is less than 0. More precisely, given S stages [0..(S-1]], then within each group of four in stage s, bits

b_[(s+1)] b_s

count (00,01,10,11) while the remaining bits

b_[(s+S)] b_[(s+(S-1))] b_[(s+(S-2))] ... b_[(s+2)]

increment (0,1,10,11,100,101...) at each group-of-four boundary.

Our example illustrates this for N=16 datapoints, but the principle holds for any arbitrarily large number of datapoints. Bits s+1 and s count 00 to 11 while the remaining bits increment at the end of each group of four, counting up e.g. if N=1024 then the second group-of-four in stage 2 would be 0000001(00)0, 0000001(01)0, 0000001(10)0, 0000001(11)0.

Note that things get a little "funny" in the final stage of execution, when we run out of bits. In our example, at stage 3, it is not bits 3 and 4 that count up from 00 to 11 in each group of four (bit 4 doesn't exist); it is bits 3 and 0. That is, the toggling pair has wrapped back around to the beginning of the word. Things get even "funnier" when there is an odd number of bits, we'll talk about this later.

We're going to use the property we've just discovered to produce a conflict-free schedule. In particular, we're going to make sure that 1) within each group of four, the four operands will map to the four separate banks b0, b1, b2 and b3 in some order. And then 2) we're going to reorder the operations within each group of four such that bank accesses are strictly ordered within the group, namely b0 then b1 then b2 then b3 for even-numbered stages and b0 then b2 then b1 then b3 for odd-numbered stages(!)

extending to nbutts > 1

Note that in any group of EIGHT operands, bits (s, s+1, s+2) count 000,001,010,011,100,101,110,111 while the remaining bits stay the same. Similarly for groups of 16, 32, 64, etc. etc. This allows us to extend our earlier proof in the following way...

extending to radix > 2

Note that the formulation for a radix-four schedule is similar to that for two radix-2 butterflies operating in parallel. This is the basis for an obvious extension to any power-of-two radix greater than two etc. etc.

Trash

trash 1/27/2014

Yes it is because within any given group of four, the order of operations is unimportant; also the order of pairs within a given operation is unimportant. In other words, looking again at the last four operands in Stage 1 of our example above, the given order is (d1,d3)(d5,d7) but all EIGHT of the following possible sequences are equally valid:

(d1,d3)(d5,d7); (d1,d3)(d7,d5); (d3,d1)(d5,d7); (d3,d1)(d7,d5);
(d5,d7)(d1,d3); (d5,d7)(d3,d1); (d7,d5)(d1,d3); (d7,d5)(d3,d1).

Further suppose that, when stage number s is even and when P_od is about to count (1,1,0,0), we change d_s+1 (a component of P_od, because we know s+1 is odd) such that it counts (1,1,0,0) instead of (0,0,1,1). (In our example sequence (d1,d3,d5,d7) we would not need to make this change, because P_od already counts the way we want it to.) Is it okay to change the sequence like this? Yes it is because this is a valid reordering, as discussed above. And now P_od for the new ordering counts (0,0,1,1) instead of the original (1,1,0,0).

When the stage number is odd, we know that d_s is an odd bit and thus it is P_od that will be either (0,1,0,1) or (1,0,1,0) while P_ev is (0,0,1,1) or (1,1,0,0). Bank number P_odP_ev for a quad will thus be one of: (00,10,01,11); (01,11,00,10); (10,00,11,01); or (11,01,10,00).

Now let's conspire to make the memory banks count consistently (m0,m1,m2,m3, m0,m1,m2,m3,...) for even-numbered stages. For this to happen, Pev will need to consistently count (0,1,0,1, 0,1,0,1, ...) and Pod will need to consistently count (0,0,1,1, 0,0,1,1, ...) such that memory bank number PodPev will count (00,01,10,11, 00,01,10,11, ...)

If, however, we replace b(s) with P_ev then we know that P_ev will always count (0,1,0,1) (why?)

in particular if po/pe==00 then the bank order will be (00,01,10,11); if pope==01 the order will be (01,00,11,10); if pope==10 the order will be (10,11,00,01); and if pope==11 the order will be (11,10,01,00).

IF WE REPLACE BS+1/BS WITH POPE WE ALWAYS GET THE ORDER (00,01,10,11) (WHY?)

Note that IN ALL CASES within any group of four, the operands map one each to the four separate banks 00, 01, 10 and 11 in some order.

trash

  A  B  C  D
 m0 m1 m2 m3
 m5 m4 m7 m6
 m0 m2 m4 m6
 m5 m1 m7 m3
 m0 m4 m7 m3
 m5 m1 m2 m6

INSIGHT NUMBER TWO: Within any given group of four, the order of operations is unimportant; also the order of pairs within a given operation is unimportant. In other words, looking again at the last four operands in stage 1 or our example, the given order is (d1,d3)(d5,d7) but all EIGHT of the following possible sequences are equally valid:

 * (d1,d3)(d5,d7)  (d1,d3)(d7,d5)  (d3,d1)(d5,d7)  (d3,d1)(d7,d5)
 * (d5,d7)(d1,d3)  (d5,d7)(d3,d1)  (d7,d5)(d1,d3)  (d7,d5)(d3,d1)

When we change the order of operands, we change the order of bank numbers being accessed; in particular, because stage 1 is odd, our eight reorderings correspond to the following eight possible bank sequences:

 * (01,11)(00,10)  (01,11)(10,00)  (11,01)(00,10)  (11,01)(10,00)
 * (00,10)(01,11)  (00,10)(11,01)  (10,00)(01,11)  (10,00)(11,01)

IN PARTICULAR IF WE REPLACE THE TOGGLING BITS B(S+1),B(S) WITH THEIR PARITAL EQUIVALENTS P(S+1)P(S) WE ALWAYS GET A GROUP-OF-FOUR BANK ORDER (00,01,10,11) FOR THE EVEN STAGES AND (00,10,01,11) FOR THE ODD STAGES (IS THIS TRUE!!??)

In the general case, there are only four cases possible for the non-toggling bits, based on the parity of the even-numbered bits p(e) vs. the odd-numbered bits p(o). The four cases are p(o)p(e) = 00, 01, 10 and 11 respectively. We will assign banks to operands such that banknum = f(Po,Pe) where Po is the parity of ALL odd bits and Pe is the parity of ALL even bits. This means that any given sequential block of four operands will map to four separate banks. In our example,

p(o) = parity(b₃) = parity(0) = 0 and
p(e) = parity(b₄b₀ = parity(01) = 1

Next we will come up with four formulas to handle the four cases

  (po,pe)(b(s+1)b(s)) = 00(00,01,10,11) 01(00,01,10,11) 10(00,01,10,11) and 11(00,01,10,11)

respectively, such that in each case the banknum counts (00,01,10,11) when s is even and consequently will count (00,10,01,11) when s is odd.

In our example (d1,d3,d5,d7) it is bits 1 and 2 that count while the remaining bit b0 is always 1. For this group, then, po is 0 (the empty set, actually) and pe is 1. Solving for group (d1,d3,d5,d7) solves for all groups such that s is odd, po is 0 and p1 is 1, such as

  (d1,d3,d5,d7) (d25,d28,d30,d32) (41,d43,d45,d47)
  (d1,d9,d17,d25) (d2,d10,d18,d26) (d7,d15,d23,d31)
   etc.

[maybe]

Trash: clean example table

**An example table**
Datapoint	Address (binary)	P₁,P₀	bank
d00 d01 d02 d03	0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 1	0,0 0,1 1,0 1,1	m0 m1 m2 m3
d04 d05 d06 d07	0 1 0 0 0 1 0 1 0 1 1 0 0 1 1 1	0,1 0,0 1,1 1,0	m1 m0 m3 m2
d08 d09 d10 d11	1 0 0 0 1 0 0 1 1 0 1 0 1 0 1 1	1,1 1,0 0,1 0,0	m3 m2 m1 m0
d12 d13 d14 d15	1 1 0 0 1 1 0 1 1 1 1 0 1 1 1 1	1,0 1,1 0,0 0,1	m2 m3 m0 m1

Trash: non-adjacent toggle bits table

<=== (Stages 0, 1, 2, same as before) ===>	(new)
Stage 0	Stage 1	Stage 2	Stage 3

d0=0000	d0=0000	d0=0000	d0=0000
d1=0001	d2=0010	d2=0100	d4=1000
d2=0010	d4=0100	d4=1000	d1=0001
d3=0011	d6=0110	d6=1100	d5=1001

d4=0100	d1=0001	d1=0001	d2=0010
d5=0101	d3=0011	d3=0101	d6=1010
d6=0110	d5=0101	d5=1001	d3=0011
d7=0111	d7=0111	d7=1101	d7=1011

...	...	...	...
d7=1110	d7=1101	d7=1011	d7=0111
d7=1111	d7=1111	d7=1111	d7=1111

Trash: Algorithm 1: One radix-2 butterfly (R==2 and B==1) and even number of address bits

Given

a transform with N datapoints such that N is a power of two and N ≥ 16; and
an even number of stages S, that is, S=log₂(N) is an even number where N is the number of datapoints to be transformed (note that the number of address bits is also equal to S).

For the algorithm to work correctly for one butterfly (B=1), we should have four memory banks (M=4) numbered 00₂ to 11₂ i.e. m₁m₀={00₂,01₂,10₂,11₂} i.e. m0=00₂, m1=01₂, m2=10₂ and m3=11₂.

The following equation(s) transform a given stage's datapoint sequence

(dp_(N-1) dp_(N-2)...dp₁,dp₀)

into an equivalent sequence

(dp'_(N-1) dp'_(N-2)...dp'₁,dp'₀)

such that each datapoint dp'_i maps to a memory bank mb_i where the corresponding memory bank sequence

(mb'_(N-1) mb'_(N-2)...mb'₀,mb'₁)

is the repeating series

(m0,m1,m2,m3, m0,m1,m2,m3 ... m0,m1,m2,m3),

where m0 = memory bank 00₂, m1 = memory bank 01₂, m2 = memory bank 10₂ and m3 = memory bank 11₂.

So:

For each S-bit datapoint dp_s,i in a given transform stage s ∈ (0..S-1),

dp_s,i = d_S-1d_S-2...d₁d₀

and i ∈ (0..N-1), redefine it as a new datapoint

dp'_s,i = d'_S-1d'_S-2...d'₁d'₀

where

if (s==S-1) then	d'₀ = P₀ and d'_s = P₁	and all other d'_j = d_j , where j ∈ (1..S-2)
else	d'_s = P_s and d'_(s+1) = P_(s+1)	and all other d'_j = d_j , where j ∈ (0..S-1) and j ∉ {s,s+1}

and

P_k = P₀ = P_even	if k is even and
P_k = P₁ = P_odd	if k is odd

and

P₀ = P_even = Parity(d_S-2,d_S-4...d₂,d₀)

is the parity of the datapoint's even bits, that is all those bits d_j such that j mod 2 = 0; and

P₁ = P_odd = Parity(d_S-1,d_S-3...d₃,d₁)

is the parity of the datapoint's odd bits, that is all those bits d_j such that j mod 2 = 1.

The equations work for

a transform with N datapoints such that N is a power of two and N ≥ 16; and
an even number of stages S, that is, S=log₂(N) is an even number where N is the number of datapoints to be transformed (note that the number of address bits is also equal to S).

That is, they work for number of datapoints N=16,64,256... and thus log(N)=4,6,8... (even) but not for N=8,32,128...where log(N)=3,5,7... is odd.

Example mapping for stage s=0 of a transform for N=16 datapoints and one butterfly (B=1)

**Example mapping for stage s=0 of a transform for N=16 datapoints and one butterfly (B=1)**
Datapoint d (decimal)	Address (binary) = d₃d₂d₁d₀	P₁,P₀	Datapoint d' = Replace d_s+1d_s (d₁d₀) with P₁P₀ (reorder)	P'₁,P'₀	bank = P'₁P'₀
d00 d01 d02 d03	0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 1	0,0 0,1 1,0 1,1	0 0 0 0 = d00 0 0 0 1 = d01 0 0 1 0 = d02 0 0 1 1 = d03	0,0 0,1 1,0 1,1	m0 m1 m2 m3
d04 d05 d06 d07	0 1 0 0 0 1 0 1 0 1 1 0 0 1 1 1	0,1 0,0 1,1 1,0	0 1 0 1 = d05 0 1 0 0 = d04 0 1 1 1 = d07 0 1 1 0 = d06	0,0 0,1 1,0 1,1	m0 m1 m2 m3
d08 d09 d10 d11	1 0 0 0 1 0 0 1 1 0 1 0 1 0 1 1	1,1 1,0 0,1 0,0	1 0 1 1 = d11 1 0 1 0 = d10 1 0 0 1 = d09 1 0 0 0 = d08	0,0 0,1 1,0 1,1	m0 m1 m2 m3
d12 d13 d14 d15	1 1 0 0 1 1 0 1 1 1 1 0 1 1 1 1	1,0 1,1 0,0 0,1	1 1 1 0 = d14 1 1 1 1 = d15 1 1 0 0 = d12 1 1 0 1 = d13	0,0 0,1 1,0 1,1	m0 m1 m2 m3

Example mapping for stage s=0 of a transform for N=16 datapoints and one butterfly (B=1) (new)

**Example mapping for stage s=0 of a transform for N=16 datapoints and one butterfly (B=1)**
Datapoint d=d' (decimal)	Address (binary) = d₃d₂d₁d₀ = d'₃d'₂d'₁d'₀	P'₁,P'₀	Reorder: d" = d'₃d'₂P'₁P'₀	P"₁,P"₀	bank = P"₁P"₀
d00 d01 d02 d03	0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 1	0,0 0,1 1,0 1,1	0 0 0 0 = d00 0 0 0 1 = d01 0 0 1 0 = d02 0 0 1 1 = d03	0,0 0,1 1,0 1,1	m0 m1 m2 m3
d04 d05 d06 d07	0 1 0 0 0 1 0 1 0 1 1 0 0 1 1 1	0,1 0,0 1,1 1,0	0 1 0 1 = d05 0 1 0 0 = d04 0 1 1 1 = d07 0 1 1 0 = d06	0,0 0,1 1,0 1,1	m0 m1 m2 m3
d08 d09 d10 d11	1 0 0 0 1 0 0 1 1 0 1 0 1 0 1 1	1,1 1,0 0,1 0,0	1 0 1 1 = d11 1 0 1 0 = d10 1 0 0 1 = d09 1 0 0 0 = d08	0,0 0,1 1,0 1,1	m0 m1 m2 m3
d12 d13 d14 d15	1 1 0 0 1 1 0 1 1 1 1 0 1 1 1 1	1,0 1,1 0,0 0,1	1 1 1 0 = d14 1 1 1 1 = d15 1 1 0 0 = d12 1 1 0 1 = d13	0,0 0,1 1,0 1,1	m0 m1 m2 m3

Example mapping for stage s=0 of a transform for N=16 datapoints and one butterfly (B=1) (old)

**Example mapping for stage s=0 of a transform for N=16 datapoints and one butterfly (B=1)**
Datapoint d (decimal)	Address (binary) = d₃d₂d₁d₀	P₁,P₀	Datapoint d' = Replace d_s+1d_s (d₁d₀) with P₁P₀ (reorder)	P'₁,P'₀	bank = P'₁P'₀
d00 d01 d02 d03	0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 1	0,0 0,1 1,0 1,1	0 0 0 0 = d00 0 0 0 1 = d01 0 0 1 0 = d02 0 0 1 1 = d03	0,0 0,1 1,0 1,1	m0 m1 m2 m3
d04 d05 d06 d07	0 1 0 0 0 1 0 1 0 1 1 0 0 1 1 1	0,1 0,0 1,1 1,0	0 1 0 1 = d05 0 1 0 0 = d04 0 1 1 1 = d07 0 1 1 0 = d06	0,0 0,1 1,0 1,1	m0 m1 m2 m3
d08 d09 d10 d11	1 0 0 0 1 0 0 1 1 0 1 0 1 0 1 1	1,1 1,0 0,1 0,0	1 0 1 1 = d11 1 0 1 0 = d10 1 0 0 1 = d09 1 0 0 0 = d08	0,0 0,1 1,0 1,1	m0 m1 m2 m3
d12 d13 d14 d15	1 1 0 0 1 1 0 1 1 1 1 0 1 1 1 1	1,0 1,1 0,0 0,1	1 1 1 0 = d14 1 1 1 1 = d15 1 1 0 0 = d12 1 1 0 1 = d13	0,0 0,1 1,0 1,1	m0 m1 m2 m3

TRASH/OLD Boundary condition, odd number of bits

If the number of bits is odd, we have a slight problem. Everything is fine as before from stage 0 up through the penultimate stage S-2. But then things change for the final stage S-1, where the toggling bits are the two outside bits d_S-1 (fast toggle 0,1,0,1) and d₀ (slow toggle 0,0,1,1). In the previous case, where there were an even number of address bits, we knew that d_S-1 was odd and d₀ was even, and thus P_odd would toggle 0,1,0,1 (or 1,0,1,0, depending on the value of the non-toggling bits) while P_even would toggle 0,0,1,1 (or 1,1,0,0).

In the case where the number of bits is odd, however, the two outside (toggling) bits d_S-1 and d₀ are both even. Thus, as d₀d_S-1 count (00,01,10,11), P_odd does not change at all (because the toggling bits are both even) while P_even counts either (0,1,1,0) (or (1,0,0,1), depending on the value of the non-toggling bits) instead of its usual 0-0-1-1 (or 1-1-0-0). Thus, if base the memory bank number on P_oddP_even we only get two memory banks repeated twice, e.g. 00-01-00-01, instead of the desired mapping to four separate memory banks 00-01-10-11

~~Talk about what happens in final stage when s==nbits and nbits is odd.~~

**note stage 1 is WRONG should be 0,2,4,6,8,10,12,14 not 0,2,4,6,1,3,...**
Stage 0	Stage 1	Stage 2
d0=000	d0=000	d0=000
d1=001	d2=010	d4=100
d2=010	d4=100	d1=001
d3=011	d6=110	d5=101
d4=100	d1=001	d2=010
d5=101	d3=011	d6=110
d6=110	d5=101	d3=011
d7=111	d7=111	d7=111

https://skydrive.live.com/view.aspx?cid=8595D345F1A90105&resid=8595D345F1A90105%213371&app=WordPdf&wdo=1

Try this: for stage s=S-1, instead of

d_s = d_(S-1) = P_s = P_1 (because S is even and thus S-1 is odd)

and

d_(s+1) = d_0 = P_(s+1) = P_0,

do this maybe

d_s = d_(S-1) = xor(P_s,d_0) = xor(P_(S-1),d_0) = xor(P_0,d_0) (because S is odd and thus (S-1) is even) (also note xor(P_0,d_0) is same as P_0) (right?)

and

d_(s+1) = d_S = d_1 (why? because d_(S-1)==d_0 sorta?) (because

and

d_1 = xor(P_(s+1),d_0) = xor(P_1,d_0)

Final equation for when nbits is odd and R==2 and B==1

TRASH/OLD Algorithm 2: Final equation(s) for R=2 and B=1

Equation for mapping addresses to banks for one butterly (B==1) at radix 2 (r==2)

TRASH/OLD The ... REST ... of the story?

Projects: FFT_Formal_Proof_-_Auxiliary_Info#Sketching_out_the_remainder_of_the_proof

TRASH/OLD Extension to more than one butterfly unit (B > 1).

Generalization of Algorithm 1, extended for any B=2ⁱ and any number of address bits S: mod_MbitsS=0

Eventually we will find and show that, by extension, if we have

an FFT with B butterflies (and thus M=4B memory banks);
a transform with S=log₂(N) stages such that mod_Mbits(S)=0, where Mbits=log₂ (M)
(e.g. B=1⇒M=4⇒Mbits=2 and mod_Mbits(S)=mod₂(S)=0⇒number of stages S is even);

everything from here down is wrong, i think; should probably instead copy and modify finished algorithm 1 from up above

The mapping of any given datapoint

d =d_S-1d_S-2...d₁d₀

to its proper memory bank

m=m_Mbits-1m_Mbits-2...m₁m₀

is as follows:

m_Mbits-1 = P_{(Mbits-1)/Mbits}
m_Mbits-2 = P_{(Mbits-2)/Mbits}
m₁ = P_1/Mbits
m₀ = P_0/Mbits

where

P_{Mbits-1/Mbits} = Parity(d_S-1,d_S-Mbits-1...d_2Mbits-1,d_Mbits-1)

is the parity of all those bits d_i such that i mod Mbits = (Mbits-1); and

P_{Mbits-2/Mbits} = Parity(d_S-2,d_S-Mbits-2...d_2Mbits-2,d_Mbits-2)

is the parity of all those bits d_i such that i mod Mbits = (Mbits-2); and

P_1/Mbits = Parity(d_S-(Mbits-1),d_S-(2Mbits-1)...d_Mbits+1,d₁)

is the parity of all those bits d_i such that i mod Mbits = 1; and

P_0/Mbits = Parity(d_S-(Mbits),d_S-(2Mbits)...d_Mbits,d₀)

is the parity of all those bits d_i such that i mod Mbits = 0; and The equations thus work for number of datapoints N=16,64,256... and thus log(N)=4,6,8... (even) but not for N=8,32,128...where log(N)=3,5,7... is odd.

Projects: FFT Formal Proof Auxiliary Info - StanfordVLSI/Genesis2 GitHub Wiki

Table of Contents

Housekeeping/Schedule (or: how/when I'm gonna be done with this)

Takala equation and paper

TODO/NEXT/CONTINUE FROM HERE ----------------------------

SCRUBBED TO HERE MAYBE

Notes - bank num encoding table

The Proof (old)

Sketching out the remainder of the proof

extending to nbutts > 1

extending to radix > 2

Trash

trash 1/27/2014

trash

Trash: clean example table

Trash: non-adjacent toggle bits table

Trash: Algorithm 1: One radix-2 butterfly (R==2 and B==1) and even number of address bits

Example mapping for stage s=0 of a transform for N=16 datapoints and one butterfly (B=1)

Example mapping for stage s=0 of a transform for N=16 datapoints and one butterfly (B=1) (new)

Example mapping for stage s=0 of a transform for N=16 datapoints and one butterfly (B=1) (old)

TRASH/OLD Boundary condition, odd number of bits

Final equation for when nbits is odd and R==2 and B==1

TRASH/OLD Algorithm 2: Final equation(s) for R=2 and B=1

TRASH/OLD The ... REST ... of the story?

TRASH/OLD Extension to more than one butterfly unit (B > 1).

Generalization of Algorithm 1, extended for any B=2ⁱ and any number of address bits S: mod_MbitsS=0

everything from here down is wrong, i think; should probably instead copy and modify finished algorithm 1 from up above

⚠️ GitHub.com Fallback ⚠️

Projects: FFT Formal Proof Auxiliary Info - StanfordVLSI/Genesis2 GitHub Wiki

Table of Contents

Housekeeping/Schedule (or: how/when I'm gonna be done with this)

Takala equation and paper

TODO/NEXT/CONTINUE FROM HERE ----------------------------

SCRUBBED TO HERE MAYBE

Notes - bank num encoding table

The Proof (old)

Sketching out the remainder of the proof

extending to nbutts > 1

extending to radix > 2

Trash

trash 1/27/2014

trash

Trash: clean example table

Trash: non-adjacent toggle bits table

Trash: Algorithm 1: One radix-2 butterfly (R==2 and B==1) and even number of address bits

Example mapping for stage s=0 of a transform for N=16 datapoints and one butterfly (B=1)

Example mapping for stage s=0 of a transform for N=16 datapoints and one butterfly (B=1) (new)

Example mapping for stage s=0 of a transform for N=16 datapoints and one butterfly (B=1) (old)

TRASH/OLD Boundary condition, odd number of bits

Final equation for when nbits is odd and R==2 and B==1

TRASH/OLD Algorithm 2: Final equation(s) for R=2 and B=1

TRASH/OLD The ... REST ... of the story?

TRASH/OLD Extension to more than one butterfly unit (B > 1).

Generalization of Algorithm 1, extended for any B=2i and any number of address bits S: modMbitsS=0

everything from here down is wrong, i think; should probably instead copy and modify finished algorithm 1 from up above

⚠️ **GitHub.com Fallback** ⚠️

Generalization of Algorithm 1, extended for any B=2ⁱ and any number of address bits S: mod_MbitsS=0

⚠️ GitHub.com Fallback ⚠️