Fast multiplication routines - simondotm/stardot-wiki GitHub Wiki

Sample Code Library > Fast multiplication routines

Fast multiplication routines in 6502

This is a copy of an article I posted to the Beeb mailing list yonks ago - perhaps it's useful to someone...?

Multiplication by iteration and shifting

Normally, multiplication of two 8-bit numbers can be achieved with a little routine such as this:

.multiply

\ Multiplies A*X, and stores result in product/product+1

CPX #0:BEQ zero

DEX:STX product+1

LSR A:STA product

LDA #0

BCC s1:ADC product+1:.s1 ROR A:ROR product

BCC s2:ADC product+1:.s2 ROR A:ROR product

BCC s3:ADC product+1:.s3 ROR A:ROR product

BCC s4:ADC product+1:.s4 ROR A:ROR product

BCC s5:ADC product+1:.s5 ROR A:ROR product

BCC s6:ADC product+1:.s6 ROR A:ROR product

BCC s7:ADC product+1:.s7 ROR A:ROR product

BCC s8:ADC product+1:.s8 ROR A:ROR product

STA product+1

RTS

.zero

STX product:STX product+1

RTS

This completes in an average of 113 cycles (excluding the multiply by zero special case) - not bad, but it's possible to do better, provided you're happy to set aside some space for some tables...

Faster multiplication by table lookup

There is a method which yields the product of two values from the difference between two squares.

Mathematically:

 (a+b)^2 = a^2 + b^2 + 2ab     --> (I)

 (a-b)^2 = a^2 + b^2 - 2ab     --> (II)

(I) minus (II) gives:

 (a+b)^2 - (a-b)^2 = 4ab

or, in other words:

       (a+b)^2     (a-b)^2

 ab =  -------  -  -------

          4           4

So this means we can store a table of f(n) = n^2 / 4 for n = 0..510, and then can achieve multiplication via a single 16-bit subtract! In reality, we use 4 lookup tables; 2 for the LSB/MSBs of the 16-bit value when n is less than 256, and a further 2 for when n is 256 or greater.

We can divide by 4 without worrying about truncation, because we know that (a+b)^2 - (a-b)^2 will always be a multiple of 4, therefore we lose no information or accuracy by discarding the lower 2 bits in the table itself.

Here's some code:

 10 sqrlo256%=&900

 20 sqrlo512%=&A00

 30 sqrhi256%=&B00

 40 sqrhi512%=&C00

 50 :

 60 FOR N%=0 TO 255

 70 s256%=(N%*N%) DIV 4

 80 s512%=((N%+256)*(N%+256)) DIV 4

 90 N%?sqrlo256%=s256% AND 255

100 N%?sqrhi256%=s256% DIV 256

110 N%?sqrlo512%=s512% AND 255

120 N%?sqrhi512%=s512% DIV 256

130 NEXT

140 :

150 num1=&70:num2=&71:result=&72

160 FOR N%=0 TO 2 STEP 2:P%=&700:[OPT N%

170 .mult

180 SEC:LDA num1:SBC num2:BCS positive

190 EOR #255:ADC #1:.positive TAY

200 CLC:LDA num1:ADC num2:TAX

210 BCS morethan256:SEC

220 LDA sqrlo256%,X:SBC sqrlo256%,Y:STA result

230 LDA sqrhi256%,X:SBC sqrhi256%,Y:STA result+1

240 RTS

250 .morethan256

260 LDA sqrlo512%,X:SBC sqrlo256%,Y:STA result

270 LDA sqrhi512%,X:SBC sqrhi256%,Y:STA result+1

280 RTS

290 ]:NEXT

300 :

310 !result=0

320 REPEAT

330 ?num1=RND(256)-1:?num2=RND(256)-1:CALL mult

340 PRINT;?num1;"*";?num2;"=";!result;" (";?num1*?num2;")"

350 IF ?num1*?num2<>!result:END

360 UNTIL FALSE

So there's a routine which achieves an 8 bit * 8 bit multiplication in just 55 cycles - over twice the speed of the original routine. The only problem with this is the amount of table space required - 1k.

It turns out that we can reduce the table space by 256 bytes by unifying sqrlo256 and sqrlo512. They are virtually the same, apart from that sqrlo512(n) = sqrlo256(n) EOR &80 when n is odd.

Which leads us to the following space optimisation:

 10 sqrlo%=&900

 20 sqrhi256%=&A00

 30 sqrhi512%=&B00

 40 :

 50 FOR N%=0 TO 255

 60 s256%=(N%*N%) DIV 4

 70 s512%=((N%+256)*(N%+256)) DIV 4

 80 N%?sqrlo%=s256% AND 255

 90 N%?sqrhi256%=s256% DIV 256

100 N%?sqrhi512%=s512% DIV 256

110 NEXT

120 :

130 num1=&70:num2=&71:result=&72

140 FOR N%=0 TO 2 STEP 2:P%=&700:[OPT N%

150 .mult

160 SEC:LDA num1:SBC num2:BCS positive

170 EOR #255:ADC #1:.positive TAY

180 CLC:LDA num1:ADC num2:TAX

190 BCS morethan256

200 LDA sqrhi256%,X:STA result+1

210 LDA sqrlo%,X:SEC:BCS lessthan256

220 .morethan256

230 LDA sqrhi512%,X:STA result+1

240 TXA:AND #1:BEQ skip:LDA #&80:.skip

250 EOR sqrlo%,X:.lessthan256

260 SBC sqrlo%,Y:STA result

270 LDA result+1:SBC sqrhi256%,Y:STA result+1

280 RTS

290 ]:NEXT

300 :

310 !result=0

320 REPEAT

330 ?num1=RND(256)-1:?num2=RND(256)-1:CALL mult

340 PRINT;?num1;"*";?num2;"=";!result;" (";?num1*?num2;")"

350 IF ?num1*?num2<>!result:END

360 UNTIL FALSE

...which is an average of about 67 cycles, i.e. still substantially better than the original routine.

I wish I'd thought of this trick when I was writing multiplication-intensive demos years ago...

⚠️ **GitHub.com Fallback** ⚠️