MATH - rosco-pc/propeller-wiki GitHub Wiki

This page is still under construction

The propeller contains a 32 bit ALU with support for signed and unsigned basic arithmetic, a barrel shifter that makes all the difference and logic operations. Here it is assumed that you are familiar with the mnemonics used by the assembler, and that you know a bit of math. Some of these topics were covered in the "Propeller guts" document. (Note: all code was tested using pPropellerSim-and-in-the-case-of-BCD-math-an-actual-propeller,-too!), but bugs could exist, use at your own risk, and read the terms of the license.

Integer math, the four operations

Addition and subtraction are straightforward, multiplication and division require a bit more work.


                add   x,y   wc, wz

Assuming x and y contain already the values to add and those are of the unsigned type. The C flag will signal overflow, i.e. the result is bigger than 2³²-1, and the Z flag will signal a zero result.

For signed numbers adds perform the same operation. A signed number has a smaller range from 0 to 2³¹-1 on the positive side and -1 to -2³¹ on the negative side. So 1 + -1 will be zero (really?) and -3 + -4 will be -7 (unbelievable). As the range is smaller than unsigned, $7FFF_FFFF plus $7FFF_FFFF will give $7FFF_FFFE and will rise the C flag.


                adds   x,y   wc, wz

The C flag will signal again overflow if the result exceeds 2³¹-1 or -2³¹. The Z flag will signal again a zero result.

If you think that 32 bits is not enough you can concatenate several operations together using the C flag to extend the word size, so for 64 bit arguments we have (addx is the addition with carry version of add).


                add   xl,yl   wc
                addx  xh,yh   wc, wz

The addx instruction will use the carry from the first operation, on the lower 32 bits of the number, if any and add it to the upper 32 bits. C and Z work as before. If you need still more precision, more addx instructions can be chained as seen before.

Subtraction works in a similar manner using sub and subx:


                sub   xl,yl   wc
                subx  xh,yh   wc, wz

To multiply the propeller uses the good old addition and shift method due to the lack of a multiply instruction. But before that let us consider some special cases: multiplication by constants. As we know constants have the ability to conserve their value over time (!), so a fixed multiplication can save a few longs here and there. The propeller has a barrel-shifter that is essential for this to be smaller thatn using the normal multiplication depicted below. As we also know multiplication can be distributed across addition and that is the key to many common values:

For x10 = x2+x8 = (x+x4)*2


                shl   x,#1
                mov   r,x
                shl   x,#2
                add   r,x
or...
                mov   r,x
                add   r,x
                shl   x,#3
                add   r,x

For x80 = x16+x*64


                shl   x,#4
                mov   r,x
                shl   x,#2
                add   r,x

The source argument is x and r is the result. The propeller lacks a lea (load effective address) instruction so some neat tricks that can be exploited in x86 or 68k assembly, like multiplying by 3, 5 and 9 in one instruction, are out of the question.

The good old shift-add method works with two variables and a temporal register for counting (usable up to 16*16 bits):


                mov   r,#0
loop
                shr   y,#1   wc
        if_c    add   r,x
                shl   x,#1
                tjnz  y,#loop

This will only work if the lower part of a 1716 to 3232 bits is desired. Detection of overflow (r>2³²) requires that the overflow of the add instruction be honored and passed through, of course after the tjnz the status of the C flag must be checked.


                mov   r,#0
loop
                shr   y,#1   wc
        if_c    add   r,x    wc
                shl   x,#1
        if_nc   tjnz  y,#loop

If a full 64 bits result is needed (rh:rl)... well some changes are required:


                mov   rh,#0
                mov   rl,#0
                mov   xh,#0
loop
                shr   y,#1    wc
        if_nc   jmp   #loop2
                add   rl,xl   wc
                addx  rh,xh
loop2
                shl   xl,#1   wc
                rcl   xh,#1
                tjnz  y,#loop

Of course there are variations, this can be unrolled if we know how many effective bits one of the arguments has. If y is always smaller than x, it is better to test y against zero (tjnz instruction) than to test x. This can reduce the running time. These examples were coded for unsigned numbers, in the case of signed ones, the instruction abs, previous sign test could be used to produce the right result:


                mov   s,x     ' saves sign of x
                xor   s,y     ' calculates sign of result
                abs   x,x     ' calculates absolute value of x
                abs   y,y     ' of y ...
                mov   r,#0
loop
                shr   y,#1   wc
        if_c    add   r,x    wc
                shl   x,#1
        if_nc   tjnz  y,#loop

                mov   s,s wc ' sets C accordingly to the sign
                negc  r,r    ' negates result if necessary

The division requires a similar algorithm, but we subtract instead of add, x = x / y:


                mov   t,#16
                shl   y,#15
loop
                cmpsub x,y   wc
                rcl   x,#1
                djnz  t,#loop

The use of cmpsub reduces the amount of instructions per cycle loop to only 3, a nice bonus, if space is not a constraint these loops can be unrolled and up to 30% of time saved. If just 8 bits are used the first mov shoud be with 8 and the shift with 24.

Let's investigate cmpsub a bit more. As you know cmp is the sub instruction with the effect nr in place, do not write result back, to only affect flags. Flags as always must be explicitly indicated. cmp will rise C when the source is bigger than the destination. cmpsub will rise C if the source is smaller than the destination and will subtract the source to the destination placing the result into destination:


                cmpsub x,y   wc

x               long   5
y               long   7

Will not rise C neither will modify x.


                cmpsub x,y   wc

x               long   12
y               long   7

Will rise C and subtract y from x, resulting in a value of 5 in x. This instruction can be exploited in some instances when a sequence like this is found:


                cmp    x,y   wc
        if_nc   sub    x,y   wc

x               long   5
y               long   7

It may not be a big difference, but a long saved here and there help when there is not that many of them.

The barrel shifter

The propeller has some neat tricks as we saw with cmpsub. It packs some more, and the barrel shifter is one of them. Most small processors, i.e. 8-bit ones and some 16 bit (z80, H8/300, HC11, ...) have simple 1-bit shifters. You may be familiar with the typical: let's convert a binary to a hex string requiring some 4 shifts per high digit. The cog in the propeller is not your average 8 bit processor, it is a 32 bit one, and a modern one!, so a barrel shifter was included, like in any serious (ARM, SH) processor. This shifter can perform a shift with any number of bits between 1 and 31 in exactly the same amount of time, because the shifter actually... has no shift registers!!, it has some multiplexors instead. Shown routines for multiply and divide make use of this, shifts of 8 or 16 bits in one instruction. I cannot stress enough how useful this is. The behaviour of the carry flag could seem a bit awkward, but it has its motives, (the zero flag is always set if the result is zero). The C flag is set if the original first or 31st bit was 1 depending on if it was a left shift (31st) or right shift (first), and also independently of it was a 10 bit shift or a 1 bit shift.

Simple binary multiplications by powers of two can be implemented using the shifter. Note that adding a value to itself has the same effect as a 1-bit left shift (shl), being the same instruction is some architectures.

The instruction sar shifts right arithmetically, that means it takes care of the sign. If a number is negative it will remain negative, if it is positive it will be positive till it reaches zero. Very useful to sign extend a number:


                shl     x,#16         ' shifts left, sign (bit 15) becomes bit 31
                sar     x,#16         ' shifts conserving the bit 31 status, sign extending the number to 32 bits

NOTE: In the case where two's complement numbers are used and they are not 32 bit quantities, i.e. the sign is other than bit 31, the method described above can be used to sign extend the number. A subsequent and safe call to abs will return the absolute value of the number. A use of abs before the number is sign extended will lead to an unmodified number.

The C and Z flags

The two flags available can be tested and modified in almost all instructions, provided that the corresponding effects are in place. The carry flag, C, indicates, depending on the instruction, a number of different things, parity, bit set or reset, carry, borrow, etc. In comparison instructions it indicates borrow (mostly). Sometimes it could be useful to set or reset it.

To set C we can do:


                mov     x,#1
                shr     x,#1    wc

To clear C we can do, this will not modify x, but will reset C (actually it mirrors the status of the 31st bit of x after the move):


                mov     x,#0    wc nr

== ==

FFT

A working implementation, also written by me, of a Radix-2 FFT algorithm can be found here.

Fixed point math

An article about fixed point math can be found here.-(it-needs-some-more-examples-and-descriptions,-I'm-working-on-that).

Double precision binary floating point

The single precision binary floating point library from the obex is known to almost everyone, but when single precision is not enough, double precision can solve the problem with its 53 bits of significant providing up to 16 decimal digits. Having a 32 bit ALU, the propeller can compute these numbers quite fast, using unrolled loops for multiplication and division, around 2000 cycles are required for either function.

The format as per the standard is


  63 62  52 51                    0
+---+--...-+---------...-----------+
| s |  exp |     significant       |
+---+--...-+---------...-----------+

The significand sign is the bit 63, set when negative. The exponent is biased with the number 1023, that means that all numbers greater than 0 have an exponent of 1023 or bigger. The exponent 2047 is used to represent Infinities and NaNs (Not a number). Infinities have a zeroed significand while NaNs have a non zero. An exponent of zero means either that the number is zero or that a denormalized number is presented. The support for denormalized numbers (those where their bit 53 is zero) is somewhat lacking in the following routines. The significand is 53 bits long, but the most significant bit is assumed set when the exponent is non-zero and thus not stored.

Addition and subtraction are the simplest and fastest, adding 53 bits requires only 2 instructions, checking for bad input, scaling and sign management takes the rest. A possible implementation is as follows (LGPL v 2.0 code), see the link at the bottom for a file with all the routines.


' Adds two double precision numbers
' they should be already unpacked, result goes to R

dSUB           xor     rBSgn,cnt_h8     ' changes sign of B

dADD           test    flags,#FLG_NAN|FLG_INF   wz
       if_nz   call    #dLOADRNAN
       if_nz   jmp     #dADD_ret

               mov     rt1,rASgn
               xor     rt1,rBSgn   wz
       if_nz   jmp     #dSUB_1

               mov     rt1,rAExp
               subs    rt1,rBExp   wz
               abs     rt2,rt1
               mov     rRExp,rAExp
       if_z    jmp     #dADD_20

               cmp     rt1,#53   wc
       if_c    jmp     #dADD_5       ' shifts B
               cmp     rt2,#53   wc
       if_c    jmp     #dADD_10      ' shifts A
               cmp     rt1,#53   wc
       if_c    call    #dLOADBTOR
       if_nc   call    #dLOADATOR
               jmp     #dADD_ret

dADD_5         shr     rB,#1   wc
               rcr     rB1,#1
               djnz    rt2,#dADD_5
               jmp     #dADD_20

dADD_10        shr     rA,#1   wc
               rcr     rA1,#1
               djnz    rt2,#dADD_10
               mov     rRExp,rBExp

dADD_20        mov     rR1,rA1
               add     rR1,rB1   wc
               mov     rR,rA
               addx    rR,rB
               test    rR,cnt_bit54 wz
       if_nz   shr     rR,#1   wc
       if_nz   rcr     rR1,#1
       if_nz   add     rRExp,#1
               mov     rRSgn,rASgn

dADD_ret
dSUB_ret       ret

dSUB_1         mov     rt1,rAExp
               subs    rt1,rBExp   wz
               abs     rt2,rt1
        if_z   jmp     #dSUB_10    ' subs, no shift, exponents are equal
               cmp     rt1,#53   wc
        if_c   jmp     #dSUB_5    ' shifts B
               cmp     rt2,#53   wc
        if_c   jmp     #dSUB_5
               cmp     rt1,#53   wc
        if_c   call    #dLOADBTOR
        if_nc  call    #dLOADATOR
               jmp     #dSUB_ret

' exp of A is bigger than exp of B
dSUB_5         shr     rB,#1   wc
               rcr     rB1,#1
               djnz    rt2,#dSUB_5
' R=A-B
               mov     rRSgn,rASgn    ' transfers sign
               mov     rRExp,rAExp
               mov     rR1,rA1
               sub     rR1,rB1   wc, wz
               mov     rR,rA
               subx    rR,rB   wz
       if_nz   jmp     #dSUB_25
               jmp     #dSUB_35

' exponents are equal, so check significand
dSUB_10        cmp     rA1,rB1   wc, wz
               cmpx    rA,rB    wc, wz
    if_c       jmp     #dSUB_20          ' sig(A)<sig(B)
               jmp     #dSUB_35          ' numbers are equal

' B is bigger than A, we shift A and perform R=B-A
dSUB_15        shr     rA,#1   wc
               rcr     rA1,#1
               djnz    rt2,#dSUB_15
' R=B-A
dSUB_20        mov     rRSgn,rBSgn    ' transfers sign
               mov     rRExp,rBExp
               mov     rR1,rB1
               sub     rR1,rA1 wc
               mov     rR,rB
               subx    rR,rA

dSUB_25        mov     rt1,#53
dSUB_30        test    rR,cnt_bit53   wz
        if_nz  jmp     #dSUB_ret
               sub     rRExp,#1
               shl     rR1,#1   wc
               rcl     rR,#1
               djnz    rt1,#dSUB_30    ' normalizes
dSUB_35        call    #dLOADZTOR
               jmp     #dSUB_ret

Multplication requires a bit more work, a similar as that one explained before is used, but in two stages because a 96 bits partial result is kept instead of a full 107 when the most significant bits arer known to be zero. Exponents are added as usual and signs are xored.


' ********************************************************
' ****
' **** Multiplication R=A*B

dMUL           test    flags,#FLG_NAN|FLG_INF   wz
       if_nz   call    #dLOADRNAN
       if_nz   jmp     #dMUL_ret

               test    flags,#FLG_Z   wz
       if_z    call    #dLOADZTOR             ' if any of the numers are zero
       if_z    jmp     #dMUL_ret

               mov     rRExp,rAExp
               adds    rRExp,rBExp


' ** do not forget to check for overflow ;-)
               shl     rA,#11
               mov     rt1,rA1
               shl     rA1,#11
               shr     rt1,#21
               or      rA,rt1                 ' shifts to accomodate final product

               mov     rt4,#32                ' 32 bits first
               mov     rR,#0                  ' result significand
               mov     rR1,#0
               mov     rt1,#0
               mov     rt2,#0
               mov     rt3,#0
dMUL_10        shr     rB,#1   wc
               rcr     rB1,#1 wc
       if_nc   jmp     #dMUL_12
               add     rt2,rA1   wc
               addx    rt1,rA   wc
               addx    rR1,rt3   wc
dMUL_12        shl     rA1,#1   wc
               rcl     rA,#1  wc
               rcl     rt3,#1
               djnz    rt4,#dMUL_10
' 32 bits are done, now we multiply the other 21

               mov     rt2,#0
dMUL_15        shr     rB1,#1   wc
       if_nc   jmp     #dMUL_17
               add     rt1,rA   wc
               addx    rR1,rt3   wc
               addx    rR,rt2
dMUL_17        shl     rA,#1  wc
               rcl     rt3,#1 wc
               rcl     rt2,#1
               tjnz    rB1,#dMUL_15

               test    rR,cnt_bit53   wz
        if_nz  add     rRExp,#1                ' increments exponent
        if_z   shl     rt1,#1   wc
        if_z   rcl     rR1,#1   wc
        if_z   rcl     rR,#1

               shl     rt1,#1   wc             ' rounds up
               addx    rR1,#0   wc
               addx    rR,#0
               test    rR,cnt_bit54   wz
        if_nz  add     rRExp,#1                ' increments exponent
        if_nz  shr     rR,#1   wc
        if_nz  rcr     rR1,#1

dMUL_20        mov     rRSgn,rASgn
               xor     rRSgn,rBSgn
dMUL_ret       ret

Division needs also a loop, and a possible implementation follows:


' double division, R=A/B
'

dDIV           test    flags,#FLG_NAN|FLG_INF   wz
       if_nz   call    #dLOADRNAN
       if_nz   jmp     #dDIV_ret

               test    flags,#FLG_Z  wz
       if_z    jmp     #dDIV_2
               test    flags,#FLGB_Z   wz
       if_nz   or      flags,#FLG_ERR_DIV0    ' x/0 (even 0/0)
       if_z    call    #dLOADZTOR             ' 0/x = 0
               jmp     #dDIV_ret

dDIV_2         mov     rRExp,rAExp
               subs    rRExp,rBExp

               shl     rA,#11
               mov     rt1,rA1
               shl     rA1,#11
               shr     rt1,#21
               or      rA,rt1                 ' shifts to accomodate final product

               shl     rB,#11
               mov     rt1,rB1
               shl     rB1,#11
               shr     rt1,#21
               or      rB,rt1                 ' shifts to accomodate final product

               mov     rt1,#53
               cmp     rA1,rB1   wc, wz
               cmpx    rA,rB   wc, wz
       if_c    jmp     #dDIV_4

       if_nz   sub     rt1,#1
               sub     rA1,rB1   wc
               subx    rA,rB

dDIV_4         shr     rB,#1   wc
               rcr     rB1,#1
               sub     rRExp,#1

               mov     rR1,#0
               mov     rR,#0   wc
dDIV_5         cmp     rA1,rB1   wc,wz
               cmpx    rA,rB   wc
       if_c    jmp     #dDIV_10
               sub     rA1,rB1   wc
               subx    rA,rB   wc

dDIV_10        rcl     rR1,#1   wc
               rcl     rR,#1   wc
               shl     rA1,#1  wc
               rcl     rA,#1
               djnz    rt1,#dDIV_5

               xor     rR,cnt_f
               xor     rR1,cnt_f

dDIV_20        and     rR,cnt_sigh ' clears garbled bits

dDIV_ret       re

A comparsion routine could be implemented as follows:


' Comapres A and B, sets the flags accordingly.
' Two NaNs will give a non equal result

dCMP           test    flags,#FLG_NAN|FLG_INF   wz
       if_nz   mov     rt1,#2   wz, wc
       if_nz   jmp     #dCMP_ret   ' two NaNs or Infs give a diff result

               cmp     rBSgn,rASgn wz, wc
       if_nz   jmp     #dCMP_ret   ' different signs, neg < pos

               cmps    rAExp,rBExp   wz, wc
       if_nz   jmp     #dCMP_ret   ' they are different
               cmp     rA1,rB1   wz, wc
               cmpx    rA,rB   wz, wc
dCMP_ret       ret

Some support code is also neede, to pack and unpack numbers, to set flags and so on.


' loads A from pointer ptr and unpacks

dLOADA         andn    flags,#FLGA_Z|FLGA_INF|FLGA_NAN
               rdlong  rA,ptr
               add     ptr,#4
               mov     rAExp,rA
               rdlong  rA1,ptr
               mov     rASgn,rA
               and     rASgn,cnt_h8
               add     ptr,#4

               andn    rA,cnt_sexp   wz
       if_z    mov     rA1,rA1   wz
       if_z    or      flags,#FLGA_Z       ' temporal Z flag if significand is zero

               and     rAExp,cnt_exp
               cmp     rAExp,cnt_exp   wc ' checks for NaN or Inf
       if_nc   test    flags,#FLGA_Z   wz
   if_nc_and_z or      flags,#FLGA_INF     ' infinity
  if_nc_and_nz or      flags,#FLGA_NAN     ' Not a number

               shr     rAExp,#20   wz
       if_nz   or      rA,cnt_bit53       ' adds implicit 53 bit if not zero
       if_nz   sub     rAExp,cnt_bias     ' subtract bias if is not zero
       if_nz   andn    flags,#FLGA_Z       ' is not zero anymore
dLOADA_ret     ret

' loads B from pointer ptr and unpacks

dLOADB         andn    flags,#FLGB_Z|FLGB_INF|FLGB_NAN
               rdlong  rB,ptr
               add     ptr,#4
               mov     rBExp,rB
               rdlong  rB1,ptr
               mov     rBSgn,rB
               and     rBSgn,cnt_h8
               add     ptr,#4

               andn    rB,cnt_sexp   wz
       if_z    mov     rB1,rB1   wz
       if_z    or      flags,#FLGB_Z       ' temporal Z flag if significand is zero

               and     rBExp,cnt_exp
               cmp     rBExp,cnt_exp   wc ' checks for NaN or Inf
       if_nc   test    flags,#FLGB_Z   wz
   if_nc_and_z or      flags,#FLGB_INF     ' infinity
  if_nc_and_nz or      flags,#FLGB_NAN     ' Not a number

               shr     rBExp,#20   wz
       if_nz   or      rB,cnt_bit53       ' adds implicit 53 bit if not zero
       if_nz   sub     rBExp,cnt_bias     ' subtract bias if is not zero
       if_nz   andn    flags,#FLGB_Z       ' is not zero anymore
dLOADB_ret     ret

' packs and saves R to ptr, destroys rR, rRExp

dSAVER         andn    rR,cnt_bit53     ' removes imlicit 1 at bit 53
               or      rR,rRSgn
               test    rRExp,rRExp   wz
       if_nz   add     rRExp,cnt_bias   ' if it is zero, keep it zero
               shl     rRExp,#20
               or      rR,rRExp
               wrlong  rR,ptr
               add     ptr,#4
               wrlong  rR1,ptr
               add     ptr,#4
dSAVER_ret     ret

' loads NaN into R
' exp = 0x3ff
' non-zero significand

dLOADRNAN      mov     rRExp,cnt_NaN
               mov     rR,cnt_bit52
               mov     rR1,#0
dLOADRNAN_ret  ret
' loads A into R
dLOADATOR      mov     rR,rA
               mov     rR1,rA1
               mov     rRExp,rAExp
               mov     rRSgn,rASgn
dLOADATOR_ret  ret
' loads B into R
dLOADBTOR      mov     rR,rB
               mov     rR1,rB1
               mov     rRExp,rBExp
               mov     rRSgn,rBSgn
dLOADBTOR_ret  ret

' loads zero into R
dLOADZTOR      mov     rR,#0
               mov     rR1,#0
               mov     rRExp,#0
               mov     rRSgn,#0
dLOADZTOR_ret  ret


' Numbers are unpacked for easier manipulation
' explicit 53ed bit is added


rA             long    $10_0000
rA1            long    0'$5555_5555
rAExp          long    $3ff
rASgn          long    0

rB             long    $10_0000
rB1            long    0'$5555_5555
rBExp          long    $3ff
rBSgn          long    0

rR             long    0
rR1            long    0
rRExp          long    0
rRSgn          long    0

rt1            long    0
rt2            long    0
rt3            long    0
rt4            long    0

rt5            long    0

ptr            long    0
flags          long    0

cnt_h8         long    $8000_0000
cnt_exp        long    $7ff0_0000
cnt_sexp       long    $fff0_0000
cnt_bit54      long    $0020_0000
cnt_bit53      long    $0010_0000
cnt_bit52      long    $0008_0000
cnt_bias       long    $3ff
cnt_NaN        long    $7ff
cnt_f          long    $ffff_ffff
cnt_sigh       long    $1f_ffff

BCD MATH

As everyone knows BCD stands for binary coded decimal, i.e. each decimal digit is represented in binary, and all operations are done with decimal numbers... that means that from our stand point: exactly as we will do them on paper. BCD arithmetic has some caveats and some advantages compared to binary arithmetic:

Pros:

No rounding problems when working with human-readable numbers
Fastest floating-point to string conversion
I like it
BCD floating point

Cons

Slower
wastes space
poor support in the assembler

Number representation

To represent every decimal digit, a minimum of 4 bits are needed. That means: for 8 digits a long (32 bits) will be needed, while with a binary representation a long would hold 9 or 10 digits.


For a 8 digit number 12345678

+---+---+---+---+---+---+---+---+
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
+---+---+---+---+---+---+---+---+

For the same number the binary would have been BC614E (24 bits instead of 32)

The four basic arithmetic operations are performed on BCD numbers as we will do them on paper, but the implementations are not that straightforward because carry/borrow from digit to digit has to be considered, something we hardly think about with binary numbers, i.e. the addition produces a number greater than 9. Binary operations have to be used because the propeller lacks BCD ones and daa (decimal adjust after addition as found in the Z80, HC11, etc) or similar. A compare and add/subtract if condition method has to be used operating on a digit-by-digit basis.


' Adds two 8 digit numbers (longs)
' carry is used in negative logic !

mADD8           mov     rcarry,#3
                shr     rcarry,#1 wc ' sets carry flag
mADD8C          mov     rmsk1,#$f
                mov     rt5,#0
                mov     rsh1,#10

mADD8_1         mov     rt3,rt1
                and     rt3,rmsk1

                mov     rt4,rt2
                and     rt4,rmsk1

        if_nc   add     rt4,rcarry
                add     rt3,rt4 wc
        if_nc   jmp     #mADD8_5
                mov     rcarry,#1 wc ' clears carry flag for next round
                jmp     #mADD8_ret

mADD8_5         cmp     rt3,rsh1 wc
        if_nc   sub     rt3,rsh1
                or      rt5,rt3
                rol     rcarry,#4 ' magic
                shl     rsh1,#4
                shl     rmsk1,#4 wz
        if_nz   jmp     #mADD8_1
mADD8C_ret
mADD8_ret       ret

The code operates over longs, i.e. 8 digits, rt1 is then added to rt2 with result on rt5. Several helping variables are then needed: rmsk1 is the digit mask that after every cycle is shifted left and also used as counter, rcarry is the number that will be added to the digit that is left of the current added, rt3 is the current digit from rt1 and rt4 is from rt2, rsh1 is the overflow value to compare with that is also shifted left after every cycle.

As you can see this small routine replaces the comfortable and short binary add :).

The subtraction could be implemented like this:


' Subs two 8 digit numbers (longs)

mSUB8           mov     rcarry,#1 wc ' clears carry flag
mSUB8C          mov     rmsk1,#$f
                mov     rt5,#0
                mov     rsh1,#10
mSUB8_1         mov     rt3,rt1
                and     rt3,rmsk1
                mov     rt4,rt2
                and     rt4,rmsk1
        if_c    add     rt4,rcarry
                sub     rt3,rt4 wc

        if_c    add     rt3,rsh1
                or      rt5,rt3
                rol     rcarry,#4 ' magic
                shl     rsh1,#4
                shl     rmsk1,#4 wz
        if_nz   jmp     #mSUB8_1
mSUB8C_ret
mSUB8_ret       ret

This routine uses the same tricks and variables as before but the C flag is used in positive logic instead.

Both routines can be concatenated calling mADD8/mSUB8 first and mADD8/mSUB8C later without modifying the C flag in between.

To multiply the method of shift-add as we will do by hand is one of the few resources left. Sadly, in comparison with binary multiplication, the short add has to be replaced with a call to mADD8, but the logic is similar. Here rA:rA1 is multiplied by rB1 using the pair rR:rR1 as result.


' multiplies A*rB1
'
' uses rt1, rt2, rt5, rt6, rcnt1, rp, rB1, rR, rR1
'
' clogged by mADD8 rt3, rt4, rt5, rcarry, rmsk1, rsh1, rR, rR1

mMUL8           mov     rt7,#8

mMUL8_5         mov     rcnt1,rB1
                and     rcnt1,#$f wz
                mov     rt6,#0
        if_z    jmp     #mMUL8_15

mMUL8_10        mov     rt1,rA1
                mov     rt2,rR1
                call    #mADD8
                mov     rR1,rt5
                mov     rt1,rA
                mov     rt2,rR
                call    #mADD8C
                mov     rR,rt5
        if_nc   add     rt6,#1          ' carry counter
                djnz    rcnt1,#mMUL8_10

mMUL8_15        mov     rt5,rR
                shl     rt5,#4
                shr     rR1,#4
                or      rR1,rt5         ' shift right rR:rR1
                ror     rt6,#4          ' convert to MSD
                or      rR,rt6          ' sets new carry digit
                shr     rB1,#4
                djnz    rt7,#mMUL8_5
mMUL8_ret       ret

What the small fragment of code does is to add rA:rA1 to itself as many times as the digits in rB1 from right to left say. mADD8 and mADD8C take care of adding rA:rA1 to rR:rR1. after each digit of rB1, rR:rR1 is shifted right to discard the rightmost digit and to make place for the new MSD.

All this may seem like a real waste, but opens the door to BCD-floating point, the real end.

Floating point BCD (BCD12)

To complete the package, we should talk about how to operate on whole floating point numbers. For that we will consider the following notation (which we will call BCD12 from now on), (more possibilities are of course available, and the principles explained here apply):


         +---+---+---+---+---+---+---+---+
long 0 : | S |MSD| A | 9 | 8 | 7 | 6 | 5 |
         +---+---+---+---+---+---+---+---+
         +---+---+---+---+---+---+---+---+
long 1 : | 4 | 3 | 2 | 1 |LSD| E2| E1| E0|
         +---+---+---+---+---+---+---+---+

The floating point number occupies 2 longs in HUB memory (4 in COG memory) and is packed according to the diagram above, each cell represents a nibble (4 bits). S is the significant sign, MSD to LSD are the significant digits, 12 in total, and the exponent occupies the last three nibbles. The exponent is a two's complement 12 bit number. Negative numbers represent negative powers of 10.

The number Zero is represented as two zeroed longs.

To better exploit the propeller capabilities, the HUB representation is unpacked to 4 longs in COG memory (it may seem a waste, but the access to the different parts in every routine saves many more longs than the 2 extra used for the unpacked representation):


Representation in cog's memory.
         +---+---+---+---+---+---+---+---+
long 0 : | 0 |MSD| A | 9 | 8 | 7 | 6 | 5 |  rA, rB, rR
         +---+---+---+---+---+---+---+---+
         +---+---+---+---+---+---+---+---+
long 1 : | 4 | 3 | 2 | 1 |LSD| 0 | 0 | 0 |  rA1, rB1, rR1
         +---+---+---+---+---+---+---+---+
         +---+---+---+---+---+---+---+---+
long 2 : | es| es| es| es| es| E2| E1| E0|  rAExp, rBExp, rRExp
         +---+---+---+---+---+---+---+---+
         +---+---+---+---+---+---+---+---+
long 3 : | S | 0 | 0 | 0 | 0 | 0 | 0 | 0 |  rASgn, rBSgn, rRSgn
         +---+---+---+---+---+---+---+---+

As you can see this representation keeps the form of the packed version, for easier access. The exponent is sign extended with the shl/sar combination as seen in the load routines below. This allows for fast add/compare/subtract of exponents.

A conversion routine that loads rA:rA1, rAExp and rASgn could be this one, using ptr1 as source pointer:


' Loads A from a BCD12
LOADA           rdlong rA,ptr1        ' reads first long
                add    ptr1,#4
                mov    rASgn,rA
                rdlong rA1,ptr1        ' reads 2nd long
                and    rASgn,cnt_SMASK
                andn   rA,cnt_SMASK
                mov    rAExp,rA1
                shl    rAExp,#20    ' exponent is signed
                sar    rAExp,#20
                and    rA1,cnt_2LMASK
LOADA_ret       ret

cnt_SMASK       long    $8000_0000    ' sign mask
cnt_D12         long    $0f00_0000    ' MSD (digit 12) mask
cnt_2LMASK      long    $ffff_f000    ' low long mask

The packing of a result to HUB memory could be implemented as below using ptr1 as destination pointer:


' saves R as a BCD12
SAVER           mov    rt1,rR
                or     rt1,rRSgn
                mov    rt2,rRExp
                andn   rt2,cnt_2LMASK
                mov    rt3,rR1
                and    rt3,cnt_2LMASK
                wrlong rt1,ptr1
                or     rt2,rt3
                add    ptr1,#4
                wrlong rt2,ptr1
SAVER_ret       ret

The packing scheme now seems to fit nicely. The extra empty digit at the left of the MSD is used as guard digit during addition and multiplication, and thus plays to our advantage. The empty digits at the right of the LSD (cleared using the cnt_2LMASK) pose a performance penalty due to its computation in mADD8/mSUB8, the most significant of the group could be used for instance, to better round results.

From ASCII to BCD12

To convert a string to a BCD12 (or for that matter to binary floating point) a set of rules, like always, should be put into action. These rules help to determine what can be accepted as a valid input and what is not.

Spaces preceding the first valid character should be ignored. No spaces are allowed in between.
The only valid characters are digits 0 to 9, signs + and -, the period . and the letter e.
An optional significant sign, it should be the first valid character if present.
The next valid symbol is either a digit or the period.
A number of digits (any digit present beyond 12 will be chopped but they may add to the exponent if the number has an exponent greater than 12), if only a period was present a minimum of one digit must be present.
An exponent composed of three parts, the letter e to indicate it, an optional sign + or - and a minimum of 1 digit to a maximum of three digits.

All these numbers must be valid then:

-0.0012 123400000 0000123000 +45e-123 -0.0000004e+40 (why someone will write a number like this is beyond me)

Some invalid combination include:

-e+4 (No significant digit(s)) a1e-4 (invalid char) 4.5e- (No exponent digit(s)) . (No significant digit(s))

Will those rules we should be able to write some nice code. Note: The use of assembler for this purpose is... not recommendable as this routine will be not only complicated but over all, long, and not time critical at all. But as an exercise is a good one.


ASCIITOBCD

ASCIITOBCD_ret  ret

From BCD12 to ASCII

Conversion from BCD12 to ASCII is quite straightforward, but some points should be noted. To how many places do we have to represent the number ?, that is the question to be asked. Complementary of that are we going to use all the available range or just a small subset of numbers ?. As the propeller has a limited COG memory, a small and specifically tailored conversion may be the way to go.

Small and tailored

A small and tailored conversion can be seen as a quick-and-dirty approach, let's say that the numbers to represent are in the tens of thousands, decimals are unimportant, then we can truncate them without looking back. A possibility could be:


 ' Converts a number in the 10000 to 99999 range to ascii
 ' Number is in r, sign and decimals are unimportant
 ' ptr1 is destination
 '
TOASCIIQD       mov    rRExp,rRExp wc 'checks for negative exponent,
       if_c     mov    rt1,#48
       if_c     call   #EMITASCII
       if_c     jmp    #TOASCIIQD_40
' number may be in range
                cmp    rRExp,#5 wc
       if_nc    mov    rt1,#69 ' E signals error
       if_nc    call   #EMITASCII
       if_nc    jmp    #TOASCIIQD_40
 ' number is in range
                mov    rt2,rRExp
                add    rt2,#1
                mov    rt3,rR ' working significant
TOASCIIQD_10    mov    rt1,rt3
                shr    rt1,#24
                and    rt1,#15
                add    rt1,#48 ' converts digit to ASCII
                call   #EMITASCII
                shl    rt3,#4
                djnz   rt2,#TOASCIIQD_10

TOASCIIQD_40    mov    rt1,#0
                call   #EMITASCII
TOASCIIQD_ret   ret

' writes an ascii to HUB and increments pointer
' rt1 is the byte to write
' ptr1 is the pointer

EMITASCII       wrbyte rt1,ptr1
                add    ptr1,#1
EMITASCII_ret   ret

The helper routine EMITASCII does... well exactly that!, and increments the pointer. Small helping routines can save a few longs here and there, in some situations, but they can increase execution by 8 cycles each time they are called. So for simple and short loops it is the way to go.

If we were to consider rounding things may get... slower, and longer. The previous example can be taken as rounded with the floor function, or simply put: truncation. Rounding to nearest can be implemented adding 0.5. A simple call to mADD8 with the properly formatted argument can be used... something like this:


 ' Converts a number in the 10000 to 99999 range to ascii
 ' Number is in r, sign is unimportant. Rounding is performed if the number is >= 1
 ' ptr1 is destination
 '
TOASCIIQDR      mov    rRExp,rRExp wc 'checks for negative exponent,
        if_c    mov    rt1,#48
        if_c    call   #EMITASCII
        if_c    jmp    #TOASCIIQDR_40
 ' number may be in range
                cmp    rRExp,#5 wc
TOASCIIQDR_5
        if_nc   mov    rt1,#69 ' E signals error
        if_nc   call   #EMITASCII
        if_nc   jmp    #TOASCIIQDR_40
' number is in range, adds rounding factor
                mov    rt1,rR
                mov    rt2,#5 ' rounding argument
                mov    rt3,#5
                sub    rt3,rRExp
                shl    rt3,#2
                shl    rt2,rt3 ' adjust rounding digit
                call   #mADD8
                mov    rt2,rRExp
                test   rt5,cnt_MSD wz
        if_nz   add    rt2,#1 ' increments exponent if overflow
        if_nz   shr    rt5,#4 ' shifts working significant one to the right
                cmp    rt2,#5 wc
        if_nc   jmp    #TOASCIIQDR_5
                add    rt2,#1
TOASCIIQDR_10   mov    rt1,rt5
                shr    rt1,#24
                and    rt1,#15
                add    rt1,#48 ' converts digit to ASCII
                call   #EMITASCII
                shl    rt5,#4
                djnz   rt2,#TOASCIIQDR_10

TOASCIIQDR_40   mov    rt1,#0
                call   #EMITASCII
TOASCIIQDR_ret  ret

 ' writes an ascii to HUB and increments pointer
 ' rt1 is the byte to write
 ' ptr1 is the pointer

EMITASCII       wrbyte rt1,ptr1
                add    ptr1,#1
EMITASCII_ret   ret

If all numbers have to be converted, i.e. the full range at full precision, 12 digits, plus sign and exponent a maximum of 19 characters will be generated. Shorter representations, when applicable, are possible depending on the number. A good criteria is if the number of digits to represent is greater than say 12, a number with exponent should be used. It is easier to read 1234 than 1.234e+3. The last notation will be called scientific notation. A smart routine that can differentiate between this two cases has three parts. The first is the conversion to scientific if the exponent is greater than 12 in any direction.

Note: The BCD12 representation of numbers in following examples (the ones on the left) are written in scientific notation for clarity.

The second is the representation of numbers smaller than one, where zeroes should be inserted between the decimal point and the first digit of the significant, the number of zeroes in this case is the absolute value of the exponent minus 1.

The third and final case is that of numbers greater than 1 that have 12 or less digits, zeroes should be removed after the decimal point if there are no significant digit at the right, and zeroes should be inserted between the last digit at the right and the decimal point if required (this last case is the more complex one of the three).

Number	Representation	Comment
1.34e100	1.34e100	No prettier form available, exponent > 12
3.4e-1	.34
3.4e-5	.000034	In this case some zeroes where added after the decimal point. Those zeroes are not in the original BCD12 number, because they are all normalized
1.2e1	12	not like 1.2e1 or 12.0000000000
1.234e1	12.34
4.e+6	4000000	In this case zeroes where added after the last significant digit, the 4, and the decimal point.

Now, let's see some code:


TOASCII

TOASCII_ret     ret

Addition and subtraction of BCD12 numbers

The routine shown before allows for addition of two 8 BCD numbers. Based on this, and with some improvements for speed and size a successful implementation of a complete BCD12 plus BCD12 addition/subtraction would be something like this:


kkBCDSUB        xor     rBSgn,cnt_SMASK    ' I love long routines ;-)
kkBCDADD        mov     rt1,rASgn
                xor     rt1,rBSgn
                test    rt1,cnt_SMASK wz
        if_nz   jmp     #kkSUB
' falls to ADD15
' ******************************************
' ***
' *** addition of two bcd unpacked numbers rR = rA + rB

kkADD           mov     rt1,rAExp
                subs    rt1,rBExp wz
                abs     rt2,rt1
        if_z    jmp     #kkADD_20        ' adds, no shift
                cmp     rt1,#16 wc
        if_c    jmp     #kkADD_5        ' shifts B
                cmp     rt2,#16 wc
        if_c    jmp     #kkADD_10

                cmp     rt1,#16 wc
        if_c    call    #LOADBTOR
        if_nc   call    #LOADATOR
                jmp     #kkADD_ret

kkADD_5         call    #kkmSHRB15
                djnz    rt2,#kkADD_5
                mov     rRExp,rAExp          ' exponent of A
                jmp     #kkADD_20

kkADD_10        call    #kkmSHRA15
                djnz    rt2,#kkADD_10
                mov     rRExp,rBExp    ' exponent of B
kkADD_20        movs    kkmADD8_1,#rA1
                movs    kkmADD8_2,#rB1
                call    #kkmADDSM
                mov     rR1,rt6
                mov     rR,rt5
                and     rt5,cnt_MSD  wz
        if_nz   call    #kkmSHRR15
        if_nz   add     rRExp,#1
                mov     rRSgn,rASgn    ' sets sign from A
                jmp     #kkADD_ret

' ********************************************
' ***
' *** Substraction
' ***


kkSUB           mov     rt1,rAExp
                subs    rt1,rBExp wz
                movs    kkmSUB8_1,#rB1
                movs    kkmSUB8_2,#rA1         ' prepares for R=B-A
                abs     rt2,rt1
        if_z    jmp     #kkSUB_15    ' adds, no shift
                cmp     rt1,#16 wc
        if_c    jmp     #kkSUB_10    ' shifts B
                cmp     rt2,#16 wc
        if_c    jmp     #kkSUB_5
                cmp     rt1,#16 wc
        if_c    call    #LOADBTOR
        if_nc   call    #LOADATOR
                jmp     #kkSUB_ret

' B is bigger than A, we shift A and perform R=B-A
kkSUB_5         call    #kkmSHRA15
                djnz    rt2,#kkSUB_5
                jmp     #kkSUB_20

' exponents are equal, so check significand
kkSUB_15        call    #kkmCMP15
        if_c    jmp     #kkSUB_20          ' sig(A)<sig(B)
                jmp     #kkSUB_17
' A is bigger than B
kkSUB_10        call    #kkmSHRB15
                djnz    rt2,#kkSUB_10
kkSUB_17        movs    kkmSUB8_1,#rA1
                movs    kkmSUB8_2,#rB1
kkSUB_20
                mov     rRSgn,rASgn    ' transfers sign
                mov     rRExp,rAExp
                call    #kkmSUBSM
                mov     rR1,rt6
                mov     rR,rt5

kkSUB_25        and     rt5,cnt_D12 wz
        if_nz   jmp     #kkSUB_ret
                sub     rRExp,#1
                call    #kkmSHLR15
                call    #kkmCMPRZ    ' tests for zero
        if_nz   jmp     #kkSUB_25
                call    #LOADZTOR
kkSUB_30
kkADD_ret
kkBCDSUB_ret
kkBCDADD_ret
kkSUB_ret       ret

' Adds two numbers
kkmADDSM        call    #kkmADD8
                mov     rt6,rt5
                sub     kkmADD8_1,#1
                sub     kkmADD8_2,#1
                call    #kkmADD8C
kkmADDSM_ret    ret

kkmSUBSM        call    #kkmSUB8
                mov     rt6,rt5
                sub     kkmSUB8_1,#1
                sub     kkmSUB8_2,#1
                call    #kkmSUB8C
kkmSUBSM_ret    ret

' Adds two 8 digit longs
' carry is used in negative logic !

kkmADD8         mov     rcarry,#1   wc ' clears carry
                mov     rmsk1,#$f
kkmADD8C        mov     rt5,#0
                mov     rsh1,#10

kkmADD8_1       mov     rt3,0-0        '
                and     rt3,rmsk1

kkmADD8_2       mov     rt4,0-0
                and     rt4,rmsk1

         if_c   add     rt4,rcarry
                add     rt3,rt4   wc
         if_c   add     rt3,cnt_SIX          ' adds to convert to decimal if rightmost digit

kkmADD8_5 if_nc cmpsub  rt3,rsh1   wc
                or      rt5,rt3
                rol     rcarry,#4 ' magic
                rol     rmsk1,#4
                shl     rsh1,#4   wz
         if_nz  jmp     #kkmADD8_1
kkmADD8C_ret
kkmADD8_ret     ret

' Subs two 8 digit longs

kkmSUB8         mov     rcarry,#1   wc ' clrs carry flag
                mov     rmsk1,#$f
kkmSUB8C        mov     rt5,#0
                mov     rsh1,#10
kkmSUB8_1       mov     rt3,0-0
                and     rt3,rmsk1
kkmSUB8_2       mov     rt4,0-0
                and     rt4,rmsk1
       if_c     add     rt4,rcarry
                sub     rt3,rt4   wc

       if_c     add     rt3,rsh1
                or      rt5,rt3
                rol     rcarry,#4 ' magic
                rol     rmsk1,#4
                shl     rsh1,#4   wz
       if_nz    jmp     #kkmSUB8_1
kkmSUB8C_ret
kkmSUB8_ret     ret

' loads A to R
LOADATOR        mov    rR,rA
        mov    rR1,rA1
        mov    rRExp,rAExp
        mov    rRSgn,rBSgn
LOADATOR_ret    ret

' loads B to R
LOADBTOR        mov    rR,rB
        mov    rR1,rB1
        mov    rRExp,rBExp
        mov    rRSgn,rBSgn
LOADBTOR_ret    ret
' loads zero to R
LOADZTOR        mov    rR,#0
        mov    rR1,#0
        mov    rRExp,#0
        mov    rRSgn,#0
LOADZTOR_ret    ret

The subtraction is implemented wrapping the addition with a sign change for the subtraend. Note that there are actually three different stages. the first one represented by kkBCDADD (and kkBCDSUB) are the routines you should call when rA and rB (and the other related variables) have been loaded with the LOADA and LOADB routines described avobe. kkADD is called when the numbers (BCD12) should be added, i.e. when the sign of both are equal. To add the two parts then kkmADDSM is called. This last routine operates over rA:rA1 and rB:rB1 as if they where 16 digit numbers (ignoring the fact that some digits are always zero).

This code uses self-modifying techniques. This saves some longs used by variables and also some time. Compared to the previous routine kkmADD8 is 4 longs shorter and uses less variables, and the nice cmpsub instruction. The carry is now used in positive logic, i.e. a carry set means that the last add gave carry, (contrary to previous use).

Multiplication

Using a modified MUL8 routine, to support the self-modifying version of kkmADD8, a full multiplication ca be easily implemented as shown below.


' ********************************************************
' ****
' **** Multiplication R=A*B

kkBCDMUL        mov     rRExp,rAExp
                adds    rRExp,rBExp
' ** do not forget to check for overflow ;-)

                mov     rR,#0                  ' result significand
                mov     rR1,#0
                test    rB1,rB1 wz
        if_nz   call    #kkmMUL8               ' avoid 8 zeroes if necessary
kkBCDMUL_5      mov     rB1,rB
                call    #kkmMUL8

                call    #kkmSHLR15
                mov     rt1,rR
                and     rt1,cnt_D12 wz
        if_nz   add     rRExp,#1                ' increments exponent
        if_z    call    #kkmSHLR15              ' normalizes significand

                mov     rRSgn,rASgn
                xor     rRSgn,rBSgn
kkBCDMUL15_ret    ret

'
' multiplies A*rB1
'
' uses rt1, rt2, rt5, rt6, rcnt1, rp, rB1, rR, rR1
'
' clogged by mADD8 rt3, rt4, rt5, rcarry, rmsk1, rsh1
' clogged by mSHRR15 rt5, rR, rR1

kkmMUL8        mov     rt7,#8

kkmMUL8_5       mov     rcnt1,rB1
                and     rcnt1,#$f wz
                mov     rt6,#0
        if_z    jmp     #kkmMUL8_15

kkmMUL8_10      movs    kkmADD8_1,#rA1
                movs    kkmADD8_2,#rR1
                call    #kkmADD8
                mov     rR1,rt5
                movs    kkmADD8_1,#rA
                movs    kkmADD8_2,#rR
                call    #kkmADD8C
                mov     rR,rt5
        if_c    add     rt6,#1          ' carry counter
                djnz    rcnt1,#kkmMUL8_10

kkmMUL8_15      mov     rt5,#4
kkmMUL8_20      shr     rR,#1 wc
                rcr     rR1,#1
                djnz    rt5,#kkmMUL8_20
                ror     rt6,#4          ' convert to MSD
                or      rR,rt6          ' sets new carry digit
                shr     rB1,#4
                djnz    rt7,#kkmMUL8_5
kkmMUL8_ret     ret

Division

To divide we implement a similar algorithm as before, i.e. the same algorithm you learned at school. Some helping routines are necessary, shifts and compares. Let's see the code, but before we should note that x/0 and 0/0 will give some errors, in rerr.


.section COG cog0 ' needed for pPropellerSim (like DAT/org combination)
'
'
'
' BCD12 DIVISION
'
' A/B
' Destroys rA:rA1 rB:rB1

ERR_DIV0 = 1
ERR_DIV00 = 2


' ********************************************************
' ****
' ****
' **** Division R = A / B
' ****


kkBCDDIV        call     #kkmCMPBZ
        if_z    mov      rerr,#ERR_DIV0
        if_z    jmp      #kkBCDDIV_ret
                call     #kkmCMPAZ
        if_z    mov      rerr,#ERR_DIV00
        if_z    jmp      #kkBCDDIV_ret
' real division, exponent and sign
                mov      rR,#0
                mov      rR1,#0
                mov      rRExp,rAExp
                subs     rRExp,rBExp
                mov      rRSgn,rASgn
                xor      rASgn,rBSgn
                mov      rt7,#12    ' number of digits
                call     #kkmCMP15        ' compares A with B
        if_c    call     #kkmSHLA15
        if_c    subs     rRExp,#1        ' decrements exponent if A<B

kkBCDDIV_20     call     #kkmCMP15
        if_c    jmp      #kkBCDDIV_30
                movs     kkmSUB8_1,#rA1
                movs     kkmSUB8_2,#rB1
                call     #kkmSUB8
                mov      rA1,rt5
                movs     kkmSUB8_1,#rA
                movs     kkmSUB8_2,#rB
                call     #kkmSUB8C
                mov      rA,rt5
        if_nc   add      rR1,#1        ' increments count
                jmp      #kkBCDDIV_20

kkBCDDIV_30     call     #kkmSHLA15        ' if borrow, shift left
                call     #kkmSHLR15        ' shift for next digit
                djnz     rt7,#kkBCDDIV_20
                call     #kkmSHLR15
                call     #kkmSHLR15        ' adjusts result
kkBCDDIV_ret    ret

' shifts A left one digit
kkmSHLA15       mov     rt5,rA1
                shl     rA1,#4
                shl     rA,#4
                shr     rt5,#28
                or      rA,rt5
kkmSHLA15_ret   ret

' shifts A right one digit
kkmSHRA15       mov     rt5,rA
                shr     rA,#4
                shl     rt5,#28
                shr     rA1,#4
                or      rA1,rt5
kkmSHRA15_ret   ret

' shifts B left one digit
kkmSHLB15       mov      rt5,rB1
                shl      rB1,#4
                shl      rB,#4
                shr      rt5,#28
                or       rB,rt5
kkmSHLB15_ret   ret

' shifts B right one digit
kkmSHRB15       mov    rt5,rB
                shr    rB,#4
                shl    rt5,#28
                shr    rB1,#4
                or    rB1,rt5
kkmSHRB15_ret   ret

' shifts B left one digit
kkmSHLR15       mov      rt5,rR1
                shl      rR1,#4
                shl      rR,#4
                shr      rt5,#28
                or       rR,rt5
kkmSHLR15_ret   ret

' shifts R right one digit
kkmSHRR15       mov     rt5,rR
                shr     rR,#4
                shl     rt5,#28
                shr     rR1,#4
                or      rR1,rt5
kkmSHRR15_ret   ret

kkmCMP15        cmp     rA,rB   wc wz
        if_z    cmp     rA1,rB1   wc wz
kkmCMP15_ret    ret

kkmCMPRZ        test    rR,rR   wz
        if_z    test    rR1,rR1   wz
kkmCMPRZ_ret    ret

' checks if A or B are zero
kkmCMPAZ        test    rA,rA   wz
        if_z    test    rA1,rA1   wz
kkmCMPAZ_ret    ret

kkmCMPBZ        test    rB,rB   wz
        if_z    test    rB1,rB1   wz
kkmCMPBZ_ret    ret

cnt_SMASK       long    $8000_0000    ' sign mask
cnt_D12         long    $0f00_0000    ' MSD (digit 12) mask
cnt_2LMASK      long    $ffff_f000    ' low long mask

Square root

The calculus of the square root can be performed using a plurality of methods, while just some of them are useful for computers others are useful for calculation by hand. A modification of the hand method is shown below, implemented using some of the routines already described. The times were calculated using the corresponding arguments. The secret of the shorter calculatio time reside in the special shift routine kkmSHRRP. This routine will shift right only a part of a number using rt7 as index for this shift. It is implemented as two parts whether the shift occurs in the whole number or only on the right most long. For easier handling the significant is scaled by a factor of 5, making the multiplication by 20 (see hand algorithm at Wikipedia) unnecessary. The exponent is calculated using a simple dive by two in binary, because it is stored as a two's complement number.


' Calculates the square root of the argument in A
' As it is it takes new one, (old one was without self-modifying code):
'
' Input    cycles  cycles   result
'          old one new one
'      .78 158344  111192   0.883176086632
'     1.0    7012    6564   1.0
'     2.0  107940   76332   1.41421356237
'     5.0  169028  118560   2.23606797749
'    50.0  142408  100176   7.07106781186
'   100.0    7012    6564  10.0
'  1000.0  129128   90996   31.6227766016
' 1.3e+51  134440   94668   3.60555127546e+25
ERR_SQRN = 3

kkBCDSQR        call    #kkmCMPAZ
        if_z    call    #LOADZTOR
        if_z    jmp     #kkBCDSQR_ret       ' argument is zero

                cmp     rASgn,#0   wz
        if_nz   mov     rerr,#ERR_SQRN      ' argument is negative
        if_nz   jmp     #kkBCDSQR_ret

                movs    kkmADD8_1,#rA1
                movs    kkmADD8_2,#rA1        ' B+B
                call    #kkmADDSM
                mov     rB1,rt6
                mov     rB,rt5              ' B=A+A
                movs    kkmADD8_1,#rB1
                movs    kkmADD8_2,#rB1        ' B=4*A
                call    #kkmADDSM
                mov     rB1,rt6
                mov     rB,rt5
                movs    kkmADD8_1,#rB1
                movs    kkmADD8_2,#rA1        ' A=A+B
                call    #kkmADDSM             ' A=5*A
                mov     rA1,rt6
                mov     rA,rt5

                mov     rRExp,rAExp
                test    rAExp,#1   wz
        if_z    call    #kkmSHRA15            ' shift right if exponent was even

                mov     rR,cnt_FIVE
                mov     rR1,#0              ' rt6:rt7 is used to calculate the digits
                mov     rB,cnt_ONE
                mov     rB1,#0              ' we initialize constant
                mov     rt7,#12             ' 12 digits
kkBCDSQR_10     call    #kkmSHRRP
kkBCDSQR_17     cmp     rA,rR   wz wc
        if_z    cmp     rA1,rR1   wz wc

        if_c    jmp     #kkBCDSQR_20
' subtracts result
                movs    kkmSUB8_1,#rA1
                movs    kkmSUB8_2,#rR1
                call    #kkmSUBSM
                mov     rA1,rt6
                mov     rA,rt5
' adds one to result
                movs    kkmADD8_1,#rR1
                movs    kkmADD8_2,#rB1
                call    #kkmADDSM
                mov     rR1,rt6
                mov     rR,rt5
                jmp     #kkBCDSQR_17

kkBCDSQR_20     call    #kkmSHLA15              ' shifts left remainder
                call    #kkmSHRB15              ' shift right constant
kkBCDSQR_25     djnz    rt7,#kkBCDSQR_10

kkBCDSQR_ret    ret

' Rotate right with mask in rt7
kkmSHRRP        mov     rt4,cnt_ff            ' 0xffff_ffff
                mov     rt3,rt7               ' shift count
                cmpsub  rt3,#5   wc wr
                shl     rt3,#2
                shl     rt4,rt3               ' prepares mask
        if_c    jmp     #kkmSHRRP_20            ' we will see

                mov     rt2,rR1
                andn    rt2,rt4
                shr     rt2,#4
                and     rR1,rt4
                or      rR1,rt2
                jmp     #kkmSHRRP_ret

kkmSHRRP_20     shr     rR1,#4
                mov     rt2,rR
                shl     rt2,#28
                or      rR1,rt2               ' lower long ready
                mov     rt2,rR
                andn    rt2,rt4               ' right half of high word ready
                shr     rt2,#4
                and     rR,rt4
                or      rR,rt2
kkmSHRRP_ret    ret

Transcendentals

The need for floating point can be somewhat mitigated using for instance Fixed notation (a variant for binary floating point), but when transcendental functions are needed, it could be difficult to avoid its use. First of all we will consider several methods to implement some of the functions, angle functions, power and logarithm. With some basic identities, all possible functions can be obtained.

The first function to consider will be the sine function. As everyone knows there are a number of methods to calculate it, power series of several kinds (all related to the Taylor or McLaurin series) and the über-toll CORDIC methods. Floating point coprocessors calculate them using power series while pocket calculators on the other hand, (the HP series and TI series for example) use the CORDIC method, because they work with decimal numbers (BCD) not with binary numbers.

A Taylor series for the sine will be:


                 x³     x^5    x^7
   sin(x) = x - ---- + ---- - ---- + ... + E
                 3!     5!     7!

The error term will be the difference between the real value (with an infinite number of terms) and the one calculated with a reasonable number of terms. If 16 exact digits are to be calculated a 23 term series is needed. This means, storing 23 constants and doing 22 additions and 24 multiplications, using a simplified version (without Error term), for 4 terms there are 5 multiplications and 3 additions/subtractions:


                            1           1     x²
   sin(x) = x * (1 - x² * (--- + x² * (--- - ---)))
                            3!          5!    7!

When fast multiplication is available, this method could be used. In real life, some other considerations are taken, like angle reduction (using only 1 quadrant) and partial approximations using a table of pre-calculated constants.

The CORDIC method was developed, sadly, as an aid in missile guiding systems (read: to kill people). But despite that, some good came from it, as is it in widespread use for more edifying purposes, hopefully. It is based on an old method developed in the 17th century by mathematician Briggs. This method is based on algebraic functions, and can be thought out as an I infer the result working with the argument. There is no direct correlation between the argument and the result, because firstly an intermediate set of results has to be calculated to be used to calculate a result. (See Jacques Laporte's site)

Lets see it with a sine case:


  Soon to come

Note 1: This examples can be found in this file here-ready-to-be-tested-with-pPropellerSim-or-if-you-change-the-extension-and-add-a-wrapper...-a-ready-object-for-you-to-test-on-a-propeller.-Note-the-license-(LGPL-v2)!

Note 2: The double precision package is here.

All this is Copyright me :-), Pacito.Sys in accordance with the Creative Commons Share-Alike 3.0 License.