MATH - rosco-pc/propeller-wiki GitHub Wiki
This page is still under construction
The propeller contains a 32 bit ALU with support for signed and unsigned basic arithmetic, logic operations, and a barrel shifter that makes all the difference. Here it is assumed that you are familiar with the mnemonics used by the assembler, and that you know a bit of math. Some of these topics were covered in the "Propeller guts" document. (Note: all code was tested using pPropellerSim and, in the case of BCD math, an actual propeller too, but bugs could exist. Use it at your own risk, and read the terms of the license.)
Integer math, the four operations
Addition and subtraction are straightforward, multiplication and division require a bit more work.
add x,y wc, wz
This assumes that x and y already contain the values to add and that those are unsigned. The C flag will signal overflow, i.e. the result is bigger than 2³²-1, and the Z flag will signal a zero result.
For signed numbers adds performs the same operation. A signed number has a smaller range, from 0 to 2³¹-1 on the positive side and -1 to -2³¹ on the negative side. So 1 + -1 will be zero (really?) and -3 + -4 will be -7 (unbelievable). As the positive range is smaller than the unsigned one, $7FFF_FFFF plus $7FFF_FFFF will give $FFFF_FFFE, which read as a signed number is -2, and will raise the C flag.
adds x,y wc, wz
The C flag will again signal overflow, i.e. a result above 2³¹-1 or below -2³¹. The Z flag will again signal a zero result.
If you think that 32 bits is not enough you can concatenate several operations together using the C flag to extend the word size, so for 64 bit arguments we have (addx is the addition with carry version of add).
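add xl,yl wc, wz
addx xh,yh wc, wz
The addx instruction takes the carry, if any, from the first operation on the lower 32 bits and adds it to the upper 32 bits. C works as before. With addx the Z flag is only set if it was already set and the new result is zero, which is why the first add also writes Z: the final Z then covers the whole 64 bit result. If you need still more precision, more addx instructions can be chained in the same way.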
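The carry chain can be checked on a PC with a small reference model. This is only a Python sketch, not propeller code: `add32` and `add64` are hypothetical helper names, and the model assumes the C flag is simply bit 32 of the unsigned sum.

```python
MASK32 = 0xFFFF_FFFF

def add32(x, y, carry_in=0):
    """Model of add/addx: returns (32 bit result, carry out)."""
    total = (x & MASK32) + (y & MASK32) + carry_in
    return total & MASK32, total >> 32

def add64(xh, xl, yh, yl):
    """Chains two 32 bit additions the way add followed by addx does."""
    lo, c = add32(xl, yl)        # add  xl,yl wc
    hi, c = add32(xh, yh, c)     # addx xh,yh wc
    return hi, lo, c

# $0000_0001_FFFF_FFFF + 1: the carry out of the low long reaches the high long
print(add64(0x0000_0001, 0xFFFF_FFFF, 0x0000_0000, 0x0000_0001))
```

Chaining a third or fourth long works the same way: every extra stage just receives the carry of the previous one.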
Subtraction works in a similar manner using sub and subx:
sub xl,yl wc, wz
subx xh,yh wc, wz
To multiply, the propeller uses the good old shift and add method, because it lacks a multiply instruction. But before that let us consider a special case: multiplication by constants. As we know, constants have the ability to conserve their value over time (!), so a fixed multiplication can save a few longs here and there. The barrel shifter is essential for this to be smaller than the general multiplication shown below. As we also know, multiplication distributes over addition, and that is the key to many common values:
For x*10 = x*2 + x*8 = (x + x*4)*2
shl x,#1
mov r,x
shl x,#2
add r,x
or...
mov r,x
add r,x
shl x,#3
add r,x
For x*80 = x*16 + x*64
shl x,#4
mov r,x
shl x,#2
add r,x
The source argument is x and r is the result. The propeller lacks a lea (load effective address) instruction so some neat tricks that can be exploited in x86 or 68k assembly, like multiplying by 3, 5 and 9 in one instruction, are out of the question.
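To convince yourself that the shift sequences really produce x*10 and x*80, they can be replayed step by step. The following is a Python sketch under the assumption that registers simply wrap at 32 bits; `times10` and `times80` are names made up for this illustration.

```python
MASK32 = 0xFFFF_FFFF

def times10(x):
    x = (x << 1) & MASK32      # shl x,#1  -> 2x
    r = x                      # mov r,x
    x = (x << 2) & MASK32      # shl x,#2  -> 8x
    return (r + x) & MASK32    # add r,x   -> 2x + 8x

def times80(x):
    x = (x << 4) & MASK32      # shl x,#4  -> 16x
    r = x                      # mov r,x
    x = (x << 2) & MASK32      # shl x,#2  -> 64x
    return (r + x) & MASK32    # add r,x   -> 16x + 64x

print(times10(7), times80(3))
```

Note that, as in the assembly, the source x is destroyed along the way and only r holds the result.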
The good old shift-add method works with the two arguments and a result register; y doubles as the loop counter (usable up to 16*16 bits):
mov r,#0
loop
shr y,#1 wc
if_c add r,x
shl x,#1
tjnz y,#loop
This will only work if just the lower 32 bits of a 17*16 to 32*32 bit product are desired. Detection of overflow (r ≥ 2³²) requires that the carry of the add instruction be honored and passed through; after the loop exits, the status of the C flag must of course be checked.
mov r,#0
loop
shr y,#1 wc
if_c add r,x wc
shl x,#1
if_nc tjnz y,#loop
If a full 64 bits result is needed (rh:rl)... well some changes are required:
mov rh,#0
mov rl,#0
mov xh,#0
loop
shr y,#1 wc
if_nc jmp #loop2
add rl,xl wc
addx rh,xh
loop2
shl xl,#1 wc
rcl xh,#1
tjnz y,#loop
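The 64 bit loop can be cross-checked against ordinary integer multiplication. This Python sketch mirrors the structure of the loop above; the name `mul_shift_add` is hypothetical, and Python's unbounded integers stand in for the rh:rl and xh:xl register pairs.

```python
MASK32 = 0xFFFF_FFFF

def mul_shift_add(x, y):
    """32x32 -> 64 bit unsigned multiply, one multiplier bit per pass."""
    x &= MASK32
    y &= MASK32
    r = 0                      # rh:rl
    while y:                   # tjnz y,#loop
        if y & 1:              # shr y,#1 wc
            r += x             # add rl,xl wc / addx rh,xh
        y >>= 1
        x <<= 1                # shl xl,#1 wc / rcl xh,#1
    return (r >> 32) & MASK32, r & MASK32

print(mul_shift_add(0xFFFF_FFFF, 0xFFFF_FFFF))
```

The loop stops as soon as y runs out of set bits, which is why the smaller argument should be the one that is shifted right.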
Of course there are variations; the loop can be unrolled if we know how many effective bits one of the arguments has. If y is always smaller than x, it is better to test y against zero (the tjnz instruction) than to test x, which can reduce the running time. These examples were coded for unsigned numbers. For signed ones, save the sign of the result first, take the absolute value of both arguments with abs, and negate the result at the end if necessary:
mov s,x ' saves sign of x
xor s,y ' calculates sign of result
abs x,x ' calculates absolute value of x
abs y,y ' of y ...
mov r,#0
loop
shr y,#1 wc
if_c add r,x wc
shl x,#1
if_nc tjnz y,#loop
mov s,s wc ' sets C accordingly to the sign
negc r,r ' negates result if necessary
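The sign handling can be tried out on a PC as well. The sketch below is a Python model, not propeller code: `muls` and `to_signed` are illustrative names, and the early exit on overflow of the assembly version is left out for brevity.

```python
MASK32 = 0xFFFF_FFFF

def to_signed(v):
    """Reads a 32 bit pattern as a two's complement number."""
    v &= MASK32
    return v - (1 << 32) if v & 0x8000_0000 else v

def muls(x, y):
    s = (x ^ y) & 0x8000_0000                     # mov s,x / xor s,y
    x, y = abs(to_signed(x)), abs(to_signed(y))   # abs x,x / abs y,y
    r = 0
    while y:
        if y & 1:
            r = (r + x) & MASK32
        y >>= 1
        x = (x << 1) & MASK32
    if s:                                         # mov s,s wc / negc r,r
        r = (-r) & MASK32
    return to_signed(r)

print(muls(-3 & MASK32, 4), muls(-3 & MASK32, -4 & MASK32))
```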
The division requires a similar algorithm, but we subtract instead of add, x = x / y:
mov t,#16
shl y,#15
loop
cmpsub x,y wc
rcl x,#1
djnz t,#loop
The use of cmpsub reduces the loop to only 3 instructions per iteration, a nice bonus. If space is not a constraint these loops can be unrolled and up to 30% of the time saved. The initial shift is one less than the loop count, so if just 8 bits are used the first mov should load 8 and the shift should be 7.
Let's investigate cmpsub a bit more. As you know, cmp is the sub instruction with the nr effect in place (do not write the result back), so it only affects flags. Flags, as always, must be explicitly requested. cmp will raise C when the source is bigger than the destination. cmpsub will raise C if the source is smaller than or equal to the destination, and in that case will subtract the source from the destination, placing the result into the destination:
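A Python model of the loop shows where quotient and remainder end up. It is a sketch under two assumptions that are not spelled out above: the quotient fits in the chosen number of bits, and the divisor is pre-shifted by one less than the loop count. The names `cmpsub` and `div` are only used for this illustration.

```python
MASK32 = 0xFFFF_FFFF

def cmpsub(x, y):
    """Model of cmpsub x,y wc: subtracts only when x >= y, C reports it."""
    if x >= y:
        return x - y, 1
    return x, 0

def div(x, y, bits=16):
    """Returns (quotient, remainder); the quotient must fit in `bits` bits."""
    y <<= bits - 1                    # shl y,#15 for bits = 16
    for _ in range(bits):             # mov t,#16 ... djnz t,#loop
        x, c = cmpsub(x, y)           # cmpsub x,y wc
        x = ((x << 1) | c) & MASK32   # rcl x,#1
    return x & ((1 << bits) - 1), x >> bits

print(div(1000, 7))       # 16 bit version
print(div(200, 9, 8))     # 8 bit version: 8 passes, divisor shifted by 7
```

In this model the quotient is collected in the low bits of x while the remainder is left in the bits above it.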
cmpsub x,y wc
x long 5
y long 7
This will neither raise C nor modify x.
cmpsub x,y wc
x long 12
y long 7
This will raise C and subtract y from x, leaving 5 in x. This instruction can be exploited when a sequence like this is found:
cmp x,y wc
if_nc sub x,y wc
x long 5
y long 7
It may not be a big difference, but a long saved here and there helps when there are not that many of them.
The barrel shifter
The propeller has some neat tricks, as we saw with cmpsub. It packs some more, and the barrel shifter is one of them. Most small processors, i.e. 8-bit ones and some 16-bit ones (Z80, H8/300, HC11, ...), have simple 1-bit shifters. You may be familiar with the typical case: converting a binary value to a hex string requires some 4 shifts per high digit. The cog in the propeller is not your average 8 bit processor, it is a 32 bit one, and a modern one! So a barrel shifter was included, like in any serious (ARM, SH) processor. This shifter can perform a shift by any number of bits between 0 and 31 in exactly the same amount of time, because the shifter actually... has no shift registers!! It uses multiplexers instead. The routines shown for multiply and divide make use of this, with shifts of 8 or 16 bits in one instruction. I cannot stress enough how useful this is. The behaviour of the carry flag could seem a bit awkward, but it has its motives (the zero flag is always set if the result is zero). The C flag receives the original bit 31 for a left shift, or the original bit 0 for a right shift, regardless of whether it was a 10 bit shift or a 1 bit shift.
Simple binary multiplications by powers of two can be implemented using the shifter. Note that adding a value to itself has the same effect as a 1-bit left shift (shl); in some architectures they are the same instruction.
The instruction sar shifts right arithmetically, which means it takes care of the sign. If a number is negative it will remain negative; if it is positive it will stay positive until it reaches zero. This is very useful to sign extend a number:
shl x,#16 ' shifts left, sign (bit 15) becomes bit 31
sar x,#16 ' shifts conserving the bit 31 status, sign extending the number to 32 bits
NOTE: When two's complement numbers are not 32 bit quantities, i.e. the sign is in a bit other than bit 31, the method described above can be used to sign extend the number. A subsequent call to abs will then safely return the absolute value of the number. Using abs before the number is sign extended will leave the number unmodified.
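The shl/sar pair generalizes to any field width: shift left by 32 minus the width, then arithmetically right by the same amount. A Python sketch of the idea, where `sign_extend` is a hypothetical helper and Python's `>>` on a negative integer plays the role of sar:

```python
MASK32 = 0xFFFF_FFFF

def sign_extend(value, bits):
    shift = 32 - bits
    x = (value << shift) & MASK32     # shl x,#shift: field sign lands in bit 31
    if x & 0x8000_0000:               # read the register as a signed number
        x -= 1 << 32
    return x >> shift                 # sar x,#shift: copies bit 31 downwards

print(sign_extend(0xFFFE, 16), sign_extend(0x7FFF, 16), sign_extend(0x800, 12))
```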
The C and Z flags
The two flags available can be tested and modified by almost all instructions, provided that the corresponding effects are in place. The carry flag, C, indicates a number of different things depending on the instruction: parity, bit set or reset, carry, borrow, etc. In comparison instructions it indicates borrow (mostly). Sometimes it could be useful to set or reset it.
To set C we can do:
mov x,#1
shr x,#1 wc
To clear C we can do the following. It will not modify x because of the nr effect, but it will reset C (actually C mirrors bit 31 of the value being moved):
mov x,#0 wc nr
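Both tricks follow from where each instruction takes its C from. The Python model below restates the rules given in this section (left shift: old bit 31, right shift: old bit 0, mov: bit 31 of the source); the function names are made up for the illustration and shift counts are assumed to be in the 0 to 31 range.

```python
MASK32 = 0xFFFF_FFFF

def shl_wc(x, n):
    """shl x,#n wc -> (result, C), C is the original bit 31."""
    x &= MASK32
    return (x << n) & MASK32, x >> 31

def shr_wc(x, n):
    """shr x,#n wc -> (result, C), C is the original bit 0."""
    x &= MASK32
    return x >> n, x & 1

def mov_wc(src):
    """mov x,src wc -> (result, C), C is bit 31 of the source."""
    src &= MASK32
    return src, src >> 31

print(shr_wc(1, 1))    # sets C:   mov x,#1 / shr x,#1 wc
print(mov_wc(0))       # clears C: mov x,#0 wc nr
```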
FFT
A working implementation, also written by me, of a Radix-2 FFT algorithm can be found here.
Fixed point math
An article about fixed point math can be found here. (It needs some more examples and descriptions, I'm working on that.)
Double precision binary floating point
The single precision binary floating point library from the obex is known to almost everyone, but when single precision is not enough, double precision can solve the problem with its 53 bits of significand providing up to 16 decimal digits. Having a 32 bit ALU, the propeller can compute these numbers quite fast: using unrolled loops for multiplication and division, around 2000 cycles are required for either function.
The format as per the standard is
63 62 52 51 0
+---+--...-+---------...-----------+
| s | exp | significand |
+---+--...-+---------...-----------+
The significand sign is bit 63, set when negative. The exponent is biased by 1023, which means that all numbers with a magnitude of 1 or greater have a stored exponent of 1023 or bigger. The exponent 2047 is used to represent Infinities and NaNs (Not a Number). Infinities have a zeroed significand while NaNs have a non-zero one. An exponent of zero means either that the number is zero or that it is a denormalized number. The support for denormalized numbers (those whose bit 53 is zero) is somewhat lacking in the following routines. The significand is 53 bits long, but the most significant bit is assumed set when the exponent is non-zero and is thus not stored.
Addition and subtraction are the simplest and fastest: adding 53 bits requires only 2 instructions; checking for bad input, scaling and sign management take the rest. A possible implementation is as follows (LGPL v 2.0 code), see the link at the bottom for a file with all the routines.
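The field layout is easy to inspect on a PC, which helps when preparing test vectors for the routines. The following Python sketch uses the standard `struct` module; `unpack_double` and the returned category strings are hypothetical names chosen for this example, not part of the propeller code.

```python
import struct

def unpack_double(value):
    """Splits a double into (sign, stored exponent, significand, kind)."""
    bits = struct.unpack(">Q", struct.pack(">d", value))[0]
    sign = bits >> 63
    exp = (bits >> 52) & 0x7FF
    frac = bits & ((1 << 52) - 1)
    if exp == 0x7FF:
        kind = "inf" if frac == 0 else "nan"
    elif exp == 0:
        kind = "zero" if frac == 0 else "denormal"
    else:
        kind = "normal"
        frac |= 1 << 52            # adds the implicit leading bit
    return sign, exp, frac, kind

print(unpack_double(1.0))          # stored exponent equals the bias, 1023
print(unpack_double(0.5))          # below 1, so the stored exponent is 1022
```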
' Adds two double precision numbers
' they should be already unpacked, result goes to R
dSUB xor rBSgn,cnt_h8 ' changes sign of B
dADD test flags,#FLG_NAN|FLG_INF wz
if_nz call #dLOADRNAN
if_nz jmp #dADD_ret
mov rt1,rASgn
xor rt1,rBSgn wz
if_nz jmp #dSUB_1
mov rt1,rAExp
subs rt1,rBExp wz
abs rt2,rt1
mov rRExp,rAExp
if_z jmp #dADD_20
cmp rt1,#53 wc
if_c jmp #dADD_5 ' shifts B
cmp rt2,#53 wc
if_c jmp #dADD_10 ' shifts A
cmps rt1,#53 wc ' signed compare: C is set when B has the larger exponent
if_c call #dLOADBTOR
if_nc call #dLOADATOR
jmp #dADD_ret
dADD_5 shr rB,#1 wc
rcr rB1,#1
djnz rt2,#dADD_5
jmp #dADD_20
dADD_10 shr rA,#1 wc
rcr rA1,#1
djnz rt2,#dADD_10
mov rRExp,rBExp
dADD_20 mov rR1,rA1
add rR1,rB1 wc
mov rR,rA
addx rR,rB
test rR,cnt_bit54 wz
if_nz shr rR,#1 wc
if_nz rcr rR1,#1
if_nz add rRExp,#1
mov rRSgn,rASgn
dADD_ret
dSUB_ret ret
dSUB_1 mov rt1,rAExp
subs rt1,rBExp wz
abs rt2,rt1
if_z jmp #dSUB_10 ' subs, no shift, exponents are equal
cmp rt1,#53 wc
if_c jmp #dSUB_5 ' shifts B
cmp rt2,#53 wc
if_c jmp #dSUB_15 ' shifts A
cmps rt1,#53 wc ' signed compare: C is set when B has the larger exponent
if_c call #dLOADBTOR
if_nc call #dLOADATOR
jmp #dSUB_ret
' exp of A is bigger than exp of B
dSUB_5 shr rB,#1 wc
rcr rB1,#1
djnz rt2,#dSUB_5
' R=A-B
mov rRSgn,rASgn ' transfers sign
mov rRExp,rAExp
mov rR1,rA1
sub rR1,rB1 wc, wz
mov rR,rA
subx rR,rB wz
if_nz jmp #dSUB_25
jmp #dSUB_35
' exponents are equal, so check significand
dSUB_10 cmp rA1,rB1 wc, wz
cmpx rA,rB wc, wz
if_c jmp #dSUB_20 ' sig(A)<sig(B)
jmp #dSUB_35 ' numbers are equal
' B is bigger than A, we shift A and perform R=B-A
dSUB_15 shr rA,#1 wc
rcr rA1,#1
djnz rt2,#dSUB_15
' R=B-A
dSUB_20 mov rRSgn,rBSgn ' transfers sign
mov rRExp,rBExp
mov rR1,rB1
sub rR1,rA1 wc
mov rR,rB
subx rR,rA
dSUB_25 mov rt1,#53
dSUB_30 test rR,cnt_bit53 wz
if_nz jmp #dSUB_ret
sub rRExp,#1
shl rR1,#1 wc
rcl rR,#1
djnz rt1,#dSUB_30 ' normalizes
dSUB_35 call #dLOADZTOR
jmp #dSUB_ret
Multiplication requires a bit more work. A method similar to the one explained before is used, but in two stages, because a 96 bit partial result is kept instead of a full 107 bit one when the most significant bits are known to be zero. Exponents are added as usual and signs are xored.
' ********************************************************
' ****
' **** Multiplication R=A*B
dMUL test flags,#FLG_NAN|FLG_INF wz
if_nz call #dLOADRNAN
if_nz jmp #dMUL_ret
test flags,#FLG_Z wz
if_nz call #dLOADZTOR ' if any of the numbers is zero
if_nz jmp #dMUL_ret
mov rRExp,rAExp
adds rRExp,rBExp
' ** do not forget to check for overflow ;-)
shl rA,#11
mov rt1,rA1
shl rA1,#11
shr rt1,#21
or rA,rt1 ' shifts to accommodate final product
mov rt4,#32 ' 32 bits first
mov rR,#0 ' result significand
mov rR1,#0
mov rt1,#0
mov rt2,#0
mov rt3,#0
dMUL_10 shr rB,#1 wc
rcr rB1,#1 wc
if_nc jmp #dMUL_12
add rt2,rA1 wc
addx rt1,rA wc
addx rR1,rt3 wc
dMUL_12 shl rA1,#1 wc
rcl rA,#1 wc
rcl rt3,#1
djnz rt4,#dMUL_10
' 32 bits are done, now we multiply the other 21
mov rt2,#0
dMUL_15 shr rB1,#1 wc
if_nc jmp #dMUL_17
add rt1,rA wc
addx rR1,rt3 wc
addx rR,rt2
dMUL_17 shl rA,#1 wc
rcl rt3,#1 wc
rcl rt2,#1
tjnz rB1,#dMUL_15
test rR,cnt_bit53 wz
if_nz add rRExp,#1 ' increments exponent
if_z shl rt1,#1 wc
if_z rcl rR1,#1 wc
if_z rcl rR,#1
shl rt1,#1 wc ' rounds up
addx rR1,#0 wc
addx rR,#0
test rR,cnt_bit54 wz
if_nz add rRExp,#1 ' increments exponent
if_nz shr rR,#1 wc
if_nz rcr rR1,#1
dMUL_20 mov rRSgn,rASgn
xor rRSgn,rBSgn
dMUL_ret ret
Division also needs a loop, and a possible implementation follows:
' double division, R=A/B
'
dDIV test flags,#FLG_NAN|FLG_INF wz
if_nz call #dLOADRNAN
if_nz jmp #dDIV_ret
test flags,#FLG_Z wz
if_z jmp #dDIV_2
test flags,#FLGB_Z wz
if_nz or flags,#FLG_ERR_DIV0 ' x/0 (even 0/0)
if_z call #dLOADZTOR ' 0/x = 0
jmp #dDIV_ret
dDIV_2 mov rRExp,rAExp
subs rRExp,rBExp
shl rA,#11
mov rt1,rA1
shl rA1,#11
shr rt1,#21
or rA,rt1 ' shifts to accommodate final product
shl rB,#11
mov rt1,rB1
shl rB1,#11
shr rt1,#21
or rB,rt1 ' shifts to accommodate final product
mov rt1,#53
cmp rA1,rB1 wc, wz
cmpx rA,rB wc, wz
if_c jmp #dDIV_4
if_nz sub rt1,#1
sub rA1,rB1 wc
subx rA,rB
dDIV_4 shr rB,#1 wc
rcr rB1,#1
sub rRExp,#1
mov rR1,#0
mov rR,#0 wc
dDIV_5 cmp rA1,rB1 wc,wz
cmpx rA,rB wc
if_c jmp #dDIV_10
sub rA1,rB1 wc
subx rA,rB wc
dDIV_10 rcl rR1,#1 wc
rcl rR,#1 wc
shl rA1,#1 wc
rcl rA,#1
djnz rt1,#dDIV_5
xor rR,cnt_f
xor rR1,cnt_f
dDIV_20 and rR,cnt_sigh ' clears garbled bits
dDIV_ret ret
A comparison routine could be implemented as follows:
' Compares A and B, sets the flags accordingly.
' Two NaNs will give a non equal result
dCMP test flags,#FLG_NAN|FLG_INF wz
if_nz mov rt1,#2 wz, wc
if_nz jmp #dCMP_ret ' two NaNs or Infs give a diff result
cmp rBSgn,rASgn wz, wc
if_nz jmp #dCMP_ret ' different signs, neg < pos
cmps rAExp,rBExp wz, wc
if_nz jmp #dCMP_ret ' they are different
cmp rA1,rB1 wz, wc
cmpx rA,rB wz, wc
dCMP_ret ret
Some support code is also needed, to pack and unpack numbers, to set flags and so on.
' loads A from pointer ptr and unpacks
dLOADA andn flags,#FLGA_Z|FLGA_INF|FLGA_NAN
rdlong rA,ptr
add ptr,#4
mov rAExp,rA
rdlong rA1,ptr
mov rASgn,rA
and rASgn,cnt_h8
add ptr,#4
andn rA,cnt_sexp wz
if_z mov rA1,rA1 wz
if_z or flags,#FLGA_Z ' temporary Z flag if significand is zero
and rAExp,cnt_exp
cmp rAExp,cnt_exp wc ' checks for NaN or Inf
if_nc test flags,#FLGA_Z wz
if_nc_and_z or flags,#FLGA_INF ' infinity
if_nc_and_nz or flags,#FLGA_NAN ' Not a number
shr rAExp,#20 wz
if_nz or rA,cnt_bit53 ' adds implicit 53 bit if not zero
if_nz sub rAExp,cnt_bias ' subtract bias if is not zero
if_nz andn flags,#FLGA_Z ' is not zero anymore
dLOADA_ret ret
' loads B from pointer ptr and unpacks
dLOADB andn flags,#FLGB_Z|FLGB_INF|FLGB_NAN
rdlong rB,ptr
add ptr,#4
mov rBExp,rB
rdlong rB1,ptr
mov rBSgn,rB
and rBSgn,cnt_h8
add ptr,#4
andn rB,cnt_sexp wz
if_z mov rB1,rB1 wz
if_z or flags,#FLGB_Z ' temporary Z flag if significand is zero
and rBExp,cnt_exp
cmp rBExp,cnt_exp wc ' checks for NaN or Inf
if_nc test flags,#FLGB_Z wz
if_nc_and_z or flags,#FLGB_INF ' infinity
if_nc_and_nz or flags,#FLGB_NAN ' Not a number
shr rBExp,#20 wz
if_nz or rB,cnt_bit53 ' adds implicit 53 bit if not zero
if_nz sub rBExp,cnt_bias ' subtract bias if is not zero
if_nz andn flags,#FLGB_Z ' is not zero anymore
dLOADB_ret ret
' packs and saves R to ptr, destroys rR, rRExp
dSAVER andn rR,cnt_bit53 ' removes implicit 1 at bit 53
or rR,rRSgn
test rRExp,rRExp wz
if_nz add rRExp,cnt_bias ' if it is zero, keep it zero
shl rRExp,#20
or rR,rRExp
wrlong rR,ptr
add ptr,#4
wrlong rR1,ptr
add ptr,#4
dSAVER_ret ret
' loads NaN into R
' exp = 0x7ff
' non-zero significand
dLOADRNAN mov rRExp,cnt_NaN
mov rR,cnt_bit52
mov rR1,#0
dLOADRNAN_ret ret
' loads A into R
dLOADATOR mov rR,rA
mov rR1,rA1
mov rRExp,rAExp
mov rRSgn,rASgn
dLOADATOR_ret ret
' loads B into R
dLOADBTOR mov rR,rB
mov rR1,rB1
mov rRExp,rBExp
mov rRSgn,rBSgn
dLOADBTOR_ret ret
' loads zero into R
dLOADZTOR mov rR,#0
mov rR1,#0
mov rRExp,#0
mov rRSgn,#0
dLOADZTOR_ret ret
' Numbers are unpacked for easier manipulation
' explicit 53rd bit is added
rA long $10_0000
rA1 long 0'$5555_5555
rAExp long $3ff
rASgn long 0
rB long $10_0000
rB1 long 0'$5555_5555
rBExp long $3ff
rBSgn long 0
rR long 0
rR1 long 0
rRExp long 0
rRSgn long 0
rt1 long 0
rt2 long 0
rt3 long 0
rt4 long 0
rt5 long 0
ptr long 0
flags long 0
cnt_h8 long $8000_0000
cnt_exp long $7ff0_0000
cnt_sexp long $fff0_0000
cnt_bit54 long $0020_0000
cnt_bit53 long $0010_0000
cnt_bit52 long $0008_0000
cnt_bias long $3ff
cnt_NaN long $7ff
cnt_f long $ffff_ffff
cnt_sigh long $1f_ffff
BCD MATH
As everyone knows, BCD stands for binary coded decimal, i.e. each decimal digit is represented in binary and all operations are done on decimal digits. From our standpoint that means: exactly as we would do them on paper. BCD arithmetic has some caveats and some advantages compared to binary arithmetic:
Pros:
- No rounding problems when working with human-readable numbers
- Fastest floating-point to string conversion
- I like it
- BCD floating point
Cons:
- Slower
- Wastes space
- Poor support in the assembler
Number representation
To represent every decimal digit, a minimum of 4 bits is needed. That means a long (32 bits) is needed for 8 digits, while in binary a long can hold any 9 digit number and some 10 digit ones.
For an 8 digit number 12345678
+---+---+---+---+---+---+---+---+
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
+---+---+---+---+---+---+---+---+
For the same number the binary would have been $BC614E (24 bits instead of 32)
The four basic arithmetic operations are performed on BCD numbers as we would do them on paper, but the implementations are not that straightforward, because carry/borrow from digit to digit has to be considered whenever a digit addition produces a number greater than 9, something we hardly think about with binary numbers. Binary operations have to be used because the propeller lacks BCD ones, as well as daa (decimal adjust after addition, as found in the Z80, HC11, etc.) or similar. A compare and conditionally add/subtract method has to be used, operating on a digit-by-digit basis.
' Adds two 8 digit numbers (longs)
' carry is used in negative logic !
mADD8 mov rcarry,#3
shr rcarry,#1 wc ' sets carry flag
mADD8C mov rmsk1,#$f
mov rt5,#0
mov rsh1,#10
mADD8_1 mov rt3,rt1
and rt3,rmsk1
mov rt4,rt2
and rt4,rmsk1
if_nc add rt4,rcarry
add rt3,rt4 wc
if_nc jmp #mADD8_5
mov rcarry,#1 wc ' clears carry flag for next round
jmp #mADD8_ret
mADD8_5 cmp rt3,rsh1 wc
if_nc sub rt3,rsh1
or rt5,rt3
rol rcarry,#4 ' magic
shl rsh1,#4
shl rmsk1,#4 wz
if_nz jmp #mADD8_1
mADD8C_ret
mADD8_ret ret
The code operates on longs, i.e. 8 digits: rt1 is added to rt2 with the result in rt5. Several helper variables are needed: rmsk1 is the digit mask, shifted left after every cycle and also used as the loop counter; rcarry is the value added to the digit to the left of the one just added; rt3 is the current digit from rt1 and rt4 the one from rt2; rsh1 is the overflow value to compare against, also shifted left after every cycle.
As you can see this small routine replaces the comfortable and short binary add :).
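The digit-by-digit idea is perhaps easier to follow in a high level model. The Python sketch below is not a translation of mADD8: it extracts each nibble instead of shifting the mask, and it keeps the carry in a plain variable in positive logic. `bcd_add8` is a name invented for this example.

```python
def bcd_add8(a, b, carry_in=0):
    """Adds two packed 8 digit BCD longs, returns (result, carry out)."""
    result = 0
    carry = carry_in
    for shift in range(0, 32, 4):                 # one nibble per pass, LSD first
        digit = ((a >> shift) & 0xF) + ((b >> shift) & 0xF) + carry
        carry = 1 if digit >= 10 else 0           # compare against 10 ...
        if carry:
            digit -= 10                           # ... and adjust the digit
        result |= digit << shift
    return result, carry

print(hex(bcd_add8(0x12345678, 0x87654321)[0]))
print(bcd_add8(0x99999999, 0x00000001))
```

Passing the carry out of one call as `carry_in` of the next extends the addition to 16 or more digits.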
The subtraction could be implemented like this:
' Subs two 8 digit numbers (longs)
mSUB8 mov rcarry,#1 wc ' clears carry flag
mSUB8C mov rmsk1,#$f
mov rt5,#0
mov rsh1,#10
mSUB8_1 mov rt3,rt1
and rt3,rmsk1
mov rt4,rt2
and rt4,rmsk1
if_c add rt4,rcarry
sub rt3,rt4 wc
if_c add rt3,rsh1
or rt5,rt3
rol rcarry,#4 ' magic
shl rsh1,#4
shl rmsk1,#4 wz
if_nz jmp #mSUB8_1
mSUB8C_ret
mSUB8_ret ret
This routine uses the same tricks and variables as before, but the C flag is used in positive logic instead.
Both routines can be concatenated by calling mADD8/mSUB8 first and mADD8C/mSUB8C later, without modifying the C flag in between.
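Concatenation simply means that the borrow (or carry) of the low long feeds the high long. A Python sketch of a 16 digit subtraction built that way, with `bcd_sub8` and `bcd_sub16` as hypothetical names and the borrow held in a variable rather than in the C flag:

```python
def bcd_sub8(a, b, borrow_in=0):
    """Subtracts packed 8 digit BCD longs (a - b), returns (result, borrow out)."""
    result = 0
    borrow = borrow_in
    for shift in range(0, 32, 4):
        digit = ((a >> shift) & 0xF) - ((b >> shift) & 0xF) - borrow
        borrow = 1 if digit < 0 else 0
        if borrow:
            digit += 10                    # borrows 10 from the next digit
        result |= digit << shift
    return result, borrow

def bcd_sub16(ah, al, bh, bl):
    lo, borrow = bcd_sub8(al, bl)          # like mSUB8 on the low longs
    hi, borrow = bcd_sub8(ah, bh, borrow)  # like mSUB8C on the high longs
    return hi, lo, borrow

print([hex(v) for v in bcd_sub16(0x1, 0x0, 0x0, 0x1)])
```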
To multiply, the shift-add method as we would do it by hand is one of the few resources left. Sadly, in comparison with binary multiplication, the short add has to be replaced with a call to mADD8, but the logic is similar. Here rA:rA1 is multiplied by rB1 using the pair rR:rR1 as result.
' multiplies A*rB1
'
' uses rt1, rt2, rt5, rt6, rcnt1, rp, rB1, rR, rR1
'
' clobbered by mADD8: rt3, rt4, rt5, rcarry, rmsk1, rsh1, rR, rR1
mMUL8 mov rt7,#8
mMUL8_5 mov rcnt1,rB1
and rcnt1,#$f wz
mov rt6,#0
if_z jmp #mMUL8_15
mMUL8_10 mov rt1,rA1
mov rt2,rR1
call #mADD8
mov rR1,rt5
mov rt1,rA
mov rt2,rR
call #mADD8C
mov rR,rt5
if_nc add rt6,#1 ' carry counter
djnz rcnt1,#mMUL8_10
mMUL8_15 mov rt5,rR
shl rt5,#4
shr rR1,#4
or rR1,rt5 ' shift right rR:rR1
ror rt6,#4 ' convert to MSD
or rR,rt6 ' sets new carry digit
shr rB1,#4
djnz rt7,#mMUL8_5
mMUL8_ret ret
What this small fragment of code does is add rA:rA1 to the result as many times as each digit of rB1 says, taking the digits from right to left. mADD8 and mADD8C take care of adding rA:rA1 to rR:rR1. After each digit of rB1, rR:rR1 is shifted right to discard the rightmost digit and to make room for the new MSD.
All this may seem like a real waste, but it opens the door to BCD floating point, the real goal.
Floating point BCD (BCD12)
To complete the package, we should talk about how to operate on whole floating point numbers. For that we will consider the following notation, which we will call BCD12 from now on (more possibilities are of course available, and the principles explained here apply to them too):
+---+---+---+---+---+---+---+---+
long 0 : | S |MSD| A | 9 | 8 | 7 | 6 | 5 |
+---+---+---+---+---+---+---+---+
+---+---+---+---+---+---+---+---+
long 1 : | 4 | 3 | 2 | 1 |LSD| E2| E1| E0|
+---+---+---+---+---+---+---+---+
The floating point number occupies 2 longs in HUB memory (4 in COG memory) and is packed according to the diagram above, where each cell represents a nibble (4 bits). S is the significand sign, MSD to LSD are the significand digits, 12 in total, and the exponent occupies the last three nibbles. The exponent is a two's complement 12 bit number. Negative exponents represent negative powers of 10.
The number Zero is represented as two zeroed longs.
To better exploit the propeller capabilities, the HUB representation is unpacked to 4 longs in COG memory (it may seem a waste, but the access to the different parts in every routine saves many more longs than the 2 extra used for the unpacked representation):
Representation in cog's memory.
+---+---+---+---+---+---+---+---+
long 0 : | 0 |MSD| A | 9 | 8 | 7 | 6 | 5 | rA, rB, rR
+---+---+---+---+---+---+---+---+
+---+---+---+---+---+---+---+---+
long 1 : | 4 | 3 | 2 | 1 |LSD| 0 | 0 | 0 | rA1, rB1, rR1
+---+---+---+---+---+---+---+---+
+---+---+---+---+---+---+---+---+
long 2 : | es| es| es| es| es| E2| E1| E0| rAExp, rBExp, rRExp
+---+---+---+---+---+---+---+---+
+---+---+---+---+---+---+---+---+
long 3 : | S | 0 | 0 | 0 | 0 | 0 | 0 | 0 | rASgn, rBSgn, rRSgn
+---+---+---+---+---+---+---+---+
As you can see this representation keeps the form of the packed version, for easier access. The exponent is sign extended with the shl/sar combination as seen in the load routines below. This allows for fast add/compare/subtract of exponents.
A conversion routine that loads rA:rA1, rAExp and rASgn could be this one, using ptr1 as source pointer:
' Loads A from a BCD12
LOADA rdlong rA,ptr1 ' reads first long
add ptr1,#4
mov rASgn,rA
rdlong rA1,ptr1 ' reads 2nd long
and rASgn,cnt_SMASK
andn rA,cnt_SMASK
mov rAExp,rA1
shl rAExp,#20 ' exponent is signed
sar rAExp,#20
and rA1,cnt_2LMASK
LOADA_ret ret
cnt_SMASK long $8000_0000 ' sign mask
cnt_D12 long $0f00_0000 ' MSD (digit 12) mask
cnt_2LMASK long $ffff_f000 ' low long mask
The packing of a result to HUB memory could be implemented as below using ptr1 as destination pointer:
' saves R as a BCD12
SAVER mov rt1,rR
or rt1,rRSgn
mov rt2,rRExp
andn rt2,cnt_2LMASK
mov rt3,rR1
and rt3,cnt_2LMASK
wrlong rt1,ptr1
or rt2,rt3
add ptr1,#4
wrlong rt2,ptr1
SAVER_ret ret
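For testing, the same packing and unpacking can be modelled on a PC. This Python sketch follows the masks used by LOADA and SAVER; `unpack_bcd12` and `pack_bcd12` are illustrative names, and the sign is kept as the raw $8000_0000 bit just as rASgn does.

```python
MASK32 = 0xFFFF_FFFF
SMASK = 0x8000_0000        # cnt_SMASK
LMASK = 0xFFFF_F000        # cnt_2LMASK

def unpack_bcd12(long0, long1):
    """Returns (hi digits, lo digits, exponent, sign) like LOADA."""
    sign = long0 & SMASK
    hi = long0 & ~SMASK & MASK32          # andn rA,cnt_SMASK
    exp = long1 & 0xFFF
    if exp & 0x800:                       # shl rAExp,#20 / sar rAExp,#20
        exp -= 0x1000
    lo = long1 & LMASK
    return hi, lo, exp, sign

def pack_bcd12(hi, lo, exp, sign):
    """Returns (long0, long1) like SAVER."""
    return hi | sign, (lo & LMASK) | (exp & 0xFFF)

fields = unpack_bcd12(0x8123_4567, 0x8901_2FFE)
print(fields)                             # exponent field $FFE reads as -2
print([hex(v) for v in pack_bcd12(*fields)])
```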
The packing scheme now seems to fit nicely. The extra empty digit at the left of the MSD is used as a guard digit during addition and multiplication, and thus plays to our advantage. The empty digits at the right of the LSD (cleared using cnt_2LMASK) carry a performance penalty because mADD8/mSUB8 still compute them; the most significant of the group could be used, for instance, to round results better.
From ASCII to BCD12
To convert a string to a BCD12 (or for that matter to binary floating point) a set of rules, as always, should be put into action. These rules help to determine what can be accepted as valid input and what cannot.
- Spaces preceding the first valid character should be ignored. No spaces are allowed in between.
- The only valid characters are the digits 0 to 9, the signs + and -, the period . and the letter e.
- An optional significand sign; it should be the first valid character if present.
- The next valid symbol is either a digit or the period.
- A number of digits (any digit beyond 12 will be chopped, but the extra digits may add to the exponent if the number has an exponent greater than 12); if only a period was present, a minimum of one digit must follow.
- An exponent composed of three parts: the letter e to indicate it, an optional sign + or - and a minimum of 1 digit to a maximum of three digits.
All these numbers must then be valid:
- -0.0012
- 123400000
- 0000123000
- +45e-123
- -0.0000004e+40 (why someone would write a number like this is beyond me)
Some invalid combinations include:
- -e+4 (no significand digits)
- a1e-4 (invalid character)
- 4.5e- (no exponent digits)
- . (no significand digits)
With those rules we should be able to write some nice code. Note: the use of assembler for this purpose is... not recommended, as this routine will be not only complicated but above all long, and not time critical at all. But as an exercise it is a good one.
ASCIITOBCD
ASCIITOBCD_ret ret
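While the assembly routine is still missing, the acceptance rules themselves can be pinned down with a few lines of Python. This is only a validity check, not a conversion, and it is one possible reading of the rules: `is_valid_number` is a hypothetical name, only a lowercase e is accepted, and a trailing period such as in 4.e+6 is allowed.

```python
import re

# leading spaces, optional sign, digits with an optional period, optional exponent
_NUMBER = re.compile(r" *([+-]?)(\d*)(?:\.(\d*))?(?:e([+-]?\d{1,3}))?")

def is_valid_number(text):
    m = _NUMBER.fullmatch(text)
    if m is None:
        return False
    int_digits = m.group(2)
    frac_digits = m.group(3) or ""
    return bool(int_digits or frac_digits)    # at least one significand digit

for s in ["-0.0012", "+45e-123", "-e+4", "a1e-4", "4.5e-", "."]:
    print(repr(s), is_valid_number(s))
```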
From BCD12 to ASCII
Conversion from BCD12 to ASCII is quite straightforward, but some points should be noted. How many places do we need to represent the number? That is the question to be asked. A complementary one: are we going to use all the available range or just a small subset of numbers? As the propeller has a limited COG memory, a small and specifically tailored conversion may be the way to go.
Small and tailored
A small and tailored conversion can be seen as a quick-and-dirty approach. Let's say that the numbers to represent are in the tens of thousands and decimals are unimportant; then we can truncate them without looking back. A possibility could be:
' Converts a number in the 10000 to 99999 range to ascii
' Number is in r, sign and decimals are unimportant
' ptr1 is destination
'
TOASCIIQD mov rRExp,rRExp wc 'checks for negative exponent,
if_c mov rt1,#48
if_c call #EMITASCII
if_c jmp #TOASCIIQD_40
' number may be in range
cmp rRExp,#5 wc
if_nc mov rt1,#69 ' E signals error
if_nc call #EMITASCII
if_nc jmp #TOASCIIQD_40
' number is in range
mov rt2,rRExp
add rt2,#1
mov rt3,rR ' working significant
TOASCIIQD_10 mov rt1,rt3
shr rt1,#24
and rt1,#15
add rt1,#48 ' converts digit to ASCII
call #EMITASCII
shl rt3,#4
djnz rt2,#TOASCIIQD_10
TOASCIIQD_40 mov rt1,#0
call #EMITASCII
TOASCIIQD_ret ret
' writes an ascii to HUB and increments pointer
' rt1 is the byte to write
' ptr1 is the pointer
EMITASCII wrbyte rt1,ptr1
add ptr1,#1
EMITASCII_ret ret
The helper routine EMITASCII does... well, exactly that, and increments the pointer. Small helper routines can save a few longs here and there, but they add 8 cycles to the execution each time they are called. So for simple and short loops they are the way to go.
If we were to consider rounding, things may get... slower, and longer. The previous example can be taken as rounded with the floor function, or simply put: truncation. Rounding to nearest can be implemented by adding 0.5. A simple call to mADD8 with a properly formatted argument can be used, something like this:
' Converts a number in the 10000 to 99999 range to ascii
' Number is in r, sign is unimportant. Rounding is performed if the number is >= 1
' ptr1 is destination
'
TOASCIIQDR mov rRExp,rRExp wc 'checks for negative exponent,
if_c mov rt1,#48
if_c call #EMITASCII
if_c jmp #TOASCIIQDR_40
' number may be in range
cmp rRExp,#5 wc
TOASCIIQDR_5
if_nc mov rt1,#69 ' E signals error
if_nc call #EMITASCII
if_nc jmp #TOASCIIQDR_40
' number is in range, adds rounding factor
mov rt1,rR
mov rt2,#5 ' rounding argument
mov rt3,#5
sub rt3,rRExp
shl rt3,#2
shl rt2,rt3 ' adjust rounding digit
call #mADD8
mov rt2,rRExp
test rt5,cnt_MSD wz
if_nz add rt2,#1 ' increments exponent if overflow
if_nz shr rt5,#4 ' shifts working significant one to the right
cmp rt2,#5 wc
if_nc jmp #TOASCIIQDR_5
add rt2,#1
TOASCIIQDR_10 mov rt1,rt5
shr rt1,#24
and rt1,#15
add rt1,#48 ' converts digit to ASCII
call #EMITASCII
shl rt5,#4
djnz rt2,#TOASCIIQDR_10
TOASCIIQDR_40 mov rt1,#0
call #EMITASCII
TOASCIIQDR_ret ret
' writes an ascii to HUB and increments pointer
' rt1 is the byte to write
' ptr1 is the pointer
EMITASCII wrbyte rt1,ptr1
add ptr1,#1
EMITASCII_ret ret
If all numbers have to be converted, i.e. the full range at full precision, 12 digits plus sign and exponent, a maximum of 19 characters will be generated. Shorter representations, when applicable, are possible depending on the number. A good criterion is that if the number of digits to represent is greater than, say, 12, a number with an exponent should be used. It is easier to read 1234 than 1.234e+3. The latter will be called scientific notation. A smart routine that can differentiate between these cases has three parts. The first is the conversion to scientific notation if the exponent is greater than 12 in any direction.
Note: The BCD12 representation of numbers in following examples (the ones on the left) are written in scientific notation for clarity.
The second is the representation of numbers smaller than one, where zeroes should be inserted between the decimal point and the first digit of the significand; the number of zeroes in this case is the absolute value of the exponent minus 1.
The third and final case is that of numbers greater than 1 that have 12 or fewer digits: zeroes should be removed after the decimal point if there are no significant digits to the right, and zeroes should be inserted between the last digit at the right and the decimal point if required (this last case is the most complex one of the three).
Number | Representation | Comment |
---|---|---|
1.34e100 | 1.34e100 | No prettier form available, exponent > 12 |
3.4e-1 | .34 | |
3.4e-5 | .000034 | In this case some zeroes were added after the decimal point. Those zeroes are not in the original BCD12 number, because they are all normalized |
1.2e1 | 12 | not like 1.2e1 or 12.0000000000 |
1.234e1 | 12.34 | |
4.e+6 | 4000000 | In this case zeroes were added between the last significant digit, the 4, and the decimal point. |
Now, let's see some code:
TOASCII
TOASCII_ret ret
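Until the assembly version is written, the three cases can be prototyped in Python to check them against the table. The sketch below makes some assumptions of its own: the significand arrives as a digit string whose first digit is non-zero, the value is d.ddd times 10 to the exponent, and `format_bcd12` is a made-up name.

```python
def format_bcd12(digits, exp, negative=False):
    """Formats a normalized significand digit string and a decimal exponent."""
    digits = digits.rstrip("0")               # trailing zeroes carry no information
    if not digits:
        return "0"
    sign = "-" if negative else ""
    if abs(exp) > 12:                         # case 1: scientific notation
        mantissa = digits[0] + ("." + digits[1:] if len(digits) > 1 else "")
        return f"{sign}{mantissa}e{exp}"
    if exp < 0:                               # case 2: smaller than one
        return sign + "." + "0" * (-exp - 1) + digits
    whole = digits[:exp + 1].ljust(exp + 1, "0")   # case 3: pad up to the point
    frac = digits[exp + 1:]
    return sign + whole + ("." + frac if frac else "")

for d, e in [("134", 100), ("34", -1), ("34", -5), ("12", 1), ("1234", 1), ("4", 6)]:
    print(format_bcd12(d, e))
```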
Addition and subtraction of BCD12 numbers
The routine shown before allows for addition of two 8-digit BCD numbers. Based on this, and with some improvements for speed and size, a successful implementation of a complete BCD12 plus BCD12 addition/subtraction would be something like this:
kkBCDSUB xor rBSgn,cnt_SMASK ' I love long routines ;-)
kkBCDADD mov rt1,rASgn
xor rt1,rBSgn
test rt1,cnt_SMASK wz
if_nz jmp #kkSUB
' falls through to kkADD
' ******************************************
' ***
' *** addition of two bcd unpacked numbers rR = rA + rB
kkADD mov rt1,rAExp
subs rt1,rBExp wz
abs rt2,rt1
if_z jmp #kkADD_20 ' adds, no shift
cmp rt1,#16 wc
if_c jmp #kkADD_5 ' shifts B
cmp rt2,#16 wc
if_c jmp #kkADD_10
cmps rt1,#16 wc ' signed compare: C is set only if B has the larger exponent
if_c call #LOADBTOR
if_nc call #LOADATOR
jmp #kkADD_ret
kkADD_5 call #kkmSHRB15
djnz rt2,#kkADD_5
mov rRExp,rAExp ' exponent of A
jmp #kkADD_20
kkADD_10 call #kkmSHRA15
djnz rt2,#kkADD_10
mov rRExp,rBExp ' exponent of B
kkADD_20 movs kkmADD8_1,#rA1
movs kkmADD8_2,#rB1
call #kkmADDSM
mov rR1,rt6
mov rR,rt5
and rt5,cnt_MSD wz
if_nz call #kkmSHRR15
if_nz add rRExp,#1
mov rRSgn,rASgn ' sets sign from A
jmp #kkADD_ret
' ********************************************
' ***
' *** Subtraction
' ***
kkSUB mov rt1,rAExp
subs rt1,rBExp wz
movs kkmSUB8_1,#rB1
movs kkmSUB8_2,#rA1 ' prepares for R=B-A
abs rt2,rt1
if_z jmp #kkSUB_15 ' equal exponents, no shift
cmp rt1,#16 wc
if_c jmp #kkSUB_10 ' shifts B
cmp rt2,#16 wc
if_c jmp #kkSUB_5
cmps rt1,#16 wc ' signed compare: C is set only if B has the larger exponent
if_c call #LOADBTOR
if_nc call #LOADATOR
jmp #kkSUB_ret
' B is bigger than A, we shift A and perform R=B-A
kkSUB_5 call #kkmSHRA15
djnz rt2,#kkSUB_5
jmp #kkSUB_20
' exponents are equal, so check significand
kkSUB_15 call #kkmCMP15
if_c jmp #kkSUB_20 ' sig(A)<sig(B)
jmp #kkSUB_17
' A is bigger than B
kkSUB_10 call #kkmSHRB15
djnz rt2,#kkSUB_10
kkSUB_17 movs kkmSUB8_1,#rA1
movs kkmSUB8_2,#rB1
kkSUB_20
mov rRSgn,rASgn ' transfers sign
mov rRExp,rAExp
call #kkmSUBSM
mov rR1,rt6
mov rR,rt5
kkSUB_25 and rt5,cnt_D12 wz
if_nz jmp #kkSUB_ret
sub rRExp,#1
call #kkmSHLR15
call #kkmCMPRZ ' tests for zero
if_nz jmp #kkSUB_25
call #LOADZTOR
kkSUB_30
kkADD_ret
kkBCDSUB_ret
kkBCDADD_ret
kkSUB_ret ret
' Adds two numbers
kkmADDSM call #kkmADD8
mov rt6,rt5
sub kkmADD8_1,#1
sub kkmADD8_2,#1
call #kkmADD8C
kkmADDSM_ret ret
kkmSUBSM call #kkmSUB8
mov rt6,rt5
sub kkmSUB8_1,#1
sub kkmSUB8_2,#1
call #kkmSUB8C
kkmSUBSM_ret ret
' Adds two 8 digit longs
' carry is used in negative logic !
kkmADD8 mov rcarry,#1 wc ' clears carry
mov rmsk1,#$f
kkmADD8C mov rt5,#0
mov rsh1,#10
kkmADD8_1 mov rt3,0-0 '
and rt3,rmsk1
kkmADD8_2 mov rt4,0-0
and rt4,rmsk1
if_c add rt4,rcarry
add rt3,rt4 wc
if_c add rt3,cnt_SIX ' adds to convert to decimal if rightmost digit
kkmADD8_5 if_nc cmpsub rt3,rsh1 wc
or rt5,rt3
rol rcarry,#4 ' magic
rol rmsk1,#4
shl rsh1,#4 wz
if_nz jmp #kkmADD8_1
kkmADD8C_ret
kkmADD8_ret ret
' Subs two 8 digit longs
kkmSUB8 mov rcarry,#1 wc ' clrs carry flag
mov rmsk1,#$f
kkmSUB8C mov rt5,#0
mov rsh1,#10
kkmSUB8_1 mov rt3,0-0
and rt3,rmsk1
kkmSUB8_2 mov rt4,0-0
and rt4,rmsk1
if_c add rt4,rcarry
sub rt3,rt4 wc
if_c add rt3,rsh1
or rt5,rt3
rol rcarry,#4 ' magic
rol rmsk1,#4
shl rsh1,#4 wz
if_nz jmp #kkmSUB8_1
kkmSUB8C_ret
kkmSUB8_ret ret
' loads A to R
LOADATOR mov rR,rA
mov rR1,rA1
mov rRExp,rAExp
mov rRSgn,rASgn ' sign comes from A
LOADATOR_ret ret
' loads B to R
LOADBTOR mov rR,rB
mov rR1,rB1
mov rRExp,rBExp
mov rRSgn,rBSgn
LOADBTOR_ret ret
' loads zero to R
LOADZTOR mov rR,#0
mov rR1,#0
mov rRExp,#0
mov rRSgn,#0
LOADZTOR_ret ret
The subtraction is implemented by wrapping the addition with a sign change for the subtrahend. Note that there are actually three different stages. The first one, represented by kkBCDADD (and kkBCDSUB), contains the routines you should call once rA and rB (and the other related variables) have been loaded with the LOADA and LOADB routines described above. kkADD is called when the BCD12 numbers should be added, i.e. when the signs of both are equal. To add the two parts, kkmADDSM is then called. This last routine operates on rA:rA1 and rB:rB1 as if they were 16-digit numbers (ignoring the fact that some digits are always zero).
This code uses self-modifying techniques, which saves some longs otherwise used by variables and also some time. Compared to the previous routine, kkmADD8 is 4 longs shorter, uses fewer variables, and makes use of the nice cmpsub instruction. The carry is now used in positive logic, i.e. a set carry means that the last add produced a carry (contrary to the previous use).
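The alignment step performed by kkADD can be sketched in Python as follows. It is a simplified model under stated assumptions: the significands are plain integers with exactly `DIGITS` decimal digits, there are no guard digits, and the names `bcd_add_same_sign` and `DIGITS` are made up for this illustration.

```python
DIGITS = 12  # significand length assumed for this model


def bcd_add_same_sign(a_sig, a_exp, b_sig, b_exp):
    """Add two normalized numbers of equal sign.

    Each value is sig * 10**(exp - DIGITS + 1), with 10**(DIGITS-1) <= sig < 10**DIGITS.
    Returns (sig, exp) of the normalized, truncated sum.
    """
    if a_exp < b_exp:                       # make A the operand with the larger exponent
        a_sig, a_exp, b_sig, b_exp = b_sig, b_exp, a_sig, a_exp
    shift = a_exp - b_exp
    if shift >= DIGITS:                     # B is too small to change the result
        return a_sig, a_exp
    b_sig //= 10 ** shift                   # shift B right, one digit per exponent step
    total = a_sig + b_sig
    exp = a_exp
    if total >= 10 ** DIGITS:               # a 13th digit appeared: renormalize
        total //= 10
        exp += 1
    return total, exp


print(bcd_add_same_sign(150000000000, 0, 250000000000, -1))  # 1.5 + 0.25
```

The PASM version works on 16-digit registers, so it keeps a few extra digits during the shift that this model simply truncates.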
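What kkmADD8 and kkmADDSM compute can be checked against a small Python model. This sketch does not reproduce the mask rotation or the cmpsub trick; it only shows the digit-by-digit packed BCD addition with a positive-logic carry. The function names are assumptions.

```python
def bcd_add8(a, b, carry_in=0):
    """Add two packed BCD longs (8 digits, 4 bits each).

    Returns (sum_word, carry_out), carry in positive logic.
    """
    result = 0
    carry = carry_in
    for pos in range(8):                    # least significant digit first
        shift = 4 * pos
        digit = ((a >> shift) & 0xF) + ((b >> shift) & 0xF) + carry
        if digit >= 10:                     # decimal overflow of this digit
            digit -= 10
            carry = 1
        else:
            carry = 0
        result |= digit << shift
    return result, carry


def bcd_add16(a_hi, a_lo, b_hi, b_lo):
    """Chain two 8-digit additions, low long first, as kkmADDSM does."""
    lo, carry = bcd_add8(a_lo, b_lo)
    hi, carry = bcd_add8(a_hi, b_hi, carry)
    return hi, lo, carry


print(hex(bcd_add8(0x12345678, 0x11111111)[0]))  # 0x23456789
```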
Multiplication
Using a modified MUL8 routine that supports the self-modifying version of kkmADD8, a full multiplication can be easily implemented as shown below.
' ********************************************************
' ****
' **** Multiplication R=A*B
kkBCDMUL mov rRExp,rAExp
adds rRExp,rBExp
' ** do not forget to check for overflow ;-)
mov rR,#0 ' result significand
mov rR1,#0
test rB1,rB1 wz
if_nz call #kkmMUL8 ' avoid 8 zeroes if necessary
kkBCDMUL_5 mov rB1,rB
call #kkmMUL8
call #kkmSHLR15
mov rt1,rR
and rt1,cnt_D12 wz
if_nz add rRExp,#1 ' increments exponent
if_z call #kkmSHLR15 ' normalizes significand
mov rRSgn,rASgn
xor rRSgn,rBSgn
kkBCDMUL_ret ret
'
' multiplies A*rB1
'
' uses rt1, rt2, rt5, rt6, rcnt1, rp, rB1, rR, rR1
'
' clobbered by mADD8 rt3, rt4, rt5, rcarry, rmsk1, rsh1
' clobbered by mSHRR15 rt5, rR, rR1
kkmMUL8 mov rt7,#8
kkmMUL8_5 mov rcnt1,rB1
and rcnt1,#$f wz
mov rt6,#0
if_z jmp #kkmMUL8_15
kkmMUL8_10 movs kkmADD8_1,#rA1
movs kkmADD8_2,#rR1
call #kkmADD8
mov rR1,rt5
movs kkmADD8_1,#rA
movs kkmADD8_2,#rR
call #kkmADD8C
mov rR,rt5
if_c add rt6,#1 ' carry counter
djnz rcnt1,#kkmMUL8_10
kkmMUL8_15 mov rt5,#4
kkmMUL8_20 shr rR,#1 wc
rcr rR1,#1
djnz rt5,#kkmMUL8_20
ror rt6,#4 ' convert to MSD
or rR,rt6 ' sets new carry digit
shr rB1,#4
djnz rt7,#kkmMUL8_5
kkmMUL8_ret ret
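The scheme used by kkmMUL8 (add the multiplicand as many times as the current multiplier digit says, then shift the accumulator one digit to the right) can be modelled on plain integers. This is a sketch, not a transcription: the name `mul_shift_add` is an assumption, and the `low` value holds the digits that the PASM routine drops, kept here only so the result can be checked exactly.

```python
def mul_shift_add(a, b, digits=8):
    """Multiply a by the lowest `digits` decimal digits of b.

    Returns (acc, low) with a * b == acc * 10**digits + low when b < 10**digits.
    """
    acc = 0                                 # running partial product
    low = 0                                 # digits shifted out of the accumulator
    for i in range(digits):
        d = b % 10                          # current multiplier digit, LSD first
        b //= 10
        for _ in range(d):                  # repeated addition instead of a digit multiply
            acc += a
        low += (acc % 10) * 10 ** i         # digit about to be shifted out
        acc //= 10                          # shift accumulator right one digit
    return acc, low


acc, low = mul_shift_add(12345678, 87654321)
print(acc * 10**8 + low == 12345678 * 87654321)  # True
```

Shifting the accumulator instead of the multiplicand is what lets the routine keep the most significant digits of the product in a fixed-size register pair.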
Division
To divide we implement an algorithm similar to the previous ones, i.e. the same long division you learned at school. Some helper routines are necessary for shifts and compares. Before looking at the code, note that x/0 and 0/0 will report an error in rerr.
.section COG cog0 ' needed for pPropellerSim (like DAT/org combination)
'
'
'
' BCD12 DIVISION
'
' A/B
' Destroys rA:rA1 rB:rB1
ERR_DIV0 = 1
ERR_DIV00 = 2
' ********************************************************
' ****
' ****
' **** Division R = A / B
' ****
kkBCDDIV call #kkmCMPBZ
if_z mov rerr,#ERR_DIV0
if_z jmp #kkBCDDIV_ret
call #kkmCMPAZ
if_z mov rerr,#ERR_DIV00
if_z jmp #kkBCDDIV_ret
' real division, exponent and sign
mov rR,#0
mov rR1,#0
mov rRExp,rAExp
subs rRExp,rBExp
mov rRSgn,rASgn
xor rRSgn,rBSgn ' result sign = sign(A) xor sign(B)
mov rt7,#12 ' number of digits
call #kkmCMP15 ' compares A with B
if_c call #kkmSHLA15
if_c subs rRExp,#1 ' decrements exponent if A<B
kkBCDDIV_20 call #kkmCMP15
if_c jmp #kkBCDDIV_30
movs kkmSUB8_1,#rA1
movs kkmSUB8_2,#rB1
call #kkmSUB8
mov rA1,rt5
movs kkmSUB8_1,#rA
movs kkmSUB8_2,#rB
call #kkmSUB8C
mov rA,rt5
if_nc add rR1,#1 ' increments count
jmp #kkBCDDIV_20
kkBCDDIV_30 call #kkmSHLA15 ' if borrow, shift left
call #kkmSHLR15 ' shift for next digit
djnz rt7,#kkBCDDIV_20
call #kkmSHLR15
call #kkmSHLR15 ' adjusts result
kkBCDDIV_ret ret
' shifts A left one digit
kkmSHLA15 mov rt5,rA1
shl rA1,#4
shl rA,#4
shr rt5,#28
or rA,rt5
kkmSHLA15_ret ret
' shifts A right one digit
kkmSHRA15 mov rt5,rA
shr rA,#4
shl rt5,#28
shr rA1,#4
or rA1,rt5
kkmSHRA15_ret ret
' shifts B left one digit
kkmSHLB15 mov rt5,rB1
shl rB1,#4
shl rB,#4
shr rt5,#28
or rB,rt5
kkmSHLB15_ret ret
' shifts B right one digit
kkmSHRB15 mov rt5,rB
shr rB,#4
shl rt5,#28
shr rB1,#4
or rB1,rt5
kkmSHRB15_ret ret
' shifts R left one digit
kkmSHLR15 mov rt5,rR1
shl rR1,#4
shl rR,#4
shr rt5,#28
or rR,rt5
kkmSHLR15_ret ret
' shifts R right one digit
kkmSHRR15 mov rt5,rR
shr rR,#4
shl rt5,#28
shr rR1,#4
or rR1,rt5
kkmSHRR15_ret ret
kkmCMP15 cmp rA,rB wc wz
if_z cmp rA1,rB1 wc wz
kkmCMP15_ret ret
kkmCMPRZ test rR,rR wz
if_z test rR1,rR1 wz
kkmCMPRZ_ret ret
' checks if A or B are zero
kkmCMPAZ test rA,rA wz
if_z test rA1,rA1 wz
kkmCMPAZ_ret ret
kkmCMPBZ test rB,rB wz
if_z test rB1,rB1 wz
kkmCMPBZ_ret ret
cnt_SMASK long $8000_0000 ' sign mask
cnt_D12 long $0f00_0000 ' MSD (digit 12) mask
cnt_2LMASK long $ffff_f000 ' low long mask
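The digit loop of kkBCDDIV can be modelled in Python as a restoring division by repeated subtraction. This is a sketch on plain integers under stated assumptions: both significands are normalized to the same number of digits, and the name `div_significands` and its return convention are made up for this illustration.

```python
def div_significands(a, b, digits=12):
    """Long division of two normalized significands of equal length.

    Returns (quotient, exp_adjust): `digits` quotient digits as an integer,
    and the exponent correction (-1 when A had to be pre-shifted).
    """
    exp_adjust = 0
    if a < b:                               # make sure the first digit is non-zero
        a *= 10
        exp_adjust = -1
    quotient = 0
    for _ in range(digits):
        count = 0
        while a >= b:                       # subtract B until the remainder is smaller
            a -= b
            count += 1
        quotient = quotient * 10 + count    # append the new quotient digit
        a *= 10                             # shift the remainder left one digit
    return quotient, exp_adjust


print(div_significands(100000000000, 300000000000))  # 1/3
```

Each pass of the inner loop runs at most nine times, which is why the cycle count of the PASM routine depends on the digits of the quotient.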
Square root
The square root can be calculated using a number of methods; some of them are useful for computers, others for calculation by hand. A modification of the hand method is shown below, implemented using some of the routines already described. The times listed in the code were measured with the arguments shown. The secret of the shorter calculation time resides in the special shift routine kkmSHRRP. This routine shifts right only a part of a number, using rt7 as the index for the shift. It is implemented in two parts, depending on whether the shift spans the whole number or only the rightmost long. For easier handling the significand is scaled by a factor of 5, making the multiplication by 20 (see the hand algorithm at Wikipedia) unnecessary. The exponent is calculated using a simple divide by two in binary, because it is stored as a two's complement number.
' Calculates the square root of the argument in A
' Timings of the new version (the old one did not use self-modifying code):
'
' Input cycles cycles result
' old one new one
' .78 158344 111192 0.883176086632
' 1.0 7012 6564 1.0
' 2.0 107940 76332 1.41421356237
' 5.0 169028 118560 2.23606797749
' 50.0 142408 100176 7.07106781186
' 100.0 7012 6564 10.0
' 1000.0 129128 90996 31.6227766016
' 1.3e+51 134440 94668 3.60555127546e+25
ERR_SQRN = 3
kkBCDSQR call #kkmCMPAZ
if_z call #LOADZTOR
if_z jmp #kkBCDSQR_ret ' argument is zero
cmp rASgn,#0 wz
if_nz mov rerr,#ERR_SQRN ' argument is negative
if_nz jmp #kkBCDSQR_ret
movs kkmADD8_1,#rA1
movs kkmADD8_2,#rA1 ' B+B
call #kkmADDSM
mov rB1,rt6
mov rB,rt5 ' B=A+A
movs kkmADD8_1,#rB1
movs kkmADD8_2,#rB1 ' B=4*A
call #kkmADDSM
mov rB1,rt6
mov rB,rt5
movs kkmADD8_1,#rB1
movs kkmADD8_2,#rA1 ' A=A+B
call #kkmADDSM ' A=5*A
mov rA1,rt6
mov rA,rt5
mov rRExp,rAExp
test rAExp,#1 wz
if_z call #kkmSHRA15 ' shift right if exponent was even
mov rR,cnt_FIVE
mov rR1,#0 ' rt6:rt7 is used to calculate the digits
mov rB,cnt_ONE
mov rB1,#0 ' we initialize constant
mov rt7,#12 ' 12 digits
kkBCDSQR_10 call #kkmSHRRP
kkBCDSQR_17 cmp rA,rR wz wc
if_z cmp rA1,rR1 wz wc
if_c jmp #kkBCDSQR_20
' subtracts result
movs kkmSUB8_1,#rA1
movs kkmSUB8_2,#rR1
call #kkmSUBSM
mov rA1,rt6
mov rA,rt5
' adds one to result
movs kkmADD8_1,#rR1
movs kkmADD8_2,#rB1
call #kkmADDSM
mov rR1,rt6
mov rR,rt5
jmp #kkBCDSQR_17
kkBCDSQR_20 call #kkmSHLA15 ' shifts left remainder
call #kkmSHRB15 ' shift right constant
kkBCDSQR_25 djnz rt7,#kkBCDSQR_10
kkBCDSQR_ret ret
' Rotate right with mask in rt7
kkmSHRRP mov rt4,cnt_ff ' 0xffff_ffff
mov rt3,rt7 ' shift count
cmpsub rt3,#5 wc wr
shl rt3,#2
shl rt4,rt3 ' prepares mask
if_c jmp #kkmSHRRP_20 ' we will see
mov rt2,rR1
andn rt2,rt4
shr rt2,#4
and rR1,rt4
or rR1,rt2
jmp #kkmSHRRP_ret
kkmSHRRP_20 shr rR1,#4
mov rt2,rR
shl rt2,#28
or rR1,rt2 ' lower long ready
mov rt2,rR
andn rt2,rt4 ' right half of high word ready
shr rt2,#4
and rR,rt4
or rR,rt2
kkmSHRRP_ret ret
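The idea of scaling by 5 can be tried out on plain integers with the following Python sketch. It is a closely related formulation of the digit-by-digit method that uses only additions, subtractions and shifts by ten. It is not a line-for-line model of kkBCDSQR, which keeps its operands aligned in fixed registers and uses kkmSHRRP for the partial shift. The function name is an assumption.

```python
import math


def sqrt_by_subtraction(n, extra_digits):
    """Return floor(sqrt(n) * 10**extra_digits) for a non-negative integer n."""
    a = 5 * n                               # remainder, scaled by 5
    b = 5                                   # ten times the root so far, plus 5
    for step in range(extra_digits + 1):
        while a >= b:                       # each subtraction adds 1 to the current digit
            a -= b
            b += 10
        if step < extra_digits:
            a *= 100                        # bring down the next pair of digits
            b = 10 * b - 45                 # insert a zero before the trailing 5
    return b // 10                          # drop the trailing 5


print(sqrt_by_subtraction(2, 11))           # 141421356237
print(math.isqrt(2 * 10**22))               # same value, for comparison
```

Because the trailing 5 already accounts for the "2 * root * digit + digit squared" term of the hand method, no multiplication by 20 is ever needed.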
Transcendentals
The need for floating point can be somewhat mitigated by using, for instance, fixed-point notation (a variant of binary floating point), but when transcendental functions are needed it can be difficult to avoid. First of all we will consider several methods to implement some of the functions: angle functions, power and logarithm. With some basic identities, all other functions can be obtained from these.
The first function to consider is the sine. As everyone knows, there are a number of methods to calculate it: power series of several kinds (all related to the Taylor or Maclaurin series) and the über-cool CORDIC methods. Floating point coprocessors calculate them using power series, while pocket calculators (the HP and TI series, for example) use the CORDIC method, because they work with decimal numbers (BCD), not with binary numbers.
A Taylor series for the sine will be:
$$\sin(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \dots + E$$
The error term E is the difference between the real value (with an infinite number of terms) and the one calculated with a reasonable number of terms. If 16 exact digits are to be calculated, a 23-term series is needed. This means storing 23 constants and doing 22 additions and 24 multiplications. Using a nested form (without the error term), 4 terms need 5 multiplications and 3 additions/subtractions:
$$\sin(x) \approx x \left(1 - x^2 \left(\frac{1}{3!} - x^2 \left(\frac{1}{5!} - \frac{x^2}{7!}\right)\right)\right)$$
When fast multiplication is available, this method can be used. In real life some other considerations are taken into account, like angle reduction (using only 1 quadrant) and partial approximations using a table of pre-calculated constants.
The CORDIC method was developed, sadly, as an aid in missile guiding systems (read: to kill people). Despite that, some good came from it, as it is now in widespread use for, hopefully, more edifying purposes. It is based on an old method developed in the 17th century by the mathematician Briggs. This method is based on algebraic functions and can be thought of as inferring the result by working on the argument. There is no direct correlation between the argument and the result, because an intermediate set of results has to be calculated first and then used to calculate the final result. (See Jacques Laporte's site)
Let's see it with a sine case:
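The nested 4-term form can be checked quickly in Python. This is a minimal sketch in binary floating point, with no angle reduction, so it is only accurate for small arguments; the function name is an assumption.

```python
import math


def sin_taylor4(x):
    """4-term Taylor sine in nested form: 5 multiplications, 3 subtractions."""
    x2 = x * x                               # multiplication 1
    inner = 1 / 120 - x2 * (1 / 5040)        # multiplication 2 (1/5!, 1/7! are constants)
    middle = 1 / 6 - x2 * inner              # multiplication 3 (1/3! is a constant)
    return x * (1 - x2 * middle)             # multiplications 4 and 5


print(sin_taylor4(0.5), math.sin(0.5))
```

The first neglected term is x^9/9!, so the error grows quickly with x, which is why angle reduction to one quadrant (or less) matters in practice.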
Soon to come
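In the meantime, the flavour of the method can be shown with a minimal Python sketch of the textbook binary rotation-mode CORDIC. This is an illustration only: it is not the decimal (BCD) variant used by calculators and not code from this package, and the function name and the iteration count are assumptions.

```python
import math


def cordic_sin_cos(theta, iterations=32):
    """Rotation-mode CORDIC; returns (sin, cos) for |theta| up to about 1.74 rad."""
    # Table of elementary angles atan(2**-i); a real implementation stores these as constants.
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    # Each micro-rotation stretches the vector; pre-scale by the inverse of the total gain.
    gain = 1.0
    for i in range(iterations):
        gain /= math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = gain, 0.0, theta
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0          # rotate towards the remaining angle
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]                   # angle still to be rotated
    return y, x


s, c = cordic_sin_cos(0.5)
print(s, math.sin(0.5))
```

Inside the loop only shifts (the multiplications by 2**-i), additions and a table lookup are needed, which is what makes the method attractive on hardware without a fast multiplier.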
Note 1: These examples can be found in this file here, ready to be tested with pPropellerSim or, if you change the extension and add a wrapper, as a ready object for you to test on a Propeller. Note the license (LGPL v2)!
Note 2: The double precision package is here.
All this is Copyright me :-), Pacito.Sys in accordance with the Creative Commons Share-Alike 3.0 License.