Dhrystone

9/15/2004 By Bill Buzbee

First "MIPS" rating is in: I just got the old Dhrystone (1.1) benchmark running, and it reported Magic-1 at 178 dhrystones/second, which computes to 0.11 MIPS (normalized against a VAX 11/780's score of 1560). One oddity is that I got the same score whether I ran it with the fast or the slow clock. This doesn't really surprise me - my BIOS code is constantly flipping between fast and slow clock speeds, and I bet somewhere I'm not setting it back properly. So, I'm not sure whether that score corresponds to a 1.2 MHz clock or a 2.4 MHz clock (probably the latter). For what it's worth, here's Magic-1's score in context:

- Apple IIe, 1.02 MHz 65C02 - 37 dhrystones/second
- CP/M, 2.5 MHz Z-80 - 91 dhrystones/second
- Magic-1, 2.4 MHz - 178 dhrystones/second
- IBM PC/XT, 4.77 MHz 8088 - 259 dhrystones/second
- PDP-11/34A, Unix V7 - 406 dhrystones/second
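
For anyone who wants to redo the arithmetic, here is a minimal sketch of the normalization used above: divide the raw score by the VAX 11/780 baseline of 1560. The names `vax_mips` and `print_rating` are made up for illustration and are not part of the Dhrystone source.

```c
#include <stdio.h>

/* Baseline from the text: a VAX 11/780 score of 1560 dhrystones/second
   is treated as 1.0 MIPS. */
#define VAX_11_780_DHRYSTONES 1560.0

/* Convert a raw dhrystones/second score into VAX-relative MIPS. */
double vax_mips(double dhrystones_per_second)
{
    return dhrystones_per_second / VAX_11_780_DHRYSTONES;
}

/* Hypothetical helper: print one row of the comparison list. */
void print_rating(const char *machine, double dhrystones_per_second)
{
    printf("%-28s %6.0f dhrystones/s  %.3f MIPS\n",
           machine, dhrystones_per_second, vax_mips(dhrystones_per_second));
}
```

Calling `print_rating("Magic-1 2.4 MHz", 178.0)` should report roughly 0.11 MIPS, matching the figure quoted above. The other machines in the list can be run through the same function, with the caveat that their scores may not all come from the same Dhrystone version.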

Note that it's totally unfair to compare against these old machines for several reasons. First, although my C compiler isn't optimizing, it likely does better than the ones used to compile dhrystone for the early machines. Second, my SRAM is much faster (70ns) than what would have been commonly available on the Apple, CP/M and PC/XT machines in the early '80s. As a result, M-1's instructions don't have any need for wait states. And finally, I designed M-1's ISA with a C compiler in mind.

On the other hand, I haven't even begun serious compiler tuning. I'll bet I can at least double Magic-1's score by tweaking the compiler and pushing up the clock a bit.

I've left the dhrystone image in Magic-1's process table, so it can be run by telnetting to Magic-1, selecting the "x" (eXecute) command and then process table slot 0. It takes about half a minute to complete 5000 iterations.

10/23/2004

It turns out I overtuned a bit - my fast strcpy() was fast in large part because it was incorrect. The fix was simple, but dropped me back to 330 or so Dhrystones. I tried various speed-up tricks, but one of the weaknesses of 1-address machines is that memory copies can be very inefficient. For a standard byte copy, you need four registers: source address, target address, byte count and a temporary register to hold the data being copied.
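
To make the register pressure concrete, here is a rough C sketch of a plain byte copy. It is not Magic-1's library code; `byte_copy` is a hypothetical name. The point is only that four values are live inside the loop at once.

```c
#include <stddef.h>

/* Plain byte-at-a-time copy (illustrative, not Magic-1's memcpy).
   Four values are live in the loop: src, dst, count, and the byte
   in flight (tmp). */
void *byte_copy(void *target, const void *source, size_t count)
{
    unsigned char *dst = target;
    const unsigned char *src = source;

    while (count != 0) {
        unsigned char tmp = *src++;  /* load through the source address   */
        *dst++ = tmp;                /* store through the target address  */
        count--;                     /* one byte fewer left to move       */
    }
    return target;
}
```

On a machine with plenty of registers each of those four values gets its own home. With only a couple of general-purpose registers, at least one of them has to be spilled to memory on every pass through the loop.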

I considered this while designing Magic-1, and hand-coded a series of memcpy() routines to see how well it would perform with various register configurations. Generally, the code was poor - I just didn't have enough registers. I finally decided to add a special MEMCOPY opcode. This solved the problem because it allowed me to use internal temporary registers to get a very efficient memory move.

However, that instruction only works with counted strings (i.e., it moves a specific number of bytes). For C's strcpy() functionality, you are moving a null-terminated string of bytes. You don't know how long it is before you start - you just move bytes until you have moved a zero byte. Coding this in Magic-1 assembly, I just didn't have enough registers. Registers A and B can be used to load and store, but I'd need to also use either A or B as the intermediate register to hold the byte to move.
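
For comparison with the counted copy, here is a minimal C sketch of the null-terminated case. `string_copy` is a hypothetical stand-in for strcpy(), not the routine actually used on Magic-1. There is no count to decrement; instead the byte just moved has to be tested, so it must stay in a register after the store.

```c
/* Null-terminated copy (illustrative stand-in for strcpy()).
   The length is not known up front: the loop ends only after the
   zero byte itself has been copied. */
char *string_copy(char *target, const char *source)
{
    char *dst = target;
    char byte;

    do {
        byte = *source++;   /* load the next byte                     */
        *dst++ = byte;      /* store it, terminator included          */
    } while (byte != '\0'); /* stop once the zero byte has been moved */

    return target;
}
```

Three values are live here (source address, target address and the byte being moved), and the byte needs a register of its own for the zero test.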

To make this long story short, I decided to add a special STRCOPY instruction. It uses an internal temporary register for the data transfer, and cheaply does the zero test. Very fast. I also bumped the score up a bit by tuning strcmp().

Magic-1's real Dhrystone is now 384.

While I was messing with the microcode, I also eliminated a couple of redundant instructions - sh0add a,b,a and sh0add b,b,a (which are equivalent to sh0add a,a,b and sh0add b,a,b respectively). I had put them in earlier due to limitations in my assembler. Since then, I've had to add some awk and sed post-processing on the assembly output, and was able to use that to transform the "b,b,a" version to "b,a,b" and "a,b,a" to "a,a,b".
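
The real rewrite is done with awk and sed, which aren't shown here. Purely as an illustration of the operand swap, the same transformation could be sketched in C as below. The function name `normalize_sh0add` and the assumed line layout ("sh0add" followed by the three comma-separated operands) are my guesses, not the actual post-processing script.

```c
#include <string.h>

/* Rewrite the two retired operand orders to their surviving equivalents:
     sh0add a,b,a  ->  sh0add a,a,b
     sh0add b,b,a  ->  sh0add b,a,b
   The line layout is an assumption about the assembler text.
   Returns 1 if the line was rewritten in place, 0 otherwise. */
int normalize_sh0add(char *line)
{
    char *op = strstr(line, "sh0add");
    char *args;

    if (op == NULL)
        return 0;               /* not a sh0add line, leave it alone */

    args = strstr(op, "a,b,a");
    if (args != NULL) {
        memcpy(args, "a,a,b", 5);
        return 1;
    }

    args = strstr(op, "b,b,a");
    if (args != NULL) {
        memcpy(args, "b,a,b", 5);
        return 1;
    }

    return 0;                   /* already in a supported form */
}
```

Both replacements are the same length as the text they overwrite, so the line can be patched in place without reallocating.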

I don't expect to make many more changes to the instruction set. I have four nops, which gives me room for three new instructions. My plan has always been to add some special purpose Forth instructions, so I'll keep those free for that.