Optimized Object Routine Handling - RetroKoH/S1Fixed GitHub Wiki

(Credit: lavagaming1, RetroKoH)
Source: See commits
Commits: 964a637

Optimal Routine Handler (Two Routines)

As you may know if you are familiar with an object's Object Status Table, OST byte $24 is used as a routine counter. This routine counter is used at or near the start of an object's code as a form of program control, letting the game know how the object should be utilized at a given moment. Because this routine checking process runs once per frame, every frame, for every single object loaded in RAM that uses routines, having an inefficient process can eat tons of cycles. We can provide quite the optimization if we were to fix this up a bit. But how do we do that? Well, let's look at the GHZ Purple Rock as an example:

PurpleRock:
	moveq	#0,d0			; 4 (1/0) 2
	move.b	obRoutine(a0),d0	; 12 (3/0) 4
	move.w	Rock_Index(pc,d0.w),d1	; 14 (3/0) 4
	jmp	Rock_Index(pc,d1.w)	; 14 (2/0) 4
; ===========================================================================
Rock_Index:
	dc.w Rock_Main-Rock_Index
	dc.w Rock_Solid-Rock_Index
; ===========================================================================

This is how routine checking and redirecting works for objects in Sonic 1 (and in later games for that matter). Similar and/or identical methods are also used with obRout2nd and obSubtype in certain objects, and even mechanics like the Object Manager use this method when it comes to routine checking. This method takes 44 cycles and performs 9 reads every frame. Again, it will occur for EVERY instance of the object, on every frame!

How can we improve this? Well, do you see how this rock only has 2 routines? We could simply test the routine byte, and jump to Rock_Solid if the value is non-zero, falling through to Rock_Main otherwise. What does THAT look like? Let's see:

PurpleRock:
	tst.b	obRoutine(a0)	; 12 (3/0) 4
	bne.s	Rock_Solid	; 10 (2/0); 8 (1/0) 2

This simple check takes 12 cycles to test obRoutine (the same number of cycles that it would take to move the byte to d0). Following that, it performs a conditional branch: bne (Branch if Not Equal). If obRoutine is not equal to 0, it takes 10 cycles to branch to Rock_Solid. Otherwise, it takes 8 cycles to fallthrough to code below. The result is a routine handler that takes half of the required cycles. Notice however that it takes more cycles to branch than not. This means that it'd be ideal to branch as little as possible by branching to the routine that is used less frequently. In this case, it'd be the init routine Rock_Main. Let's do that:

PurpleRock:
	tst.b	obRoutine(a0)	; 12 (3/0) 4
!	bne.s	Rock_Main	; 10 (2/0); 8 (1/0) 2

Do note though, that if Rock_Main was supposed to fall through to Rock_Solid, We would need to add a branch at the end of Rock_Main to go to Rock_Solid immediately after. A branch would be 10 additional cycles, but considering it only costs 22 cycles to jump to Rock_Main anyway, it's ok. Also, if we tried employing this method, but needed to use a word-length branch like so:

PurpleRock:
	tst.b	obRoutine(a0)	; 12 (3/0) 4
!	bne.w	Rock_Main	; 10 (2/0); 12 (2/0) 4

Then it'd actually be better to conditionally branch to Rock_Solid instead of Rock_Main. It's a difference of only 2 cycles, but it adds up for every object, on every frame. Not a big deal, but bear this in mind.

Optimal Routine Handler (Three Routines)

This is great, but what about objects with more than 2 routines? As you'd imagine, the strategy needs to be a little different. Let's look at something with 3 routines:

BuzzBomber:
	moveq	#0,d0			; 4 (1/0) 2
	move.b	obRoutine(a0),d0	; 12 (3/0) 4
	move.w	Buzz_Index(pc,d0.w),d1	; 14 (3/0) 4
	jmp	Buzz_Index(pc,d1.w)	; 14 (2/0) 4
; ===========================================================================
Buzz_Index:
	dc.w Buzz_Main-Buzz_Index
	dc.w Buzz_Action-Buzz_Index
	dc.w Buzz_Delete-Buzz_Index
; ===========================================================================

This time, let's look at how frequently our routines are used BEFORE proceeding. It stands to reason that Buzz_Action is used more frequently than the init and destroy routines, so we should NOT branch to that routine, for the sake of efficiency. With three routines, however, we will need a second check. Let's try:

BuzzBomber:
	move.b	obRoutine(a0),d0	; 12 (3/0) 4
	cmpi.b	#2,d0			; 8 (2/0) 4
	beq.s	Buzz_Action		; 10 (2/0); 8 (1/0) 2
		
	tst.b	d0			; 4 (1/0) 2
	bne.w	DeleteObject		; 10 (2/0); 12 (2/0) 4

Now, you may notice two things. The first is that we replaced Buzz_Delete with DeleteObject. This is because Buzz_Delete was simply a branch to DeleteObject. The second thing is that we don't fallthrough straight to Buzz_Action. This is because it would actually take fewer cycles to branch over, than it would to simply fall through, resulting in 30 cycles to access the routine, still better than the 44 needed with the original method. But look at the tst below. Because it operates on d0, it only takes 4 cycles this time, but the branch seems more intensive. The same 10 cycles to branch off, but 12 cycles instead of 8 if we skip the branch, so it is actually better to branch, than not, this time! So, deleting the Buzz Bomber takes your expected 44 cycles (equal to the original method, but still better than before because we aren't jumping to another branch, instead jumping straight to DeleteObject. Initializing takes 46 cycles for a fallthrough, however, 2 cycles worse than before, but for a routine only used once.

Could we do better though? What if we used tst to not only check for non-zero, but also for positive/negative? This could be risky because if we ever changed the object during runtime into something that ran a routine table, it could cause errors or crash the game. Let's see what it looks like though:

BuzzBomber:
	tst.b	obRoutine(a0)	; 12 (3/0) 4
	bpl.s	Buzz_Action	; 10 (2/0); 8 (1/0) 2
	bmi.w	DeleteObject	; 10 (2/0); 12 (2/0) 4

The difference here is that obRoutine will be 00 for Buzz_Main, 02 for Buzz_Action, and a signed negative value like FE for DeleteObject. Cycle times for each are 32(6/0), 22(5/0), and 32(7/0), respectively. This is far more efficient!

Optimal Routine Handler (Four+ Routines)

Now we must tackle scenarios for objects with even more routines. We can't possibly optimize THOSE too, can we? Oh... but we can! First, however, let's look at an alternative method for optimizing the handling of three routines:

Harpoon:
	moveq	#0,d0			; 4(1/0) 2
	move.b	obRoutine(a0),d0	; 12(3/0) 4
	jmp	Harp_Index(pc,d0.w)	; 14(2/0) 4
; ===========================================================================
Harp_Index:
	bra.s	Harp_Main		; 10(2/0) 2
	bra.s	Harp_Move		; 10(2/0) 2
	bra.s	Harp_Wait		; 10(2/0) 2
; ===========================================================================

This isn't nearly as efficient as what we came up with earlier, which ranged from 22 to 32 cycles per routine call, but this is more consistent, sitting at 40 cycles per frame in all cases, still saving 4 cycles over the original handler. As you can see, this performs a jump to a branch instruction based on obRoutine. Because they are short branches, each one is only 2 bytes in length, therefore obRoutine can remain a multiple of 2. Let's look at this modified object though:

Harpoon:
	moveq	#0,d0			; 4(1/0) 2
	move.b	obRoutine(a0),d0	; 12(3/0) 4
	jmp	Harp_Index(pc,d0.w)	; 14(2/0) 4
; ===========================================================================
Harp_Index:
	bra.s	Harp_Main		; 10(2/0) 2
-	bra.w	Harp_Move		; 10(2/0) 4
-	bra.w	Harp_Wait		; 10(2/0) 4
; ===========================================================================

Let's say, for some reason, we added a ton of code to Harp_Main, and the other two routines were further away as a result, requiring word-length branches instead of short branches. They cost the same number of cycles, but are now 4 bytes in length, instead of 2. If ANY of your routine branches are word-length branches instead of short branches, you MUST use multiples of 4 for obRoutine instead of multiples of 2, or you will deal with nasty errors. You could add d0 to d0 at the top, but that will add 4 more cycles, negating any and all optimizations we were hoping for in the first place. Let's see this with an example object with 5 routines:

Cookie:
	moveq	#0,d0			; 4(1/0) 2
	move.b	obRoutine(a0),d0	; 12(3/0) 4
	jmp	Cookie_Index(pc,d0.w)	; 14(2/0) 4
; ===========================================================================
Cookie_Index:
	bra.w	Cookie_Main		; 10(2/0) 4
	bra.w	Cookie_Rout1		; 10(2/0) 4
	bra.w	Cookie_Rout1		; 10(2/0) 4
	bra.w	Cookie_Rout1		; 10(2/0) 4
	bra.w	Cookie_Rout1		; 10(2/0) 4
; ===========================================================================

We likely don't need a word-length branch for the very first routine branch, but it doesn't cost anything to make it such, so this allows routine handling to be ever so slightly more efficient for complex objects as well.

You'll see by now that there is a rarely a one-size fits all answer when it comes to making optimizations. Hopefully, after following this guide, your objects will be a bit more efficient when running.