Optimizing assembly code - pret/pokecrystal GitHub Wiki

Sometimes the simplest way to write something in assembly code isn't the best. All of your resources are limited: CPU speed, ROM size, RAM space, register use. You can rewrite code to use those resources more efficiently (sometimes by trading one for another).

Most of these tricks come from Jeff's GB Assembly Code Tips v1.0, WikiTI's Z80 Optimization page, z80 Heaven's optimization tutorial, and GBDev Wiki's ASM Snippets. (Note that the Game Boy CPU's assembly is called SM83, or colloquially GBZ80. It is not the same as Z80 assembly; the Z80 CPU has more registers and some different instructions.)

WikiTI's advice fully applies here:

Note that the following tricks act much like a peephole optimizer and are the last optimization step: remember to first optimize your algorithm and register allocation before applying any of the following if you really want the fastest speed and the smallest code.

Also note that nearly every trick turns the code less understandable and documenting them is a good idea. You can easily forget after a while without reading parts of the code.

Be warned that some tricks are not exactly equivalent to the normal way and may have exceptions on their use; comments warn about them. Some tricks apply to other cases, but again you have to be careful.

There are some tricks that are nothing more than the correct use of the available instructions on the Z80. Keeping an instruction set summary helps to visualize what you can do during coding.

(There's also a "cheat sheet" table of instructions summarizing their bytes, cycles, and affected flags, if you don't need a long listing of what each one does.)

8-bit registers
16-bit registers
Branching (control flow)
Subroutines (functions)
Jump and lookup tables
- Chain comparisons
- Off-by-one AddNTimes

8-bit registers

Set `a` to 0

Don't do this:

	ld a, 0 ; 2 bytes, 2 cycles; no changes to flags

Instead, do this:

	xor a ; 1 byte, 1 cycle, sets flags C to 0 and Z to 1

Or do this:

	sub a ; 1 byte, 1 cycle, sets flags C to 0 and Z to 1

Don't use the optimized versions if you need to preserve flags. As such, ld a, 0 must be left intact in the code below:

	ld a, [wIsTrainerBattle]
	and a   ; sets flag Z to 1 if [wIsTrainerBattle] == 0 or else to 0
	ld a, 0 ; sets a to 0 without affecting flags
	jr nz, .is_trainer_battle
	... ; is not trainer battle

Increment or decrement `a`

When possible, avoid doing this:

	add 1 ; 2 bytes, 2 cycles; sets carry for -1 to 0 overflow

	sub 1 ; 2 bytes, 2 cycles; sets carry for 0 to -1 underflow

If you don't need to set the carry flag, then do this:

	inc a ; 1 byte, 1 cycle

	dec a ; 1 byte, 1 cycle

Multiply `a` by 2

Don't do this:

	sla a ; 2 bytes, 2 cycles

Instead, do this:

	add a ; 1 byte, 1 cycle

Invert the bits of `a`

Don't do this:

	xor $ff ; 2 bytes, 2 cycles

Instead, do this:

	cpl ; 1 byte, 1 cycle

Rotate the bits of `a`

Rotate left through carry

Don't do this:

	rl a ; 2 bytes, 2 cycles; updates Z and C flags

Instead, do this:

	adc a ; 1 byte, 1 cycle; updates Z and C flags

Or, if you don't care about Z, do this:

	rla ; 1 byte, 1 cycle; updates C flag, clears Z flag

For the left shift, see multiply a by 2.

Other rotations

Don't do this:

	rlc a ; 2 bytes, 2 cycles; updates Z and C flags

	rr a ; 2 bytes, 2 cycles; updates Z and C flags

	rrc a ; 2 bytes, 2 cycles; updates Z and C flags

Instead, do this:

	rlca ; 1 byte, 1 cycle; updates C flag, clears Z flag

	rra ; 1 byte, 1 cycle; updates C flag, clears Z flag

	rrca ; 1 byte, 1 cycle; updates C flag, clears Z flag

The exception is if you need the zero flag set when the result in a is 0: the two-byte operations update Z according to the result, while the one-byte operations always clear it.

Load from HRAM to `a` or from `a` to HRAM

Don't do this:

	ld a, [hFoobar] ; 3 bytes, 4 cycles

	ld [hFoobar], a ; 3 bytes, 4 cycles

Instead, do this:

	ldh a, [hFoobar] ; 2 bytes, 3 cycles

	ldh [hFoobar], a ; 2 bytes, 3 cycles

("What's foobar?")

Set `a` to some constant minus `a`

Don't do this:

	; 4 bytes, 4 cycles
	ld b, a
	ld a, FOOBAR
	sub b

Instead, do this:

	; 3 bytes, 3 cycles
	cpl
	add FOOBAR + 1

Or if the constant is zero (i.e. FOOBAR == 0 and FOOBAR - a == -a), then do this:

	; 2 bytes, 2 cycles
	cpl
	inc a

Or if the constant is $FF (aka −1) (i.e. FOOBAR == $FF and FOOBAR - a == ~a), then do this:

	; 1 byte, 1 cycle
	cpl
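The identity behind these three variants is that cpl computes $ff - a, so adding FOOBAR + 1 wraps around to FOOBAR - a modulo 256. A minimal Python sketch can check that exhaustively; the function names are illustrative only, and it models the value of a, not the flags:

```python
def const_minus_a_naive(a, foobar):
    # ld b, a / ld a, FOOBAR / sub b
    return (foobar - a) & 0xFF

def const_minus_a_cpl(a, foobar):
    a ^= 0xFF                       # cpl: a = $ff - a
    return (a + foobar + 1) & 0xFF  # add FOOBAR + 1, with 8-bit wraparound

# Collect every (a, FOOBAR) pair where the two versions disagree.
mismatches = [
    (a, k)
    for a in range(256)
    for k in range(256)
    if const_minus_a_naive(a, k) != const_minus_a_cpl(a, k)
]
```

If the list stays empty, the two sequences leave the same value in a for every input.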

Set `a` to one constant or another depending on the carry flag

(The example sets a to CVAL if the carry flag is set (c), or to NCVAL if the carry flag is not set (nc).)

Don't do this:

	; 6 bytes, 5 or 6 cycles
	ld a, CVAL
	jr c, .carry
	ld a, NCVAL
.carry

And don't do this:

	; 6 bytes, 5 or 6 cycles
	ld a, NCVAL
	jr nc, .no_carry
	ld a, CVAL
.no_carry

And if either is 0, don't do this:

	; 5 bytes, 5 cycles
	ld a, CVAL   ; nor NCVAL
	jr c, .carry ; nor jr nc
	xor a
.carry

And if either is 1 more or less than the other, don't do this:

	; 5 bytes, 5 cycles
	ld a, CVAL   ; nor NCVAL
	jr c, .carry ; nor jr nc
	inc a        ; nor dec a
.carry

Instead use sbc a, which copies the carry flag to all bits of a. So do this:

	; 5 bytes, 5 cycles
	sbc a            ; if carry, then $ff, else 0
	and CVAL - NCVAL ; $ff becomes CVAL - NCVAL, 0 stays 0
	add NCVAL        ; CVAL - NCVAL becomes CVAL, 0 becomes NCVAL

Or do this:

	; 5 bytes, 5 cycles
	sbc a            ; if carry, then $ff, else 0
	and CVAL ^ NCVAL ; $ff becomes CVAL ^ NCVAL, 0 stays 0
	xor NCVAL        ; CVAL ^ NCVAL becomes CVAL, 0 becomes NCVAL

And if certain conditions apply, then do something more efficient:

If `CVAL` == $FF (aka −1) and `NCVAL` == 0, then do this:

	; 1 byte, 1 cycle
	sbc a ; if carry, then $ff, else 0

If `CVAL` == 0 and `NCVAL` == $FF (aka −1), then do this:

	; 2 bytes, 2 cycles
	ccf   ; invert carry flag
	sbc a ; if originally carry, then 0, else $ff

If `CVAL` == 0 and `NCVAL` == 1, then do this:

	; 2 bytes, 2 cycles
	sbc a ; if carry, then $ff aka -1, else 0
	inc a ; -1 becomes 0, 0 becomes 1

If `CVAL` == $FF (aka −1), then do this:

	; 3 bytes, 3 cycles
	sbc a    ; if carry, then $ff, else 0
	or NCVAL ; $ff stays $ff, $00 becomes NCVAL

If `NCVAL` == 0, then do this:

	; 3 bytes, 3 cycles
	sbc a    ; if carry, then $ff, else 0
	and CVAL ; $ff becomes CVAL, 0 stays 0

If `CVAL` == `NCVAL - 1`, aka `CVAL + 1` == `NCVAL`, then do this:

	; 3 bytes, 3 cycles
	sbc a     ; if carry, then $ff aka -1, else 0
	add NCVAL ; -1 becomes NCVAL - 1 aka CVAL, 0 becomes NCVAL

If `CVAL` == `NCVAL - 2`, aka `CVAL + 2` == `NCVAL`, then do this:

	; 3 bytes, 3 cycles
	sbc a      ; if carry, then $ff aka -1, else 0; doesn't change carry
	sbc -NCVAL ; -1 becomes NCVAL - 2 aka CVAL, 0 becomes NCVAL

If `CVAL` == `NCVAL + 1`, aka `CVAL - 1` == `NCVAL`, and `CVAL` is odd, so `NCVAL` is even, then do this:

	; 3 bytes, 3 cycles
	ld a, NCVAL / 2
	adc a ; a = a * 2 + carry

If `CVAL` == 0, then do this:

	; 4 bytes, 4 cycles
	ccf       ; invert carry flag
	sbc a     ; if originally carry, then 0, else $ff
	and NCVAL ; 0 stays 0, $ff becomes NCVAL

If `NCVAL` == $FF (aka −1), then do this:

	; 4 bytes, 4 cycles
	ccf     ; invert carry flag
	sbc a   ; if originally carry, then 0, else $ff
	or CVAL ; $00 becomes CVAL, $ff stays $ff

If `CVAL` == `NCVAL + 1`, aka `CVAL - 1` == `NCVAL`, and `CVAL` is even, so `NCVAL` is odd, then do this:

	; 4 bytes, 4 cycles
	ccf      ; invert carry flag
	sbc a    ; if originally carry, then 0, else $ff aka -1
	add CVAL ; -1 becomes CVAL - 1 aka NCVAL, 0 becomes CVAL

If `CVAL` == `NCVAL + 2`, aka `CVAL - 2` == `NCVAL`, then do this:

	; 4 bytes, 4 cycles
	ccf       ; invert carry flag
	sbc a     ; if carry, then 0, else $ff aka -1; doesn't change carry
	sbc -CVAL ; -1 becomes CVAL - 2 aka NCVAL, 0 becomes CVAL
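All of these variants rest on sbc a turning the carry flag into $ff or 0. As a sanity check, here is a rough Python model of the two general five-byte forms (sbc a / and / add, and sbc a / and / xor). The function name and its use_xor parameter are made up for this sketch, and only the final value of a is modelled:

```python
def select_by_carry(carry, cval, ncval, use_xor=False):
    a = 0xFF if carry else 0x00      # sbc a
    if use_xor:
        a &= cval ^ ncval            # and CVAL ^ NCVAL
        return a ^ ncval             # xor NCVAL
    a &= (cval - ncval) & 0xFF       # and CVAL - NCVAL
    return (a + ncval) & 0xFF        # add NCVAL

# Try both forms with every pair of constants and both carry states.
ok = all(
    select_by_carry(c, cv, nv, x) == (cv if c else nv)
    for c in (0, 1)
    for cv in range(256)
    for nv in range(256)
    for x in (False, True)
)
```

The special cases above drop whichever of the and or add/xor steps becomes a no-op for the given constants.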

Increment or decrement `a` when the carry flag is set

Don't do this:

	; 3 bytes, 3 cycles
	jr nc, .ok
	inc a
.ok

	; 3 bytes, 3 cycles
	jr nc, .ok
	dec a
.ok

Instead, do this:

	adc 0 ; 2 bytes, 2 cycles

	sbc 0 ; 2 bytes, 2 cycles

Increment or decrement `a` when the carry flag is not set

Don't do this:

	; 3 bytes, 3 cycles
	jr c, .ok
	inc a
.ok

	; 3 bytes, 3 cycles
	jr c, .ok
	dec a
.ok

Instead, do this:

	sbc -1 ; 2 bytes, 2 cycles

	adc -1 ; 2 bytes, 2 cycles
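These work because -1 is $ff as an 8-bit operand, so a set carry cancels it out exactly. A small Python sketch of the arithmetic follows; the helper names sbc and adc are stand-ins for the instructions and only the resulting value of a is modelled:

```python
def sbc(a, operand, carry):
    # sbc operand: a = a - operand - carry (mod 256)
    return (a - operand - carry) & 0xFF

def adc(a, operand, carry):
    # adc operand: a = a + operand + carry (mod 256)
    return (a + operand + carry) & 0xFF

MINUS_ONE = 0xFF  # -1 as an 8-bit immediate

for a in range(256):
    # Carry clear: both forms change a by one.
    assert sbc(a, MINUS_ONE, 0) == (a + 1) & 0xFF
    assert adc(a, MINUS_ONE, 0) == (a - 1) & 0xFF
    # Carry set: a is left unchanged.
    assert sbc(a, MINUS_ONE, 1) == a
    assert adc(a, MINUS_ONE, 1) == a
```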

Toggle `a` between two different constants

Don't do this:

	; 12 bytes, 9 or 10 cycles
	cp FOO
	jr z, .foo_to_bar
	jr .bar_to_foo
.foo_to_bar
	ld a, BAR
	jr .done
.bar_to_foo
	ld a, FOO
.done
	...

And don't do this:

	; 10 bytes, 7 or 9 cycles
	cp FOO
	jr z, .foo_to_bar ; nor jr nz, .bar_to_foo
	ld a, FOO         ; nor ld a, BAR
	jr .done
.foo_to_bar               ; nor .bar_to_foo
	ld a, BAR         ; nor ld a, FOO
.done
	...

(That would be applying the "Conditional fallthrough" optimization to the first way.)

Instead, do this:

	xor FOO ^ BAR ; 2 bytes, 2 cycles

(This works for the same reason as the XOR swap algorithm for swapping the values of two variables.)
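To see why a single xor is enough: FOO ^ (FOO ^ BAR) is BAR, and BAR ^ (FOO ^ BAR) is FOO. A minimal Python sketch, using arbitrary example values for the two constants:

```python
def toggle(a, foo, bar):
    return a ^ (foo ^ bar)  # xor FOO ^ BAR

FOO, BAR = 0x3C, 0xA5  # arbitrary example constants

toggled_from_foo = toggle(FOO, FOO, BAR)
toggled_from_bar = toggle(BAR, FOO, BAR)
stray = toggle(0x00, FOO, BAR)  # a held neither constant
```

One difference worth keeping in mind: the branching versions turn any value other than FOO into FOO, whereas the xor only acts as a toggle when a already holds one of the two constants. Here stray ends up as FOO ^ BAR, which is neither of them.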

Divide `a` by 8 (shift `a` right 3 bits)

Don't do this:

	; 6 bytes, 9 cycles
	; (15 bytes, at least 21 cycles, counting the definition of SimpleDivide)
	ld c, 8 ; divisor
	call SimpleDivide
	ld a, b ; quotient

And don't do this:

	; 6 bytes, 6 cycles
	srl a
	srl a
	srl a

Instead, do this:

	; 5 bytes, 5 cycles
	rrca
	rrca
	rrca
	and %00011111
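The rotations move the low three bits around to the top of a, and the mask then throws them away, which is what a logical shift would have done. A short Python check of that claim, with rrca modelled as an 8-bit circular rotate (the helper names are illustrative):

```python
def rrca(a):
    # Rotate right circular: bit 0 wraps around to bit 7.
    return ((a >> 1) | (a << 7)) & 0xFF

def divide_by_8(a):
    for _ in range(3):
        a = rrca(a)        # rrca / rrca / rrca
    return a & 0b00011111  # and %00011111 discards the wrapped-around bits

all_match = all(divide_by_8(a) == a >> 3 for a in range(256))
```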

Divide `a` by 16 (shift `a` right 4 bits)

Don't do this:

	; 6 bytes, 9 cycles
	; (15 bytes, at least 21 cycles, counting the definition of SimpleDivide)
	ld c, 16 ; divisor
	call SimpleDivide
	ld a, b ; quotient

And don't do this:

	; 8 bytes, 8 cycles
	srl a
	srl a
	srl a
	srl a

Instead, do this:

	; 4 bytes, 4 cycles
	swap a
	and $f

Set `a` to some value plus or minus carry

(The example uses b and c, but any registers besides a would also work, including [hl].)

Don't do this:

	; 4 bytes, 4 cycles
	ld b, a
	ld a, c
	adc 0

	; 4 bytes, 4 cycles
	ld b, a
	ld a, c
	sbc 0

And don't do this:

	; 4 bytes, 4 cycles
	ld b, a
	ld a, 0
	adc c

	; 4 bytes, 4 cycles
	ld b, a
	ld a, 0
	sbc c

Instead, do this:

	; 3 bytes, 3 cycles
	ld b, a
	adc c
	sub b

	; 3 bytes, 3 cycles
	ld b, a
	sbc b
	add c

Also, don't do this:

	; 5 bytes, 5 cycles
	ld b, a
	ld a, N
	adc 0

	; 5 bytes, 5 cycles
	ld b, a
	ld a, N
	sbc 0

And don't do this:

	; 5 bytes, 5 cycles
	ld b, a
	ld a, 0
	adc N

	; 5 bytes, 5 cycles
	ld b, a
	ld a, 0
	sbc N

Instead, do this:

	; 4 bytes, 4 cycles
	ld b, a
	adc N
	sub b

	; 4 bytes, 4 cycles
	ld b, a
	sbc b
	add N

(If the original value of a was not backed up in b, this optimization would not apply.)

Add or subtract the carry flag from a register besides `a`

(The example uses b, but any of c, d, e, h, l, or [hl] would also work.)

Don't do this:

	; 4 bytes, 4 cycles
	ld a, b
	adc 0
	ld b, a

	; 4 bytes, 4 cycles
	ld a, b
	sbc 0
	ld b, a

And don't do this:

	; 4 bytes, 4 cycles
	ld a, 0
	adc b
	ld b, a

	; 4 bytes, 4 cycles
	ld a, 0
	sbc b
	ld b, a

Instead, do this:

	; 3 bytes, 3 cycles
	jr nc, .no_carry
	inc b
.no_carry

	; 3 bytes, 3 cycles
	jr nc, .no_carry
	dec b
.no_carry

Reverse the bits of `a`

(This optimization is based on Retro Programming).

(The example uses b, but any of c, d, e, h, l, or [hl] would also work.)

Don't do this:

	; 25 bytes, 25 cycles
rept 8
	rra  ; nor rla
	rl b ; nor rr b
endr
	ld a, b

And don't do this:

	; 17 bytes, 17 cycles
	ld b, a
	rlca
	rlca
	xor b
	and $aa
	xor b
	ld b, a
	rlca
	rlca
	rlca
	rrc b
	xor b
	and $66
	xor b

(That would be applying the exact Z80 optimization from Retro Programming, without any SM83-specific operations.)

Instead, do this:

	; 15 bytes, 15 cycles
	ld b, a
	rlca
	rlca
	xor b
	and $aa
	xor b
	ld b, a
	swap b
	xor b
	and $33
	xor b
	rrca

Or if you really want to optimize for size over speed, then don't do this:

	; 10 bytes, 59 cycles
	ld bc, 8  ; lb bc, 0, 8
.loop
	rra  ; nor rla
	rl b ; nor rr b
	dec c
	jr nz, .loop
	ld a, b

Instead, do this:

	; 8 bytes, 50 cycles
	ld b, 1
.loop
	rra
	rl b
	jr nc, .loop
	ld a, b

Or if you can spare hl, then do this:

	; 7 bytes, 50 cycles
	ld h, a
	ld a, $80
.loop
	add hl, hl
	rra
	jr nc, .loop

Or if you really want to optimize for speed over total size, then do this:

	; 6 bytes, 12 cycles
	; (4 bytes, 5 cycles if you don't need the push hl/pop hl)
	push hl
	ld h, HIGH(ReversedBitTable)
	ld l, a
	ld a, [hl]
	pop hl

	; 256 bytes; placed in ROM0 or the same ROMX section as the bit reversal
SECTION "ReversedBitTable", ROM0, ALIGN[8]
ReversedBitTable::
for x, 256
	; https://graphics.stanford.edu/~seander/bithacks.html#ReverseByteWith32Bits
	db LOW(((((x * $802) & $22110) | ((x * $8020) & $88440)) * $10101) >> 16)
endr
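Neither the 15-byte rotate-and-mask sequence nor the table expression is obviously correct at a glance, so it can be reassuring to test them. The following Python sketch models both and compares them with a string-based reference; the helper names are invented for this example, and each instruction is modelled only by its effect on the register values:

```python
def rlca(a):
    return ((a << 1) | (a >> 7)) & 0xFF

def rrca(a):
    return ((a >> 1) | (a << 7)) & 0xFF

def swap(a):
    return ((a << 4) | (a >> 4)) & 0xFF

def reverse_bits_sm83(a):
    b = a                      # ld b, a
    a = rlca(rlca(a))          # rlca / rlca
    a = ((a ^ b) & 0xAA) ^ b   # xor b / and $aa / xor b
    b = swap(a)                # ld b, a / swap b
    a = ((a ^ b) & 0x33) ^ b   # xor b / and $33 / xor b
    return rrca(a)             # rrca

def reverse_bits_table_entry(x):
    # Same expression as the db line in ReversedBitTable.
    product = (((x * 0x802) & 0x22110) | ((x * 0x8020) & 0x88440)) * 0x10101
    return (product >> 16) & 0xFF  # LOW(... >> 16)

def reverse_bits_reference(x):
    return int(f"{x:08b}"[::-1], 2)

sequence_ok = all(reverse_bits_sm83(x) == reverse_bits_reference(x) for x in range(256))
table_ok = all(reverse_bits_table_entry(x) == reverse_bits_reference(x) for x in range(256))
```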

Count the set bits of a register besides `a`

(This optimization is based on WikiTI).

(The examples count the set bits of c and may also use b, but any registers besides a would also work.)

Don't do this:

	; 26 bytes, 26 cycles
	xor a
	ld b, a
rept 8
	rrc c
	adc b
endr

Instead, do this:

	; 20 bytes, 20 cycles
	ld a, c
	and $aa
	cpl
	rrca
	adc c
	ld b, a
	and $33
	ld c, a
	xor b
	rrca
	rrca
	add c
	ld c, a
	swap a
	add c
	and $0f

Or if you want to optimize for size over speed, then don't do this:

	; 12 bytes, 68 cycles; counts bits in c, uses b
	ld a, c
	ld bc, $800  ; lb bc, 8, 0
.loop
	add a
	jr nc, .next
	inc c
.next
	dec b
	jr nz, .loop
	ld a, c

And don't do this:

	; 11 bytes, up to 67 cycles; counts bits in c
	ld a, c
	ld c, 0
.loop
	add a
	jr nc, .next
	inc c
.next
	and a
	jr nz, .loop
	ld a, c

And don't do this:

	; 9 bytes, up to 63 cycles; counts bits in c
	ld a, -1
.next
	inc a
.loop
	srl c
	jr c, .next
	jr nz, .loop

But do this:

	; 7 bytes, up to 49 cycles; counts bits in c
	ld a, c
	add c
.loop
	sub c
	srl c
	jr nz, .loop
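The seven-byte loop relies on the identity popcount(c) = c - (c >> 1) - (c >> 2) - ... until the shifted value reaches zero; starting a at c + c lets the first sub c happen inside the loop. A rough Python model, assuming 8-bit wraparound and a made-up function name:

```python
def popcount_loop(c):
    a = c                       # ld a, c
    a = (a + c) & 0xFF          # add c
    while True:
        a = (a - c) & 0xFF      # sub c
        c >>= 1                 # srl c (sets Z once c reaches 0)
        if c == 0:              # jr nz, .loop
            break
    return a

all_match = all(popcount_loop(c) == bin(c).count("1") for c in range(256))
```

The intermediate values may wrap past $ff (for example when c is $ff), but the final result is still correct modulo 256.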

Merge some bits of a register with `a`

(This optimization is based on Bit Twiddling Hacks).

(The example uses b, but any of c, d, e, h, l, or [hl] would also work.)

Don't do this:

	; 7 bytes, 7 cycles; sets a = (a & MASK) | (b & ~MASK)
	and MASK
	ld c, a
	ld a, b
	and ~MASK ; or $ff ^ MASK, or $ff - MASK
	or c

Instead, do this:

	; 4 bytes, 4 cycles; no third register
	xor b
	and MASK
	xor b

(For example, if MASK were $f0, then ~MASK would be $0f, and this would merge the high nybble of a with the low nybble of b.)

Or if you can spare hl, then don't do this:

	; 9 bytes, 10 cycles; sets a = (a & MASK) | ([wFoobar] & ~MASK)
	and MASK
	ld c, a
	ld a, [wFoobar]
	and ~MASK
	or c

Instead, do this:

	; 7 bytes, 9 cycles; uses hl
	ld hl, wFoobar
	xor [hl]
	and MASK
	xor [hl]
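The xor / and / xor pattern works because the bits outside MASK are cleared after the first xor and then restored to the other operand's bits by the second one. A small Python sketch to check it against the straightforward formula, for a handful of example masks (the names here are illustrative):

```python
def merge_bits(a, b, mask):
    a ^= b      # xor b   (or xor [hl])
    a &= mask   # and MASK
    a ^= b      # xor b   (or xor [hl])
    return a

def merge_bits_reference(a, b, mask):
    return (a & mask) | (b & ~mask & 0xFF)

EXAMPLE_MASKS = (0x00, 0x0F, 0xF0, 0xAA, 0x3C, 0xFF)

all_match = all(
    merge_bits(a, b, m) == merge_bits_reference(a, b, m)
    for m in EXAMPLE_MASKS
    for a in range(256)
    for b in range(256)
)
```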

16-bit registers

Multiply `hl` by 2

Don't do this:

	; 4 bytes, 4 cycles
	sla l
	rl h

Instead, do this:

	add hl, hl ; 1 byte, 2 cycles

Add `a` to a 16-bit register

(The example uses hl, but bc or de would also work.)

Don't do this:

	; 6 bytes, 6 cycles
	add l
	ld l, a
	ld a, 0
	adc h
	ld h, a

And don't do this:

	; 6 bytes, 6 cycles
	add l
	ld l, a
	ld a, h
	adc 0
	ld h, a

And don't do this:

	; 5 bytes, 5 cycles
	add l
	ld l, a
	jr nc, .no_carry
	inc h
.no_carry

Instead, do this:

	; 5 bytes, 5 cycles; no labels
	add l
	ld l, a
	adc h
	sub l
	ld h, a

Or if you can spare another 16-bit register and want to optimize for size over speed, then do this:

	; 4 bytes, 5 cycles
	ld d, 0
	ld e, a
	add hl, de
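The label-free five-byte version above may look odd: after ld l, a, the accumulator equals the new l, so adc h followed by sub l leaves h plus the carry from the low-byte addition. A Python sketch of that reasoning, with an invented function name and only register values modelled:

```python
def add_a_to_hl(a, h, l):
    total = a + l
    a, carry = total & 0xFF, total >> 8   # add l
    l = a                                 # ld l, a
    a = (a + h + carry) & 0xFF            # adc h
    a = (a - l) & 0xFF                    # sub l
    return a, l                           # ld h, a

def reference(a, h, l):
    total = (((h << 8) | l) + a) & 0xFFFF
    return total >> 8, total & 0xFF

all_match = all(
    add_a_to_hl(a, h, l) == reference(a, h, l)
    for a in range(256)
    for h in (0x00, 0x7F, 0xFF)
    for l in range(256)
)
```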

Subtract a constant from a 16-bit register

(The example uses hl, but bc or de would also work.)

Do this:

	; 8 bytes, 8 cycles
	ld a, l
	sub LOW(FooBar)
	ld l, a
	ld a, h
	sbc HIGH(FooBar)
	ld h, a

Or if the constant is 8-bit (i.e. HIGH(FooBar) == 0), then do this:

	; 7 bytes, 7 cycles
	ld a, l
	sub FooBar
	ld l, a
	jr nc, .no_carry
	dec h
.no_carry

(This is a case of "Add or subtract the carry flag from a register besides a", applied to the high part of a 16-bit register.)

Or if you can spare another 16-bit register, do this:

	; 4 bytes, 5 cycles
	ld de, -FooBar
	add hl, de

Set a 16-bit register to `a` plus a constant

(The example uses hl, but bc or de would also work.)

Don't do this:

	; 7 bytes, 8 cycles; uses another 16-bit register
	ld e, a
	ld d, 0
	ld hl, FooBar
	add hl, de

And don't do this:

	; 8 bytes, 8 cycles
	ld hl, FooBar
	add l
	ld l, a
	adc h
	sub l
	ld h, a

And don't do this:

	; 8 bytes, 8 cycles
	ld h, HIGH(FooBar)
	add LOW(FooBar)
	ld l, a
	jr nc, .no_carry
	inc h
.no_carry

Instead, do this:

	; 7 bytes, 7 cycles
	add LOW(FooBar)
	ld l, a
	adc HIGH(FooBar)
	sub l
	ld h, a

Or if the constant is 8-bit and nonzero (i.e. 0 < FooBar < 256), then do this:

	; 6 bytes, 6 cycles
	sub LOW(-FooBar)
	ld l, a
	sbc a
	inc a
	ld h, a

Or if the constant is zero (i.e. FooBar == 0 and a + FooBar == a), then do this:

	; 3 bytes, 3 cycles
	ld l, a
	ld h, 0
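The six-byte version for an 8-bit constant subtracts LOW(-FooBar) instead of adding FooBar, so that the borrow is set exactly when the sum does not overflow; sbc a then gives $ff or 0, and inc a turns that into a high byte of 0 or 1. A Python sketch of this, with a hypothetical function name:

```python
def a_plus_const8(a, foobar):
    if not 0 < foobar < 256:
        raise ValueError("the trick needs 0 < FooBar < 256")
    operand = (-foobar) & 0xFF          # LOW(-FooBar)
    borrow = 1 if a < operand else 0    # sub sets carry on borrow
    a = (a - operand) & 0xFF            # sub LOW(-FooBar)
    l = a                               # ld l, a
    a = 0xFF if borrow else 0x00        # sbc a
    a = (a + 1) & 0xFF                  # inc a
    h = a                               # ld h, a
    return (h << 8) | l

all_match = all(
    a_plus_const8(a, k) == a + k
    for a in range(256)
    for k in range(1, 256)
)
```

With FooBar == 0 the operand would be 0 and the borrow could never be set, leaving h at 1; that is why the zero case is handled separately.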

Set a 16-bit register to `a` multiplied by 16

(The example uses hl, but bc or de would also work.)

You can do this:

	; 7 bytes, 11 cycles
	ld l, a
	ld h, 0
	add hl, hl
	add hl, hl
	add hl, hl
	add hl, hl

	; 7 bytes, 11 cycles
	ld l, a
	ld h, 0
rept 4
	add hl, hl
endr

But if a is definitely small enough, and its value can be changed, then do one of these:

	; 7 bytes, 10 cycles; sets a = a * 2; requires a < $80
	add a
	ld l, a
	ld h, 0
	add hl, hl
	add hl, hl
	add hl, hl

	; 7 bytes, 9 cycles; sets a = a * 4; requires a < $40
	add a
	add a
	ld l, a
	ld h, 0
	add hl, hl
	add hl, hl

	; 7 bytes, 8 cycles; sets a = a * 8; requires a < $20
	add a
	add a
	add a
	ld l, a
	ld h, 0
	add hl, hl

	; 5 bytes, 5 cycles; sets a = a * 16; requires a < $10
	swap a
	ld l, a
	ld h, 0

Or if the value of a can be changed and you want to optimize for speed over size, then do one of these:

	; 8 bytes, 8 cycles; sets a = l
	swap a
	ld l, a
	and $f
	ld h, a
	xor l
	ld l, a

	; 8 bytes, 8 cycles; sets a = h
	swap a
	ld h, a
	and $f0
	ld l, a
	xor h
	ld h, a

Sign-extend `a` into a 16-bit register

(This optimization is based on Plutiedev and GBDev Wiki's ASM Snippets.)

(The example uses hl, but bc or de would also work.)

Don't do this:

	; 10 bytes, 9 or 10 cycles
	ld l, a
	cp $80    ; nor bit 7, a
	ld a, $00
	jr c, .ok ; nor jr z, .ok
	ld a, $ff
.ok
	ld h, a

And don't do these:

	; 9 bytes, 9 cycles
	ld l, a
	cp $80    ; nor bit 7, a
	ld a, $00
	jr c, .ok ; nor jr z, .ok
	dec a
.ok
	ld h, a

	; 9 bytes, 9 cycles
	ld l, a
	cp $80     ; nor bit 7, a
	ld a, $ff
	jr nc, .ok ; nor jr nz, .ok
	inc a
.ok
	ld h, a

And don't do these:

	; 9 bytes, 8 or 9 cycles
	ld l, a
	rlca ; nor add a
	ld a, $00
	jr nc, .ok
	ld a, $ff
.ok
	ld h, a

	; 9 bytes, 8 or 9 cycles
	ld l, a
	rlca ; nor add a
	ld a, $ff
	jr c, .ok
	ld a, $00
.ok
	ld h, a

(Those would be applying the "Test whether a is negative (compare a to $80)" optimization to the first way.)

And don't do this:

	; 6 bytes, 6 cycles
	ld l, a
	cp $80
	ccf
	sbc a
	ld h, a

(That would be applying the "Set a to one constant or another depending on the carry flag" optimization to the first way.)

Instead, do this:

	; 4 bytes, 4 cycles
	ld l, a
	rlca ; or add a
	sbc a
	ld h, a

(That applies both optimizations to the first way.)
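A quick way to convince yourself of the four-byte version is to model it: rlca (or add a) puts the sign bit in the carry flag, and sbc a expands it to $ff or 0. The Python below is a sketch with invented names, treating a as a two's-complement byte:

```python
def sign_extend(a):
    l = a                         # ld l, a
    carry = a >> 7                # rlca (or add a) moves bit 7 into carry
    h = 0xFF if carry else 0x00   # sbc a
    return (h << 8) | l           # ld h, a

def reference(a):
    signed = a - 256 if a >= 0x80 else a
    return signed & 0xFFFF

all_match = all(sign_extend(a) == reference(a) for a in range(256))
```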

Increment or decrement a 16-bit register

When possible, avoid doing this:

	inc hl ; 1 byte, 2 cycles

	dec hl ; 1 byte, 2 cycles

If the low byte definitely won't overflow, then do this:

	inc l ; 1 byte, 1 cycle

	dec l ; 1 byte, 1 cycle

This is applicable, for instance, if you're reading a data table via hl one byte at a time, it has no more than 256 entries, and it's in its own SECTION which has been ALIGNed to 8 bits. It's unlikely to apply to pokecrystal's existing systems.

Add or subtract the carry flag from a 16-bit register

(The example uses hl, but bc or de would also work.)

Don't do this:

	; 8 bytes, 8 cycles
	ld a, l ; nor ld a, 0
	adc 0   ; nor adc l
	ld l, a
	ld a, h ; nor ld a, 0
	adc 0   ; nor adc h
	ld h, a

	; 8 bytes, 8 cycles
	ld a, l ; nor ld a, 0
	sbc 0   ; nor sbc l
	ld l, a
	ld a, h ; nor ld a, 0
	sbc 0   ; nor sbc h
	ld h, a

And don't do this:

	; 7 bytes, 7 cycles
	ld a, l ; nor ld a, 0
	adc 0   ; nor adc l
	ld l, a
	adc h
	sub l
	ld h, a

	; 7 bytes, 7 cycles
	ld a, l ; nor ld a, 0
	sbc 0   ; nor sbc l
	ld l, a
	sbc l
	add h
	ld h, a

(That would be applying the "Set a to some value plus or minus carry" optimization to part of the first way.)

And don't do this:

	; 7 bytes, 7 cycles
	ld a, l ; nor ld a, 0
	adc 0   ; nor adc l
	ld l, a
	jr nc, .no_carry
	inc h
.no_carry

	; 7 bytes, 7 cycles
	ld a, l ; nor ld a, 0
	sbc 0   ; nor sbc l
	ld l, a
	jr nc, .no_carry
	dec h
.no_carry

(That would be applying the "Add or subtract the carry flag from a register besides a" optimization to part of the first way.)

Instead, do this:

	; 3 bytes, 3 or 4 cycles
	jr nc, .no_carry
	inc hl
.no_carry

	; 3 bytes, 3 or 4 cycles
	jr nc, .no_carry
	dec hl
.no_carry

Load from an address to `hl`

Don't do this:

	; 8 bytes, 10 cycles
	ld a, [wFoobar] ; LSB first
	ld l, a
	ld a, [wFoobar+1]
	ld h, a

Instead, do this:

	; 6 bytes, 8 cycles
	ld hl, wFoobar
	ld a, [hli]
	ld h, [hl]
	ld l, a

And don't do this:

	; 8 bytes, 10 cycles
	ld a, [wFoobar] ; MSB first
	ld h, a
	ld a, [wFoobar+1]
	ld l, a

Instead, do this:

	; 6 bytes, 8 cycles
	ld hl, wFoobar
	ld a, [hli]
	ld l, [hl]
	ld h, a

Load from an address to `sp`

Don't do this:

	; 9 bytes, 12 cycles
	ld a, [wFoobar]
	ld l, a
	ld a, [wFoobar+1]
	ld h, a
	ld sp, hl

And don't do this:

	; 7 bytes, 10 cycles
	ldh a, [hFoobar]
	ld l, a
	ldh a, [hFoobar+1]
	ld h, a
	ld sp, hl

And don't do this:

	; 7 bytes, 10 cycles
	ld hl, wFoobar
	ld a, [hli]
	ld h, [hl]
	ld l, a
	ld sp, hl

(That would be applying the "Load from an address to hl" optimization to the first way.)

Instead, do this:

	; 5 bytes, 8 cycles
	ld sp, wFoobar
	pop hl
	ld sp, hl

Or if the address is already in hl, then don't do this:

	; 4 bytes, 7 cycles
	ld a, [hli]
	ld h, [hl]
	ld l, a
	ld sp, hl

Instead, do this:

	; 3 bytes, 7 cycles
	ld sp, hl
	pop hl
	ld sp, hl

Negate a 16-bit register

(The example uses hl, but bc or de would also work.)

Don't do this:

	; 9 bytes, 10 cycles
	ld a, $ff
	xor h
	ld h, a
	ld a, $ff
	xor l
	ld l, a
	inc hl

And don't do this:

	; 7 bytes, 8 cycles
	ld a, h
	cpl
	ld h, a
	ld a, l
	cpl
	ld l, a
	inc hl

And don't do this:

	; 7 bytes, 7 cycles
	xor a
	sub l
	ld l, a
	ld a, 0
	sbc h
	ld h, a

Instead, do this:

	; 6 bytes, 6 cycles
	xor a
	sub l
	ld l, a
	sbc a
	sub h
	ld h, a
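In the six-byte version, sub l borrows unless l is 0, so sbc a yields $ff or 0, and sub h then produces -h minus that borrow, which is the high byte of -hl. A Python sketch that checks all 65536 inputs (the function name is made up, and flags other than the borrow are ignored):

```python
def negate_hl(h, l):
    a = 0                           # xor a
    borrow = 1 if l > 0 else 0      # sub l borrows unless l == 0
    a = (a - l) & 0xFF              # sub l
    l = a                           # ld l, a
    a = 0xFF if borrow else 0x00    # sbc a
    a = (a - h) & 0xFF              # sub h
    h = a                           # ld h, a
    return (h << 8) | l

all_match = all(
    negate_hl(hl >> 8, hl & 0xFF) == (-hl) & 0xFFFF
    for hl in range(0x10000)
)
```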

Exchange two 16-bit registers

(The example uses hl and de, but any pair of bc, de, or hl would also work.)

If you care about speed, then do this:

	; 6 bytes, 6 cycles
	ld a, d
	ld d, h
	ld h, a
	ld a, e
	ld e, l
	ld l, a

If you care about size, then do this:

	; 4 bytes, 9 cycles
	push de
	ld d, h
	ld e, l
	pop hl

Add two 16-bit registers

(The example adds bc to de, but adding de to bc would also work.)

Do this:

	; 6 bytes, 6 cycles
	ld a, e
	add c
	ld e, a
	ld a, d
	adc b
	ld d, a

Or if you can spare hl and really want to optimize for size, then do this:

	; 5 bytes, 6 cycles; uses hl
	ld h, d
	ld l, e
	add hl, bc
	ld d, h
	ld e, l

Or if hl is one of the registers you want to add, then do this:

	; 1 byte, 2 cycles
	add hl, bc

Subtract two 16-bit registers

(The example uses hl and de, but any pair of bc, de, or hl would also work.)

Don't do this:

	; 8 bytes, 10 cycles; modifies subtrahend de
	ld a, d
	cpl
	ld d, a
	ld a, e
	cpl
	ld e, a
	inc de
	add hl, de

And don't do this:

	; 7 bytes, 8 cycles; modifies subtrahend de
	xor a
	sub e
	ld e, a
	sbc a
	sub d
	ld d, a
	add hl, de

Instead, do this:

	; 6 bytes, 6 cycles
	ld a, l
	sub e
	ld l, a
	ld a, h
	sbc d
	ld h, a

Load two constants into a register pair

(The example uses bc, but hl or de would also work.)

Don't do this:

	; 4 bytes, 4 cycles
	ld b, FOO
	ld c, BAR

Instead, do this:

	ld bc, FOO << 8 | BAR ; 3 bytes, 3 cycles

Or better, use the lb macro in macros/code.asm:

	lb bc, FOO, BAR ; 3 bytes, 3 cycles

Load a constant into `[hl]`

Don't do this:

	; 3 bytes, 4 cycles
	ld a, FOOBAR
	ld [hl], a

Instead, do this:

	ld [hl], FOOBAR ; 2 bytes, 3 cycles

Increment or decrement `[hl]`

Don't do this:

	; 3 bytes, 5 cycles
	ld a, [hl]
	inc a
	ld [hl], a

	; 3 bytes, 5 cycles
	ld a, [hl]
	dec a
	ld [hl], a

Instead, do this:

	inc [hl] ; 1 byte, 3 cycles

	dec [hl] ; 1 byte, 3 cycles

Load a constant into `[hl]` and increment or decrement `hl`

Don't do this:

	; 2 bytes, 4 cycles
	ld [hl], a
	inc hl

	; 2 bytes, 4 cycles
	ld [hl], a
	dec hl

Instead, do this:

	ld [hli], a ; 1 byte, 2 cycles

	ld [hld], a ; 1 byte, 2 cycles

And if you can use a, then don't do this:

	; 3 bytes, 5 cycles
	ld [hl], FOO
	inc hl

	; 3 bytes, 5 cycles
	ld [hl], FOO
	dec hl

Instead, do this:

	; 3 bytes, 4 cycles
	ld a, FOO
	ld [hli], a

	; 3 bytes, 4 cycles
	ld a, FOO
	ld [hld], a

Load a register into `[hl]` and increment or decrement `hl`

(The example uses b, but any of c, d, e, h, or l would also work.)

Do this:

	; 2 bytes, 4 cycles
	ld [hl], b
	inc hl

	; 2 bytes, 4 cycles
	ld [hl], b
	dec hl

Or if you can use a, then do this:

	; 2 bytes, 3 cycles
	ld a, b
	ld [hli], a

	; 2 bytes, 3 cycles
	ld a, b
	ld [hld], a

Branching (control flow)

Relative jumps

Don't do this:

	jp Somewhere ; 3 bytes, 4 cycles

Instead, do this:

	jr Somewhere ; 2 bytes, 3 cycles

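This only applies if Somewhere is close enough to the jump: the jr offset is a signed 8-bit value, so the target must lie within -128 to +127 bytes of the end of the jr instruction.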

You can define a jmp macro to use instead of jp, which will warn you when it can be jr instead:

MACRO jmp
	jp \#
	assert warn, (\<_NARG>) - @ > 127 || (\<_NARG>) - @ < -129, "jp can be jr"
ENDM

Compare `a` to 0

Don't do this:

	cp 0 ; 2 bytes, 2 cycles

And don't do this:

	or 0 ; 2 bytes, 2 cycles

And don't do this:

	and $ff ; 2 bytes, 2 cycles

Instead, do this:

	or a ; 1 byte, 1 cycle

Or do this:

	and a ; 1 byte, 1 cycle

Compare `a` to 1

Do this:

	cp 1 ; 2 bytes, 2 cycles; updates Z and C flags

Or if you don't care about the value in a, and don't need to set the carry flag, then do this:

	dec a ; 1 byte, 1 cycle; decrements a, updates Z flag

Note that you can still do inc a afterwards to restore the value; the sequence is then the same size as cp 1 and one cycle faster if the jump is taken. Compare this:

	; 4 bytes, 4 or 5 cycles
	cp 1
	jr z, .equals1

with this:

	; 4 bytes, 4 cycles
	dec a
	jr z, .equals1
	inc a

Compare `a` to 255

(255, or $FF in hexadecimal, is the same as −1 due to two's complement.)

Do this:

	cp $ff ; 2 bytes, 2 cycles; updates Z and C flags

Or if you don't care about the value in a, and don't need to set the carry flag, then do this:

	inc a ; 1 byte, 1 cycle; increments a, updates Z flag

Note that you can still do dec a afterwards to restore the value; the sequence is then the same size as cp $ff and one cycle faster if the jump is taken. Compare this:

	; 4 bytes, 4 or 5 cycles
	cp $ff
	jr z, .equals255

with this:

	; 4 bytes, 4 cycles
	inc a
	jr z, .equals255
	dec a

Compare `a` to 0 after masking it

Don't do this:

	; 3 bytes, 3 cycles; sets zero flag if a == 0
	and MASK
	and a

Instead, do this:

	and MASK ; 2 bytes, 2 cycles; sets zero flag if a == 0

Compare `a` to a mask after masking it

Don't do this:

	; 4 bytes, 4 cycles; sets zero flag if a == MASK and carry flag if a < MASK
	and MASK
	cp MASK

If you don't need to set the carry flag, and don't need the masked value of a, then do this:

	; 3 bytes, 3 cycles; sets zero flag if every bit of MASK was set in a
	or ~MASK ; or $ff ^ MASK, or $ff - MASK
	inc a

Or do this:

	; 3 bytes, 3 cycles; sets zero flag if every bit of MASK was set in a
	cpl
	and MASK

Test whether `a` is negative (compare `a` to $80)

If you don't need to preserve the value in a, then don't do this:

	; 4 bytes, 4 or 5 cycles
	cp $80
	jr nc, .negative

And don't do this:

	; 4 bytes, 4 or 5 cycles
	bit 7, a
	jr nz, .negative

Instead, do this:

	; 3 bytes, 3 or 4 cycles; modifies a
	rlca ; or add a
	jr c, .negative

Subroutines (functions)

Tail call optimization

Don't do this:

	; 4 bytes, 10 cycles
	call Function
	ret

Instead, do this:

	jp Function ; 3 bytes, 4 cycles

Call `hl`

Don't do this:

	; 5 bytes, 8 cycles
	(some code)
	ld de, .return
	push de
	jp hl

.return:
	(some more code)

Instead, do this:

	; 3 bytes, 6 cycles
	; (4 bytes, 7 cycles, counting the definition of _hl_)
	(some code)
	call _hl_
	(some more code)

_hl_ is a routine already defined in home/call_regs.asm:

_hl_::
	jp hl

Inlining

Don't do this:

	; 4 additional bytes, 10 additional cycles
	(some code)
	call Function
	(some more code)

Function:
	(function code)
	ret

if Function is only called a handful of times. Instead, do:

	(some code)

	; Function
	(function code)

	(some more code)

You shouldn't do this if Function used any returns besides the one at the very end, or if inlining its code would make some jrs too distant from their targets.

Fallthrough

Don't do this:

	(some code)
	call Function
	ret

Function:
	(function code)
	ret

And don't do this:

	(some code)
	jp Function

Function:
	(function code)
	ret

Instead, do this:

	(some code)
	; fallthrough
Function:
	(function code)
	ret

Fallthrough is what you get when you combine inlining with tail calls. You can still call Function elsewhere, but one tail call can be optimized into a fallthrough.

Conditional fallthrough

(The example uses z, but nz, c, or nc would also work.)

Don't do this:

	(some code)
	jr z, .foo
	jr .bar

.foo
	(foo code)

.bar
	(bar code)

Instead, do this:

	(some code)
	jr nz, .bar
	; fallthrough
.foo
	(foo code)

.bar
	(bar code)

Conditional return

(The example uses z, but nz, c, or nc would also work.)

Don't do this:

	; 3 bytes, 3 or 6 cycles
	jr z, .skip
	ret
.skip
	...

And don't do this:

	; 3 bytes, 7 or 2 cycles
	jr nz, .return
	...

.return
	ret

Instead, do this:

	; 1 byte, 5 or 2 cycles
	ret nz
	...

Conditional call

(The example uses z, but nz, c, or nc would also work.)

Don't do this:

	; 5 bytes, 3 or 8 cycles
	jr nz, .skip
	call Foo
.skip

Instead, do this:

	; 3 bytes, 6 or 3 cycles
	call z, Foo

And don't do this:

	; 5 bytes, 3 or 6 cycles
	jr nz, .skip
	jp Foo
.skip

Instead, do this:

	; 3 bytes, 4 or 3 cycles
	jp z, Foo

Conditional `rst $38`

(The example uses z, but nz, c, or nc would also work.)

Don't do this:

	; 5 bytes, 3 or 14 cycles
	call z, RstVector38
	...

RstVector38:
	rst $38
	ret

And don't do this:

	; 3 bytes, 3 or 6 cycles
	jr nz, .no_rst_38
	rst $38
.no_rst_38
	...

And don't do this:

	; 3 bytes, 3 or 6 cycles
	call z, $0038
	...

Instead, do this:

	; 2 bytes, 2 or 7 cycles
	jr z, @ + 1 ; the byte for @ + 1 is $ff, which is the opcode for rst $38
	...

(The label @ evaluates to the current pc value, which in jr z, @ + 1 is right before the jr instruction. The instruction consists of two bytes, the opcode and the relative offset. @ + 1 evaluates to in-between those two bytes. The jr instruction encodes its offset relative to the end of the instruction, i.e. the next pc value after the instruction has been read, so the relative offset is -1, aka $ff.)
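The offset arithmetic can be spelled out with a few lines of Python. This is only a sketch of how a jr z instruction is encoded, with an arbitrary example address and an invented helper name; $28 and $ff are the opcodes of jr z and rst $38:

```python
JR_Z_OPCODE = 0x28    # opcode of jr z, e8
RST_38_OPCODE = 0xFF  # opcode of rst $38

def encode_jr_z(pc, target):
    """Encode `jr z, target` as assembled at address pc."""
    offset = target - (pc + 2)  # relative to the end of the 2-byte instruction
    if not -128 <= offset <= 127:
        raise ValueError("target out of jr range")
    return bytes([JR_Z_OPCODE, offset & 0xFF])

pc = 0x1234                        # arbitrary example address
encoded = encode_jr_z(pc, pc + 1)  # jr z, @ + 1
```

The second byte of encoded is the offset -1, stored as $ff. When the jump is taken, execution resumes at pc + 1, which is that same byte, now decoded as rst $38.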

Enable interrupts and return

Don't do this:

	; 2 bytes, 5 cycles
	ei
	ret

Instead, do this:

	; 1 byte, 4 cycles
	reti

Jump and lookup tables

Chain comparisons

Don't do this:

	cp 1
	jr z, .equals1
	cp 2
	jr z, .equals2
	cp 3
	jr z, .equals3
	...

Instead, do this:

	dec a
	jr z, .equals1
	dec a
	jr z, .equals2
	dec a
	jr z, .equals3
	...

Or do this:

	dec a
	ld hl, .jumptable
	ld e, a
	ld d, 0
	add hl, de
	add hl, de
	ld a, [hli]
	ld h, [hl]
	ld l, a
	jp hl

.jumptable:
	dw .equals1
	dw .equals2
	dw .equals3
	...

Or better, do:

	dec a
	ld hl, .jumptable
	rst JumpTable
	ret

.jumptable:
	dw .equals1
	dw .equals2
	dw .equals3
	...

JumpTable is an rst routine already defined in home/header.asm:

JumpTable::
	push de
	ld e, a
	ld d, 0
	add hl, de
	add hl, de
	ld a, [hli]
	ld h, [hl]
	ld l, a
	pop de
	jp hl

Off-by-one `AddNTimes`

Don't do this:

	ld hl, Foo
	ld bc, BAR
	dec a
	call AddNTimes

Instead, as long as you don't need to add 255 times when a is 0, then do this:

	ld hl, Foo - BAR
	ld bc, BAR
	call AddNTimes
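The reasoning is that starting one entry earlier compensates for not decrementing a. The sketch below models AddNTimes in Python under the assumption that it returns immediately when a is 0 and otherwise adds bc to hl a times; the FOO and BAR values are hypothetical stand-ins for a table address and entry size:

```python
def add_n_times(hl, bc, a):
    # Assumed behaviour: no-op when a == 0, else hl += bc, a times (16-bit wrap).
    if a == 0:
        return hl
    return (hl + bc * a) & 0xFFFF

FOO, BAR = 0x4000, 0x0030  # hypothetical table address and entry size

def with_dec(a):
    return add_n_times(FOO, BAR, (a - 1) & 0xFF)      # dec a / call AddNTimes

def with_offset(a):
    return add_n_times((FOO - BAR) & 0xFFFF, BAR, a)  # ld hl, Foo - BAR

same_for_nonzero = all(with_dec(a) == with_offset(a) for a in range(1, 256))
```

The two versions only differ when a is 0: the original wraps a around to 255 and adds 255 times, while the optimized one leaves hl at Foo - BAR.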
