Optimizing assembly code - pret/pokecrystal GitHub Wiki
Sometimes the simplest way to write something in assembly code isn't the best. All of your resources are limited: CPU speed, ROM size, RAM space, register use. You can rewrite code to use those resources more efficiently (sometimes by trading one for another).
Most of these tricks come from Jeff's GB Assembly Code Tips v1.0, WikiTI's Z80 Optimization page, z80 Heaven's optimization tutorial, and GBDev Wiki's ASM Snippets. (Note that the Game Boy CPU's assembly is called SM83, or colloquially GBZ80. It is not the same as Z80 assembly; the Z80 CPU has more registers and some different instructions.)
WikiTI's advice fully applies here:
Note that the following tricks act much like a peephole optimizer and are the last optimization step: remember to first optimize your algorithm and register allocation before applying any of the following if you really want the fastest speed and the smallest code.
Also note that nearly every trick makes the code less understandable, so documenting them is a good idea. You can easily forget how they work after some time away from that part of the code.
Be warned that some tricks are not exactly equivalent to the normal way and may have exceptions on their use; comments warn about them. Some tricks apply to other cases, but again you have to be careful.
There are some tricks that are nothing more than the correct use of the available instructions on the Z80. Keeping an instruction set summary helps to visualize what you can do during coding.
(There's also a "cheat sheet" table of instructions summarizing their bytes, cycles, and affected flags, if you don't need a long listing of what each one does.)
- 8-bit registers
  - Set a to 0
  - Increment or decrement a
  - Multiply a by 2
  - Invert the bits of a
  - Rotate the bits of a
  - Load from HRAM to a or from a to HRAM
  - Set a to some constant minus a
  - Set a to one constant or another depending on the carry flag
  - Increment or decrement a when the carry flag is set
  - Increment or decrement a when the carry flag is not set
  - Toggle a between two different constants
  - Divide a by 8 (shift a right 3 bits)
  - Divide a by 16 (shift a right 4 bits)
  - Set a to some value plus or minus carry
  - Add or subtract the carry flag from a register besides a
  - Reverse the bits of a
  - Count the set bits of a register besides a
  - Merge some bits of a register with a
- 16-bit registers
  - Multiply hl by 2
  - Add a to a 16-bit register
  - Subtract a constant from a 16-bit register
  - Set a 16-bit register to a plus a constant
  - Set a 16-bit register to a multiplied by 16
  - Sign-extend a into a 16-bit register
  - Increment or decrement a 16-bit register
  - Add or subtract the carry flag from a 16-bit register
  - Load from an address to hl
  - Load from an address to sp
  - Negate a 16-bit register
  - Exchange two 16-bit registers
  - Add two 16-bit registers bc and de
  - Subtract two 16-bit registers
  - Load two constants into a register pair
  - Load a constant into [hl]
  - Increment or decrement [hl]
  - Load a constant into [hl] and increment or decrement hl
  - Load a register into [hl] and increment or decrement hl
- Branching (control flow)
- Subroutines (functions)
- Jump and lookup tables
8-bit registers
Set a to 0
Don't do this:
ld a, 0 ; 2 bytes, 2 cycles; no changes to flags
Instead, do this:
xor a ; 1 byte, 1 cycle; sets flags C to 0 and Z to 1
Or do this:
sub a ; 1 byte, 1 cycle; sets flags C to 0 and Z to 1
Don't use the optimized versions if you need to preserve flags. As such, ld a, 0 must be left intact in the code below:
ld a, [wIsTrainerBattle]
and a ; sets flag Z to 1 if [wIsTrainerBattle] == 0 or else to 0
ld a, 0 ; sets a to 0 without affecting flags
jr nz, .is_trainer_battle
... ; is not trainer battle
Increment or decrement a
When possible, avoid doing this:
add 1 ; 2 bytes, 2 cycles; sets carry for -1 to 0 overflow
sub 1 ; 2 bytes, 2 cycles; sets carry for 0 to -1 underflow
If you don't need to set the carry flag, then do this:
inc a ; 1 byte, 1 cycle
dec a ; 1 byte, 1 cycle
Multiply a by 2
Don't do this:
sla a ; 2 bytes, 2 cycles
Instead, do this:
add a ; 1 byte, 1 cycle
Invert the bits of a
Don't do this:
xor $ff ; 2 bytes, 2 cycles
Instead, do this:
cpl ; 1 byte, 1 cycle
Rotate the bits of a
Don't do this:
rl a ; 2 bytes, 2 cycles; updates Z and C flags
rlc a ; 2 bytes, 2 cycles; updates Z and C flags
rr a ; 2 bytes, 2 cycles; updates Z and C flags
rrc a ; 2 bytes, 2 cycles; updates Z and C flags
Instead, do this:
rla ; 1 byte, 1 cycle; updates C flag
rlca ; 1 byte, 1 cycle; updates C flag
rra ; 1 byte, 1 cycle; updates C flag
rrca ; 1 byte, 1 cycle; updates C flag
The exception is if you need to set the zero flag when the operation results in 0 for a; the two-byte operations can set z, the one-byte operations cannot.
Load from HRAM to a or from a to HRAM
Don't do this:
ld a, [hFoobar] ; 3 bytes, 4 cycles
ld [hFoobar], a ; 3 bytes, 4 cycles
Instead, do this:
ldh a, [hFoobar] ; 2 bytes, 3 cycles
ldh [hFoobar], a ; 2 bytes, 3 cycles
("What's foobar?")
Set a to some constant minus a
Don't do this:
; 4 bytes, 4 cycles
ld b, a
ld a, FOOBAR
sub b
Instead, do this:
; 3 bytes, 3 cycles
cpl
add FOOBAR + 1
Or if the constant is zero (i.e. FOOBAR == 0 and FOOBAR - a == -a), then do this:
; 2 bytes, 2 cycles
cpl
inc a
Or if the constant is $FF (aka −1) (i.e. FOOBAR == $FF and FOOBAR - a == ~a), then do this:
; 1 byte, 1 cycle
cpl
Set a to one constant or another depending on the carry flag
(The example sets a to CVAL if the carry flag is set (c), or NCVAL if the carry flag is not set (nc).)
Don't do this:
; 6 bytes, 5 or 6 cycles
ld a, CVAL
jr c, .carry
ld a, NCVAL
.carry
And don't do this:
; 6 bytes, 5 or 6 cycles
ld a, NCVAL
jr nc, .no_carry
ld a, CVAL
.no_carry
And if either is 0, don't do this:
; 5 bytes, 5 cycles
ld a, CVAL ; nor NCVAL
jr c, .carry ; nor jr nc
xor a
.carry
And if either is 1 more or less than the other, don't do this:
; 5 bytes, 5 cycles
ld a, CVAL ; nor NCVAL
jr c, .carry ; nor jr nc
inc a ; nor dec a
.carry
Instead use sbc a, which copies the carry flag to all bits of a. So do this:
; 5 bytes, 5 cycles
sbc a ; if carry, then $ff, else 0
and CVAL - NCVAL ; $ff becomes CVAL - NCVAL, 0 stays 0
add NCVAL ; CVAL - NCVAL becomes CVAL, 0 becomes NCVAL
Or do this:
; 5 bytes, 5 cycles
sbc a ; if carry, then $ff, else 0
and CVAL ^ NCVAL ; $ff becomes CVAL ^ NCVAL, 0 stays 0
xor NCVAL ; CVAL ^ NCVAL becomes CVAL, 0 becomes NCVAL
And if certain conditions apply, then do something more efficient:
| If this case... | ...then do this: |
|---|---|
| CVAL == $ff and NCVAL == 0 | ; 1 byte, 1 cycle<br>sbc a ; if carry, then $ff, else 0 |
| CVAL == 0 and NCVAL == $ff | ; 2 bytes, 2 cycles<br>ccf ; invert carry flag<br>sbc a ; if originally carry, then 0, else $ff |
| CVAL == 0 and NCVAL == 1 | ; 2 bytes, 2 cycles<br>sbc a ; if carry, then $ff aka -1, else 0<br>inc a ; -1 becomes 0, 0 becomes 1 |
| CVAL == $ff | ; 3 bytes, 3 cycles<br>sbc a ; if carry, then $ff, else 0<br>or NCVAL ; $ff stays $ff, $00 becomes NCVAL |
| NCVAL == 0 | ; 3 bytes, 3 cycles<br>sbc a ; if carry, then $ff, else 0<br>and CVAL ; $ff becomes CVAL, 0 stays 0 |
| CVAL == NCVAL - 1 | ; 3 bytes, 3 cycles<br>sbc a ; if carry, then $ff aka -1, else 0<br>add NCVAL ; -1 becomes NCVAL - 1 aka CVAL, 0 becomes NCVAL |
| CVAL == NCVAL - 2 | ; 3 bytes, 3 cycles<br>sbc a ; if carry, then $ff aka -1, else 0; doesn't change carry<br>sbc -NCVAL ; -1 becomes NCVAL - 2 aka CVAL, 0 becomes NCVAL |
| CVAL == NCVAL + 1 and NCVAL is even | ; 3 bytes, 3 cycles<br>ld a, NCVAL / 2<br>adc a ; a = a * 2 + carry |
| CVAL == 0 | ; 4 bytes, 4 cycles<br>ccf ; invert carry flag<br>sbc a ; if originally carry, then 0, else $ff<br>and NCVAL ; 0 stays 0, $ff becomes NCVAL |
| NCVAL == $ff | ; 4 bytes, 4 cycles<br>ccf ; invert carry flag<br>sbc a ; if originally carry, then 0, else $ff<br>or CVAL ; $00 becomes CVAL, $ff stays $ff |
| NCVAL == CVAL - 1 | ; 4 bytes, 4 cycles<br>ccf ; invert carry flag<br>sbc a ; if originally carry, then 0, else $ff aka -1<br>add CVAL ; -1 becomes CVAL - 1 aka NCVAL, 0 becomes CVAL |
| NCVAL == CVAL - 2 | ; 4 bytes, 4 cycles<br>ccf ; invert carry flag<br>sbc a ; if originally carry, then 0, else $ff aka -1; doesn't change carry<br>sbc -CVAL ; -1 becomes CVAL - 2 aka NCVAL, 0 becomes CVAL |
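The two general sbc a recipes can be checked exhaustively with a small model of the 8-bit arithmetic. The following is only a sketch in Python: the function names are made up for illustration, and it tracks the value of a but not the flags each sequence leaves behind.

```python
def sbc_a(carry):
    """Model `sbc a`: a - a - carry, so $ff if carry was set, else 0."""
    return (0 - carry) & 0xFF


def select_add(carry, cval, ncval):
    """Model `sbc a` / `and CVAL - NCVAL` / `add NCVAL`."""
    a = sbc_a(carry)
    a &= (cval - ncval) & 0xFF
    return (a + ncval) & 0xFF


def select_xor(carry, cval, ncval):
    """Model `sbc a` / `and CVAL ^ NCVAL` / `xor NCVAL`."""
    a = sbc_a(carry)
    a &= cval ^ ncval
    return a ^ ncval


def check_all():
    """Compare both recipes with the branching version for every byte pair."""
    for cval in range(256):
        for ncval in range(256):
            for carry in (0, 1):
                expected = cval if carry else ncval
                if select_add(carry, cval, ncval) != expected:
                    return False
                if select_xor(carry, cval, ncval) != expected:
                    return False
    return True
```

Calling check_all() covers every combination of CVAL, NCVAL, and carry, which is why the recipes need no special cases.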
Increment or decrement a when the carry flag is set
Don't do this:
; 3 bytes, 3 cycles
jr nc, .ok
inc a
.ok
; 3 bytes, 3 cycles
jr nc, .ok
dec a
.ok
Instead, do this:
adc 0 ; 2 bytes, 2 cycles
sbc 0 ; 2 bytes, 2 cycles
Increment or decrement a when the carry flag is not set
Don't do this:
; 3 bytes, 3 cycles
jr c, .ok
inc a
.ok
; 3 bytes, 3 cycles
jr c, .ok
dec a
.ok
Instead, do this:
sbc -1 ; 2 bytes, 2 cycles
adc -1 ; 2 bytes, 2 cycles
Toggle a between two different constants
Don't do this:
; 12 bytes, 9 or 10 cycles
cp FOO
jr z, .foo_to_bar
jr .bar_to_foo
.foo_to_bar
ld a, BAR
jr .done
.bar_to_foo
ld a, FOO
.done
...
And don't do this:
; 10 bytes, 7 or 9 cycles
cp FOO
jr z, .foo_to_bar ; nor jr nz, .bar_to_foo
ld a, FOO ; nor ld a, BAR
jr .done
.foo_to_bar ; nor .bar_to_foo
ld a, BAR ; nor ld a, FOO
.done
...
(That would be applying the "Conditional fallthrough" optimization to the first way.)
Instead, do this:
xor FOO ^ BAR ; 2 bytes, 2 cycles
(This works for the same reason as the XOR swap algorithm for swapping the values of two variables.)
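A quick way to convince yourself of the XOR toggle is to model it. This sketch uses two arbitrary placeholder constants (the values of FOO and BAR here are assumptions, any two distinct bytes behave the same way):

```python
FOO = 0x12  # hypothetical constant, chosen only for the demonstration
BAR = 0x34  # hypothetical constant, chosen only for the demonstration


def toggle(a):
    """Model `xor FOO ^ BAR` on an 8-bit value."""
    return a ^ (FOO ^ BAR)
```

Since FOO ^ (FOO ^ BAR) is BAR and BAR ^ (FOO ^ BAR) is FOO, applying it twice returns the starting value. Unlike the branching versions, which set a to FOO for any value other than FOO, the one-instruction form only acts as a toggle when a already holds FOO or BAR.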
Divide a by 8 (shift a right 3 bits)
Don't do this:
; 6 bytes, 9 cycles
; (15 bytes, at least 21 cycles, counting the definition of SimpleDivide)
ld c, 8 ; divisor
call SimpleDivide
ld a, b ; quotient
And don't do this:
; 6 bytes, 6 cycles
srl a
srl a
srl a
Instead, do this:
; 5 bytes, 5 cycles
rrca
rrca
rrca
and %00011111
Divide a by 16 (shift a right 4 bits)
Don't do this:
; 6 bytes, 9 cycles
; (15 bytes, at least 21 cycles, counting the definition of SimpleDivide)
ld c, 16 ; divisor
call SimpleDivide
ld a, b ; quotient
And don't do this:
; 8 bytes, 8 cycles
srl a
srl a
srl a
srl a
Instead, do this:
; 4 bytes, 4 cycles
swap a
and $f
Set a to some value plus or minus carry
(The example uses b and c, but any registers besides a would also work, including [hl].)
Don't do this:
; 4 bytes, 4 cycles
ld b, a
ld a, c
adc 0
; 4 bytes, 4 cycles
ld b, a
ld a, c
sbc 0
And don't do this:
; 4 bytes, 4 cycles
ld b, a
ld a, 0
adc c
; 4 bytes, 4 cycles
ld b, a
ld a, 0
sbc c
Instead, do this:
; 3 bytes, 3 cycles
ld b, a
adc c
sub b
; 3 bytes, 3 cycles
ld b, a
sbc b
add c
Also, don't do this:
; 5 bytes, 5 cycles
ld b, a
ld a, N
adc 0
; 5 bytes, 5 cycles
ld b, a
ld a, N
sbc 0
And don't do this:
; 5 bytes, 5 cycles
ld b, a
ld a, 0
adc N
; 5 bytes, 5 cycles
ld b, a
ld a, 0
sbc N
Instead, do this:
; 4 bytes, 4 cycles
ld b, a
adc N
sub b
; 4 bytes, 4 cycles
ld b, a
sbc b
add N
(If the original value of a was not backed up in b, this optimization would not apply.)
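The reason these shorter sequences work is that the backed-up copy in b cancels the old value of a, leaving only the other operand and the carry. A minimal Python sketch of that cancellation (function names are illustrative; only register values are modeled, not the resulting flags):

```python
def value_plus_carry(a, c, carry):
    """Model `ld b, a` / `adc c` / `sub b`; returns (a, b)."""
    b = a                        # ld b, a
    a = (a + c + carry) & 0xFF   # adc c
    a = (a - b) & 0xFF           # sub b
    return a, b


def value_minus_carry(a, c, carry):
    """Model `ld b, a` / `sbc b` / `add c`; returns (a, b)."""
    b = a                        # ld b, a
    a = (a - b - carry) & 0xFF   # sbc b
    a = (a + c) & 0xFF           # add c
    return a, b


def check_all():
    """Check a == c +/- carry and b == old a for every input."""
    for a in range(256):
        for c in range(256):
            for carry in (0, 1):
                if value_plus_carry(a, c, carry) != ((c + carry) & 0xFF, a):
                    return False
                if value_minus_carry(a, c, carry) != ((c - carry) & 0xFF, a):
                    return False
    return True
```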
Add or subtract the carry flag from a register besides a
(The example uses b, but any of c, d, e, h, l, or [hl] would also work.)
Don't do this:
; 4 bytes, 4 cycles
ld a, b
adc 0
ld b, a
; 4 bytes, 4 cycles
ld a, b
sbc 0
ld b, a
And don't do this:
; 4 bytes, 4 cycles
ld a, 0
adc b
ld b, a
; 4 bytes, 4 cycles
ld a, 0
sbc b
ld b, a
Instead, do this:
; 3 bytes, 3 cycles
jr nc, .no_carry
inc b
.no_carry
; 3 bytes, 3 cycles
jr nc, .no_carry
dec b
.no_carry
(This optimization is based on Retro Programming.)
Reverse the bits of a
(The example uses b, but any of c, d, e, h, l, or [hl] would also work.)
Don't do this:
; 25 bytes, 25 cycles
rept 8
rra ; nor rla
rl b ; nor rr b
endr
ld a, b
And don't do this:
; 17 bytes, 17 cycles
ld b, a
rlca
rlca
xor b
and $aa
xor b
ld b, a
rlca
rlca
rlca
rrc b
xor b
and $66
xor b
(That would be applying the exact Z80 optimization from Retro Programming, without any SM83-specific operations.)
Instead, do this:
; 15 bytes, 15 cycles
ld b, a
rlca
rlca
xor b
and $aa
xor b
ld b, a
swap b
xor b
and $33
xor b
rrca
Or if you really want to optimize for size over speed, then don't do this:
; 10 bytes, 59 cycles
ld bc, 8 ; lb bc, 0, 8
.loop
rra ; nor rla
rl b ; nor rr b
dec c
jr nz, .loop
ld a, b
Instead, do this:
; 8 bytes, 50 cycles
ld b, 1
.loop
rra
rl b
jr nc, .loop
ld a, b
Or if you can spare hl, then do this:
; 7 bytes, 50 cycles
ld h, a
ld a, $80
.loop
add hl, hl
rra
jr nc, .loop
Or if you really want to optimize for speed over total size, then do this:
; 6 bytes, 12 cycles
; (4 bytes, 5 cycles if you don't need the push hl/pop hl)
push hl
ld h, HIGH(ReversedBitTable)
ld l, a
ld a, [hl]
pop hl
; 256 bytes; placed in ROM0 or the same ROMX section as the bit reversal
SECTION "ReversedBitTable", ROM0, ALIGN[8]
ReversedBitTable::
for x, 256
; https://graphics.stanford.edu/~seander/bithacks.html#ReverseByteWith32Bits
db LOW(((((x * $802) & $22110) | ((x * $8020) & $88440)) * $10101) >> 16)
endr
(This optimization is based on WikiTI.)
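Bit-shuffling sequences like these are easy to get subtly wrong, so it can help to model them before trusting them. The sketch below (Python, with illustrative helper names) replays the 15-byte swap-based sequence and the table formula, and compares both against a plain string-reversal reference for all 256 inputs.

```python
def rlca(a):
    """Rotate an 8-bit value left by one bit."""
    return ((a << 1) | (a >> 7)) & 0xFF


def rrca(a):
    """Rotate an 8-bit value right by one bit."""
    return ((a >> 1) | (a << 7)) & 0xFF


def swap(a):
    """Exchange the high and low nybbles."""
    return ((a << 4) | (a >> 4)) & 0xFF


def reverse_reference(x):
    """Reverse the bits of a byte the slow, obvious way."""
    return int(f"{x:08b}"[::-1], 2)


def reverse_swap_sequence(x):
    """Replay the 15-byte sequence instruction by instruction."""
    a = x
    b = a              # ld b, a
    a = rlca(rlca(a))  # rlca / rlca
    a ^= b             # xor b
    a &= 0xAA          # and $aa
    a ^= b             # xor b
    b = a              # ld b, a
    b = swap(b)        # swap b
    a ^= b             # xor b
    a &= 0x33          # and $33
    a ^= b             # xor b
    return rrca(a)     # rrca


def table_entry(x):
    """Evaluate the table formula for one index, keeping the low byte."""
    product = (((x * 0x802) & 0x22110) | ((x * 0x8020) & 0x88440)) * 0x10101
    return (product >> 16) & 0xFF
```

Python integers do not wrap at 32 bits, but only bits 16 to 23 of the product are kept, so the model still yields the same byte as a 32-bit evaluation of the formula.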
Count the set bits of a register besides a
(The examples count the set bits of c and may also use b, but any registers besides a would also work.)
Don't do this:
; 26 bytes, 26 cycles
xor a
ld b, a
rept 8
rrc c
adc b
endr
Instead, do this:
; 20 bytes, 20 cycles
ld a, c
and $aa
cpl
rrca
adc c
ld b, a
and $33
ld c, a
xor b
rrca
rrca
add c
ld c, a
swap a
add c
and $0f
Or if you want to optimize for size over speed, then don't do this:
; 12 bytes, 68 cycles; counts bits in c, uses b
ld a, c
ld bc, $800 ; lb bc, 8, 0
.loop
add a
jr nc, .next
inc c
.next
dec b
jr nz, .loop
ld a, c
And don't do this:
; 11 bytes, up to 67 cycles; counts bits in c
ld a, c
ld c, 0
.loop
add a
jr nc, .next
inc c
.next
and a
jr nz, .loop
ld a, c
And don't do this:
; 9 bytes, up to 63 cycles; counts bits in c
ld a, -1
.next
inc a
.loop
srl c
jr c, .next
jr nz, .loop
Instead, do this:
; 7 bytes, up to 49 cycles; counts bits in c
ld a, c
add c
.loop
sub c
srl c
jr nz, .loop
(This optimization is based on Bit Twiddling Hacks.)
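The 7-byte loop relies on the identity that the number of set bits in c equals c minus the sum of c shifted right by 1, 2, 3, and so on. Here is a small Python sketch of the loop (the function name is illustrative), which starts from 2 * c and subtracts c, c >> 1, c >> 2, ... until the shifted value reaches zero:

```python
def popcount_loop(c):
    """Replay `ld a, c` / `add c` / loop of `sub c`, `srl c`, `jr nz`."""
    a = c                   # ld a, c
    a = (a + c) & 0xFF      # add c
    while True:
        a = (a - c) & 0xFF  # sub c
        c >>= 1             # srl c (sets Z when c becomes 0)
        if c == 0:          # jr nz, .loop falls through once c is 0
            break
    return a
```

The intermediate values wrap modulo 256, but the final count never exceeds 8, so the wraparound cancels out.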
Merge some bits of a register with a
(The example uses b, but any of c, d, e, h, l, or [hl] would also work.)
Don't do this:
; 7 bytes, 7 cycles; sets a = (a & MASK) | (b & ~MASK)
and MASK
ld c, a
ld a, b
and ~MASK ; or $ff ^ MASK, or $ff - MASK
or c
Instead, do this:
; 4 bytes, 4 cycles; no third register
xor b
and MASK
xor b
(For example, if MASK were $f0, then ~MASK would be $0f, and this would merge the high nybble of a with the low nybble of b.)
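The xor/and/xor merge can be checked against the straightforward mask-and-combine formula. This is a minimal Python sketch with an illustrative function name; the masks tried in check_masks are arbitrary examples.

```python
def merge_bits(a, b, mask):
    """Model `xor b` / `and MASK` / `xor b`."""
    a ^= b      # bits where a and b differ
    a &= mask   # keep the differences only inside MASK
    a ^= b      # inside MASK this restores a, outside it leaves b
    return a


def check_masks(masks=(0x00, 0x0F, 0xF0, 0xA5, 0xFF)):
    """Compare with (a & MASK) | (b & ~MASK) for every a and b."""
    for mask in masks:
        for a in range(256):
            for b in range(256):
                expected = (a & mask) | (b & ~mask & 0xFF)
                if merge_bits(a, b, mask) != expected:
                    return False
    return True
```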
Or if you can spare hl, then don't do this:
; 9 bytes, 10 cycles; sets a = (a & MASK) | ([wFoobar] & ~MASK)
and MASK
ld c, a
ld a, [wFoobar]
and ~MASK
or c
Instead, do this:
; 7 bytes, 9 cycles; uses hl
ld hl, wFoobar
xor [hl]
and MASK
xor [hl]
16-bit registers
Multiply hl by 2
Don't do this:
; 4 bytes, 4 cycles
sla l
rl h
Instead, do this:
add hl, hl ; 1 byte, 2 cycles
Add a to a 16-bit register
(The example uses hl, but bc or de would also work.)
Don't do this:
; 6 bytes, 6 cycles
add l
ld l, a
ld a, 0
adc h
ld h, a
And don't do this:
; 6 bytes, 6 cycles
add l
ld l, a
ld a, h
adc 0
ld h, a
And don't do this:
; 5 bytes, 5 cycles
add l
ld l, a
jr nc, .no_carry
inc h
.no_carry
Instead, do this:
; 5 bytes, 5 cycles; no labels
add l
ld l, a
adc h
sub l
ld h, a
Or if you can spare another 16-bit register and want to optimize for size over speed, then do this:
; 4 bytes, 5 cycles
ld d, 0
ld e, a
add hl, de
Subtract a constant from a 16-bit register
(The example uses hl, but bc or de would also work.)
Do this:
; 8 bytes, 8 cycles
ld a, l
sub LOW(FooBar)
ld l, a
ld a, h
sbc HIGH(FooBar)
ld h, a
Or if the constant is 8-bit (i.e. HIGH(FooBar) == 0), then do this:
; 7 bytes, 7 cycles
ld a, l
sub FooBar
ld l, a
jr nc, .no_carry
dec h
.no_carry
(This is a case of "Add or subtract the carry flag from a register besides a", applied to the high part of a 16-bit register.)
Or if you can spare another 16-bit register, do this:
; 4 bytes, 5 cycles
ld de, -FooBar
add hl, de
Set a 16-bit register to a plus a constant
(The example uses hl, but bc or de would also work.)
Don't do this:
; 7 bytes, 8 cycles; uses another 16-bit register
ld e, a
ld d, 0
ld hl, FooBar
add hl, de
And don't do this:
; 8 bytes, 8 cycles
ld hl, FooBar
add l
ld l, a
adc h
sub l
ld h, a
And don't do this:
; 8 bytes, 8 cycles
ld h, HIGH(FooBar)
add LOW(FooBar)
ld l, a
jr nc, .no_carry
inc h
.no_carry
Instead, do this:
; 7 bytes, 7 cycles
add LOW(FooBar)
ld l, a
adc HIGH(FooBar)
sub l
ld h, a
Or if the constant is 8-bit and nonzero (i.e. 0 < FooBar < 256), then do this:
; 6 bytes, 6 cycles
sub LOW(-FooBar)
ld l, a
sbc a
inc a
ld h, a
Or if the constant is zero (i.e. FooBar == 0 and a + FooBar == a), then do this:
; 3 bytes, 3 cycles
ld l, a
ld h, 0
Set a 16-bit register to a multiplied by 16
(The example uses hl, but bc or de would also work.)
You can do this:
; 7 bytes, 11 cycles
ld l, a
ld h, 0
add hl, hl
add hl, hl
add hl, hl
add hl, hl
; 7 bytes, 11 cycles
ld l, a
ld h, 0
rept 4
add hl, hl
endr
But if a is definitely small enough, and its value can be changed, then do one of these:
; 7 bytes, 10 cycles; sets a = a * 2; requires a < $80
add a
ld l, a
ld h, 0
add hl, hl
add hl, hl
add hl, hl
; 7 bytes, 9 cycles; sets a = a * 4; requires a < $40
add a
add a
ld l, a
ld h, 0
add hl, hl
add hl, hl
; 7 bytes, 8 cycles; sets a = a * 8; requires a < $20
add a
add a
add a
ld l, a
ld h, 0
add hl, hl
; 5 bytes, 5 cycles; sets a = a * 16; requires a < $10
swap a
ld l, a
ld h, 0
Or if the value of a can be changed and you want to optimize for speed over size, then do one of these:
; 8 bytes, 8 cycles; sets a = l
swap a
ld l, a
and $f
ld h, a
xor l
ld l, a
; 8 bytes, 8 cycles; sets a = h
swap a
ld h, a
and $f0
ld l, a
xor h
ld h, a
(This optimization is based on Plutiedev and GBDev Wiki's ASM Snippets.)
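The two swap-based sequences split the swapped byte into its nybbles: the old high nybble becomes h, and the old low nybble, moved up four bits, becomes l. A Python sketch of both variants (helper names are illustrative) makes it easy to check that hl ends up as a * 16 for every input:

```python
def swap(a):
    """Model `swap a`: exchange the high and low nybbles."""
    return ((a << 4) | (a >> 4)) & 0xFF


def times16_a_ends_as_l(x):
    """Replay the variant that leaves a equal to l; returns (h, l, a)."""
    a = swap(x)  # swap a
    l = a        # ld l, a
    a &= 0x0F    # and $f
    h = a        # ld h, a
    a ^= l       # xor l
    l = a        # ld l, a
    return h, l, a


def times16_a_ends_as_h(x):
    """Replay the variant that leaves a equal to h; returns (h, l, a)."""
    a = swap(x)  # swap a
    h = a        # ld h, a
    a &= 0xF0    # and $f0
    l = a        # ld l, a
    a ^= h       # xor h
    h = a        # ld h, a
    return h, l, a


def check_all():
    """Check hl == a * 16 and the final value of a for every input."""
    for x in range(256):
        h, l, a = times16_a_ends_as_l(x)
        if (h << 8 | l) != x * 16 or a != l:
            return False
        h, l, a = times16_a_ends_as_h(x)
        if (h << 8 | l) != x * 16 or a != h:
            return False
    return True
```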
Sign-extend a into a 16-bit register
(The example uses hl, but bc or de would also work.)
Don't do this:
; 10 bytes, 9 or 10 cycles
ld l, a
cp $80 ; nor bit 7, a
ld a, $00
jr c, .ok ; nor jr z, .ok
ld a, $ff
.ok
ld h, a
And don't do these:
; 9 bytes, 9 cycles
ld l, a
cp $80 ; nor bit 7, a
ld a, $00
jr c, .ok ; nor jr z, .ok
dec a
.ok
ld h, a
; 9 bytes, 9 cycles
ld l, a
cp $80 ; nor bit 7, a
ld a, $ff
jr nc, .ok ; nor jr nz, .ok
inc a
.ok
ld h, a
And don't do these:
; 9 bytes, 8 or 9 cycles
ld l, a
rlca ; nor add a
ld a, $00
jr nc, .ok
ld a, $ff
.ok
ld h, a
; 9 bytes, 8 or 9 cycles
ld l, a
rlca ; nor add a
ld a, $ff
jr c, .ok
ld a, $00
.ok
ld h, a
(Those would be applying the "Test whether a is negative (compare a to $80)" optimization to the first way.)
And don't do this:
; 6 bytes, 6 cycles
ld l, a
cp $80
ccf
sbc a
ld h, a
(That would be applying the "Set a to one constant or another depending on the carry flag" optimization to the first way.)
Instead, do this:
; 4 bytes, 4 cycles
ld l, a
rlca ; or add a
sbc a
ld h, a
(That applies both optimizations to the first way.)
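In the final 4-byte version, rlca moves the sign bit into the carry flag and sbc a spreads it across the whole high byte. A short Python sketch of that sequence (the function name is illustrative) can be compared against ordinary two's-complement sign extension:

```python
def sign_extend(x):
    """Replay `ld l, a` / `rlca` / `sbc a` / `ld h, a`; returns hl."""
    l = x                   # ld l, a
    carry = x >> 7          # rlca copies bit 7 of a into the carry flag
    a = (0 - carry) & 0xFF  # sbc a: $ff if carry, else 0
    h = a                   # ld h, a
    return (h << 8) | l


def reference(x):
    """Interpret x as a signed byte and widen it to 16 bits."""
    signed = x - 256 if x >= 0x80 else x
    return signed & 0xFFFF
```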
Increment or decrement a 16-bit register
When possible, avoid doing this:
inc hl ; 1 byte, 2 cycles
dec hl ; 1 byte, 2 cycles
If the low byte definitely won't overflow, then do this:
inc l ; 1 byte, 1 cycle
dec l ; 1 byte, 1 cycle
This is applicable, for instance, if you're reading a data table via hl one byte at a time, it has no more than 256 entries, and it's in its own SECTION which has been ALIGNed to 8 bits. It's unlikely to apply to pokecrystal's existing systems.
Add or subtract the carry flag from a 16-bit register
(The example uses hl, but bc or de would also work.)
Don't do this:
; 8 bytes, 8 cycles
ld a, l ; nor ld a, 0
adc 0 ; nor adc l
ld l, a
ld a, h ; nor ld a, 0
adc 0 ; nor adc h
ld h, a
; 8 bytes, 8 cycles
ld a, l ; nor ld a, 0
sbc 0 ; nor sbc l
ld l, a
ld a, h ; nor ld a, 0
sbc 0 ; nor sbc h
ld h, a
And don't do this:
; 7 bytes, 7 cycles
ld a, l ; nor ld a, 0
adc 0 ; nor adc l
ld l, a
adc h
sub l
ld h, a
; 7 bytes, 7 cycles
ld a, l ; nor ld a, 0
sbc 0 ; nor sbc l
ld l, a
sbc h
add l
ld h, a
(That would be applying the "Set a to some value plus or minus carry" optimization to part of the first way.)
And don't do this:
; 7 bytes, 7 cycles
ld a, l ; nor ld a, 0
adc 0 ; nor adc l
ld l, a
jr nc, .no_carry
inc h
.no_carry
; 7 bytes, 7 cycles
ld a, l ; nor ld a, 0
sbc 0 ; nor sbc l
ld l, a
jr nc, .no_carry
dec h
.no_carry
(That would be applying the "Add or subtract the carry flag from a register besides a" optimization to part of the first way.)
Instead, do this:
; 3 bytes, 3 or 4 cycles
jr nc, .no_carry
inc hl
.no_carry
; 3 bytes, 3 or 4 cycles
jr nc, .no_carry
dec hl
.no_carry
Load from an address to hl
Don't do this:
; 8 bytes, 10 cycles
ld a, [wFoobar] ; LSB first
ld l, a
ld a, [wFoobar+1]
ld h, a
Instead, do this:
; 6 bytes, 8 cycles
ld hl, wFoobar
ld a, [hli]
ld h, [hl]
ld l, a
And don't do this:
; 8 bytes, 10 cycles
ld a, [wFoobar] ; MSB first
ld h, a
ld a, [wFoobar+1]
ld l, a
Instead, do this:
; 6 bytes, 8 cycles
ld hl, wFoobar
ld a, [hli]
ld l, [hl]
ld h, a
Load from an address to sp
Don't do this:
; 9 bytes, 12 cycles
ld a, [wFoobar]
ld l, a
ld a, [wFoobar+1]
ld h, a
ld sp, hl
And don't do this:
; 7 bytes, 10 cycles
ldh a, [hFoobar]
ld l, a
ldh a, [hFoobar+1]
ld h, a
ld sp, hl
And don't do this:
; 7 bytes, 10 cycles
ld hl, wFoobar
ld a, [hli]
ld h, [hl]
ld l, a
ld sp, hl
(That would be applying the "Load from an address to hl" optimization to the first way.)
Instead, do this:
; 5 bytes, 8 cycles
ld sp, wFoobar
pop hl
ld sp, hl
Or if the address is already in hl, then don't do this:
; 4 bytes, 7 cycles
ld a, [hli]
ld h, [hl]
ld l, a
ld sp, hl
Instead, do this:
; 3 bytes, 7 cycles
ld sp, hl
pop hl
ld sp, hl
Negate a 16-bit register
(The example uses hl, but bc or de would also work.)
Don't do this:
; 9 bytes, 10 cycles
ld a, $ff
xor h
ld h, a
ld a, $ff
xor l
ld l, a
inc hl
And don't do this:
; 7 bytes, 8 cycles
ld a, h
cpl
ld h, a
ld a, l
cpl
ld l, a
inc hl
And don't do this:
; 7 bytes, 7 cycles
xor a
sub l
ld l, a
ld a, 0
sbc h
ld h, a
Instead, do this:
; 6 bytes, 6 cycles
xor a
sub l
ld l, a
sbc a
sub h
ld h, a
Exchange two 16-bit registers
(The example uses hl and de, but any pair of bc, de, or hl would also work.)
If you care about speed, then do this:
; 6 bytes, 6 cycles
ld a, d
ld d, h
ld h, a
ld a, e
ld e, l
ld l, a
If you care about size, then do this:
; 4 bytes, 9 cycles
push de
ld d, h
ld e, l
pop hl
Add two 16-bit registers bc and de
(The example adds bc to de, but adding de to bc would also work.)
Do this:
; 6 bytes, 6 cycles
ld a, e
add c
ld e, a
ld a, d
adc b
ld d, a
Or if you can spare hl and really want to optimize for size, then do this:
; 5 bytes, 6 cycles; uses hl
ld h, d
ld l, e
add hl, bc
ld d, h
ld e, l
Or if hl is one of the registers you want to add, then do this:
; 1 byte, 2 cycles
add hl, bc
Subtract two 16-bit registers
(The example uses hl and de, but any pair of bc, de, or hl would also work.)
Don't do this:
; 8 bytes, 10 cycles; modifies subtrahend de
ld a, d
cpl
ld d, a
ld a, e
cpl
ld e, a
inc de
add hl, de
And don't do this:
; 7 bytes, 8 cycles; modifies subtrahend de
xor a
sub e
ld e, a
sbc a
sub d
ld d, a
add hl, de
Instead, do this:
; 6 bytes, 6 cycles
ld a, l
sub e
ld l, a
ld a, h
sbc d
ld h, a
Load two constants into a register pair
(The example uses bc, but hl or de would also work.)
Don't do this:
; 4 bytes, 4 cycles
ld b, FOO
ld c, BAR
Instead, do this:
ld bc, FOO << 8 | BAR ; 3 bytes, 3 cycles
Or better, use the lb macro in macros/code.asm:
lb bc, FOO, BAR ; 3 bytes, 3 cycles
Load a constant into [hl]
Don't do this:
; 3 bytes, 4 cycles
ld a, FOOBAR
ld [hl], a
Instead, do this:
ld [hl], FOOBAR ; 2 bytes, 3 cycles
Increment or decrement [hl]
Don't do this:
; 3 bytes, 5 cycles
ld a, [hl]
inc a
ld [hl], a
; 3 bytes, 5 cycles
ld a, [hl]
dec a
ld [hl], a
Instead, do this:
inc [hl] ; 1 byte, 3 cycles
dec [hl] ; 1 byte, 3 cycles
Load a constant into [hl] and increment or decrement hl
Don't do this:
; 2 bytes, 4 cycles
ld [hl], a
inc hl
; 2 bytes, 4 cycles
ld [hl], a
dec hl
Instead, do this:
ld [hli], a ; 1 byte, 2 cycles
ld [hld], a ; 1 byte, 2 cycles
And if you can use a, then don't do this:
; 3 bytes, 5 cycles
ld [hl], FOO
inc hl
; 3 bytes, 5 cycles
ld [hl], FOO
dec hl
Instead, do this:
; 3 bytes, 4 cycles
ld a, FOO
ld [hli], a
; 3 bytes, 4 cycles
ld a, FOO
ld [hld], a
Load a register into [hl] and increment or decrement hl
(The example uses b, but any of c, d, e, h, or l would also work.)
Do this:
; 2 bytes, 4 cycles
ld [hl], b
inc hl
; 2 bytes, 4 cycles
ld [hl], b
dec hl
Or if you can use a, then do this:
; 2 bytes, 3 cycles
ld a, b
ld [hli], a
; 2 bytes, 3 cycles
ld a, b
ld [hld], a
Branching (control flow)
Don't do this:
jp Somewhere ; 3 bytes, 4 cycles
Instead, do this:
jr Somewhere ; 2 bytes, 3 cycles
This only applies if Somewhere fits in jr's signed 8-bit offset, i.e. it is within -128 to +127 bytes of the end of the jr instruction.
You can define a jmp macro to use instead of jp, which will warn you when it can be jr instead:
MACRO jmp
jp \#
assert warn, (\<_NARG>) - @ > 127 || (\<_NARG>) - @ < -129, "jp can be jr"
ENDM
Don't do this:
cp 0 ; 2 bytes, 2 cycles
And don't do this:
or 0 ; 2 bytes, 2 cycles
And don't do this:
and $ff ; 2 bytes, 2 cycles
Instead, do this:
or a ; 1 byte, 1 cycle
Or do this:
and a ; 1 byte, 1 cycle
Do this:
cp 1 ; 2 bytes, 2 cycles; updates Z and C flags
Or if you don't care about the value in a, and don't need to set the carry flag, then do this:
dec a ; 1 byte, 1 cycle; decrements a, updates Z flag
Note that you can still do inc a afterwards, which is one cycle faster if the jump is taken. Compare this:
; 4 bytes, 4 or 5 cycles
cp 1
jr z, .equals1
with this:
; 4 bytes, 4 cycles
dec a
jr z, .equals1
inc a
(255, or $FF in hexadecimal, is the same as −1 due to two's complement.)
Do this:
cp $ff ; 2 bytes, 2 cycles; updates Z and C flags
Or if you don't care about the value in a, and don't need to set the carry flag, then do this:
inc a ; 1 byte, 1 cycle; increments a, updates Z flag
Note that you can still do dec a afterwards, which is one cycle faster if the jump is taken. Compare this:
; 4 bytes, 4 or 5 cycles
cp $ff
jr z, .equals255
with this:
; 4 bytes, 4 cycles
inc a
jr z, .equals255
dec a
Don't do this:
; 3 bytes, 3 cycles; sets zero flag if a == 0
and MASK
and a
Instead, do this:
and MASK ; 2 bytes, 2 cycles; sets zero flag if a == 0
Don't do this:
; 4 bytes, 4 cycles; sets zero flag if a == MASK and carry flag if a < MASK
and MASK
cp MASK
If you don't need to set the carry flag, and don't need the masked value of a, then do this:
; 3 bytes, 3 cycles; sets zero flag if (a & MASK) was equal to MASK
or ~MASK ; or $ff ^ MASK, or $ff - MASK
inc a
Or do this:
; 3 bytes, 3 cycles; sets zero flag if (a & MASK) was equal to MASK
cpl
and MASK
If you don't need to preserve the value in a, then don't do this:
; 4 bytes, 4 or 5 cycles
cp $80
jr nc, .negative
And don't do this:
; 4 bytes, 4 or 5 cycles
bit 7, a
jr nz, .negative
Instead, do this:
; 3 bytes, 3 or 4 cycles; modifies a
rlca ; or add a
jr c, .negative
Subroutines (functions)
Don't do this:
; 4 bytes, 10 cycles
call Function
ret
Instead, do this:
jp Function ; 3 bytes, 4 cycles
Don't do this:
; 5 bytes, 8 cycles
(some code)
ld de, .return
push de
jp hl
.return:
(some more code)
Instead, do this:
; 3 bytes, 6 cycles
; (4 bytes, 7 cycles, counting the definition of _hl_)
(some code)
call _hl_
(some more code)
_hl_ is a routine already defined in home/call_regs.asm:
_hl_::
jp hl
If Function is only called a handful of times, don't do this:
; 4 additional bytes, 10 additional cycles
(some code)
call Function
(some more code)
Function:
(function code)
ret
Instead, do this:
(some code)
; Function
(function code)
(some more code)
You shouldn't do this if Function uses any returns besides the one at the very end, or if inlining its code would make some jrs too distant from their targets.
Don't do this:
(some code)
call Function
ret
Function:
(function code)
ret
And don't do this:
(some code)
jp Function
Function:
(function code)
ret
Instead, do this:
(some code)
; fallthrough
Function:
(function code)
ret
Fallthrough is what you get when you combine inlining with tail calls. You can still call Function elsewhere, but one tail call can be optimized into a fallthrough.
(The example uses z, but nz, c, or nc would also work.)
Don't do this:
(some code)
jr z, .foo
jr .bar
.foo
(foo code)
.bar
(bar code)
Instead, do this:
(some code)
jr nz, .bar
; fallthrough
.foo
(foo code)
.bar
(bar code)
(The example uses z, but nz, c, or nc would also work.)
Don't do this:
; 3 bytes, 3 or 6 cycles
jr z, .skip
ret
.skip
...
And don't do this:
; 3 bytes, 7 or 2 cycles
jr nz, .return
...
.return
ret
Instead, do this:
; 1 byte, 5 or 2 cycles
ret nz
...
(The example uses z, but nz, c, or nc would also work.)
Don't do this:
; 5 bytes, 3 or 8 cycles
jr nz, .skip
call Foo
.skip
Instead, do this:
; 3 bytes, 6 or 3 cycles
call z, Foo
And don't do this:
; 5 bytes, 3 or 6 cycles
jr nz, .skip
jp Foo
.skip
Instead, do this:
; 3 bytes, 4 or 3 cycles
jp z, Foo
(The example uses z, but nz, c, or nc would also work.)
Don't do this:
; 5 bytes, 3 or 14 cycles
call z, RstVector38
...
RstVector38:
rst $38
ret
And don't do this:
; 3 bytes, 3 or 6 cycles
jr nz, .no_rst_38
rst $38
.no_rst_38
...
And don't do this:
; 3 bytes, 3 or 6 cycles
call z, $0038
...
Instead, do this:
; 2 bytes, 2 or 7 cycles
jr z, @ + 1 ; the byte for @ + 1 is $ff, which is the opcode for rst $38
...
(The label @ evaluates to the current pc value, which in jr z, @ + 1 is right before the jr instruction. The instruction consists of two bytes, the opcode and the relative offset. @ + 1 evaluates to in-between those two bytes. The jr instruction encodes its offset relative to the end of the instruction, i.e. the next pc value after the instruction has been read, so the relative offset is -1, aka $ff.)
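The offset arithmetic in that explanation can be spelled out with a tiny encoder. This is only a sketch in Python: encode_jr_z is a made-up helper, and the address $0150 in the usage note is an arbitrary example. It uses the SM83 opcodes $28 for jr z and $ff for rst $38.

```python
JR_Z_OPCODE = 0x28    # opcode byte of `jr z, e8`
RST_38_OPCODE = 0xFF  # opcode byte of `rst $38`


def encode_jr_z(pc, target):
    """Encode `jr z, target` assembled at address pc as two bytes."""
    # The offset is measured from the end of the 2-byte instruction.
    offset = target - (pc + 2)
    if not -128 <= offset <= 127:
        raise ValueError("target is out of range for jr")
    return bytes([JR_Z_OPCODE, offset & 0xFF])
```

For example, encode_jr_z(0x0150, 0x0151) produces the bytes $28 $ff: when the jump is taken, execution continues at the second byte, which the CPU then decodes as rst $38.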
Don't do this:
; 2 bytes, 5 cycles
ei
ret
Instead, do this:
; 1 byte, 4 cycles
reti
Jump and lookup tables
Don't do this:
cp 1
jr z, .equals1
cp 2
jr z, .equals2
cp 3
jr z, .equals3
...
Instead, do this:
dec a
jr z, .equals1
dec a
jr z, .equals2
dec a
jr z, .equals3
...
Or do this:
dec a
ld hl, .jumptable
ld e, a
ld d, 0
add hl, de
add hl, de
ld a, [hli]
ld h, [hl]
ld l, a
jp hl
.jumptable:
dw .equals1
dw .equals2
dw .equals3
...
Or better, do this:
dec a
ld hl, .jumptable
rst JumpTable
ret
.jumptable:
dw .equals1
dw .equals2
dw .equals3
...
JumpTable is an rst routine already defined in home/header.asm:
JumpTable::
push de
ld e, a
ld d, 0
add hl, de
add hl, de
ld a, [hli]
ld h, [hl]
ld l, a
pop de
jp hl
Don't do this:
ld hl, Foo
ld bc, BAR
dec a
call AddNTimes
Instead, as long as you don't need to add 255 times when a is 0, do this:
ld hl, Foo - BAR
ld bc, BAR
call AddNTimes
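The caveat about a being 0 is easier to see with a model. The sketch below is Python and assumes that AddNTimes adds bc to hl exactly a times and does nothing when a is 0; that behavior, the function names, and the sample values are assumptions made for illustration.

```python
def add_n_times(hl, bc, a):
    """Assumed model of AddNTimes: add bc to hl, a times (none if a == 0)."""
    for _ in range(a):
        hl = (hl + bc) & 0xFFFF
    return hl


def with_dec_a(foo, bar, a):
    """Model `ld hl, Foo` / `ld bc, BAR` / `dec a` / `call AddNTimes`."""
    return add_n_times(foo, bar, (a - 1) & 0xFF)


def with_offset_base(foo, bar, a):
    """Model `ld hl, Foo - BAR` / `ld bc, BAR` / `call AddNTimes`."""
    return add_n_times((foo - bar) & 0xFFFF, bar, a)
```

Under that assumption the two versions agree for every a from 1 to 255. They differ only when a is 0, where dec a wraps to 255 additions while the offset version performs none.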