Assembly Language Syntax - michaelkamprath/bespokeasm GitHub Wiki

General Assembler Syntax

The generally used syntax of a line of assembly code is one of the following forms:

{label:} instruction {operands} {; comment}

constant_assignment {; comment}

directive {; comment}

{directive} label: {; comment}

{label:} data_directive {data_value} {; comment}

; {comment}

Where items in curly-braces {..} are optional.

Furthermore:

  • Comments are prefixed with the semicolon ;. Any characters after and including the first semicolon ; on any given line are considered to be comments.
  • Each line may contain multiple instructions, labels, or directives with the following limitations:
    • Only one line comment is recognized per line.
    • Instructions can only be followed by other instructions or a line comment.
    • Preprocessor directives cannot share the line with any instruction, label or other directive.
    • While there might be some specific scenarios in which multiple sequential instructions on one line bring clarity and conciseness to the assembly code, having more than one instruction per line is generally discouraged because the assembly code becomes difficult to debug and error messages can be confusing.
  • Whitespace is generally ignored except for the minimal amount required to separate parts in a line of code. Indentation can be used for code readability reasons with no impact on assembled code.
  • There is no explicit limit on line length.
  • Assembly code file extensions are not interpreted by the compiler, but they are used for syntax highlighting if supported by your editor. See Installing Language Extensions for more information.

Numeric Literals

Any time a numeric literal is to be expressed, whether it be an immediate value or a memory address, it can be written in decimal, hex, or binary form, or as a single character ordinal value, as shown here:

Type Syntax
Decimal 124
Hex $7C
Hex 0x7C
Hex 7CH
Binary b01111100
Binary %01111100
Character Ordinal '|'
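
As an illustration of the table above, here is a minimal Python sketch showing how each literal form maps to the same integer. The function name parse_numeric_literal is hypothetical and this is not BespokeASM's actual parser. It assumes the H suffix is written in upper case, as shown in the table.

```python
def parse_numeric_literal(text: str) -> int:
    """Convert one literal, in any notation from the table above, to an int."""
    s = text.strip()
    # Character ordinal: exactly one character between single quotes.
    if len(s) == 3 and s[0] == "'" and s[-1] == "'":
        return ord(s[1])
    # Hex with a $ prefix, e.g. $7C
    if s.startswith("$"):
        return int(s[1:], 16)
    # Hex with a 0x prefix, e.g. 0x7C
    if s[:2].lower() == "0x":
        return int(s[2:], 16)
    # Binary with a % prefix, e.g. %01111100
    if s.startswith("%"):
        return int(s[1:], 2)
    # Binary with a b prefix, e.g. b01111100
    if s.startswith("b") and len(s) > 1 and set(s[1:]) <= {"0", "1"}:
        return int(s[1:], 2)
    # Hex with an H suffix, e.g. 7CH (upper-case H is an assumption)
    if s.endswith("H"):
        return int(s[:-1], 16)
    # Anything else is treated as decimal; int() raises ValueError if invalid.
    return int(s, 10)


# Every row of the table denotes the same value.
forms = ["124", "$7C", "0x7C", "7CH", "b01111100", "%01111100", "'|'"]
values = [parse_numeric_literal(f) for f in forms]
```
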

Numeric Expressions

Numeric expressions that can be resolved at compile time are supported. A numeric expression can be composed of any number of explicit numeric literals, address labels, constant labels, or numeric operators. The supported operators are:

Operator Description Comment
+ Addition
- Subtraction
* Multiply
/ Divide
% Modulo So as to not be confused with a binary number, modulo operator % should be bounded by whitespace when used
& Bit-wise AND
| Bit-wise OR
^ Bit-wise XOR
BYTEx(..) Byte value Returns the value of the x-th byte of the expression contained in the parentheses. The x can only be a single digit, so BYTE0(..) through BYTE9(..) are allowed. Uses 0-based indexing where 0 is the least significant byte regardless of endian setting.
LSB(..) LSB Equivalent to BYTE0(..)
(..) Expression grouping Parentheses must be paired.

Note that numeric expressions are not the same thing as an offset for a register indirect addressing mode, though the offset value can be expressed as a numeric expression.
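
The byte-selection rule for BYTEx(..) and LSB(..) can be sketched in Python as a shift and a mask. The function names byte_of and lsb are hypothetical, and this is only an illustration of the rule described in the table, not the assembler's own code.

```python
def byte_of(value: int, index: int) -> int:
    """Return byte number `index` of `value`, where 0 is the least significant byte."""
    if not 0 <= index <= 9:
        raise ValueError("only BYTE0 through BYTE9 are allowed")
    # Shift the wanted byte down to the lowest 8 bits, then mask the rest away.
    return (value >> (8 * index)) & 0xFF


def lsb(value: int) -> int:
    """LSB(..) is described as equivalent to BYTE0(..)."""
    return byte_of(value, 0)
```

For the value $1234, byte 0 is $34 and byte 1 is $12, whichever endian setting the ISA uses.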

Single Character Ordinals

The ASCII value of a single character may be used as an integer anywhere a numeric expression can otherwise be used. To indicate a character ordinal value, the single character should be bounded by single quotes '. This will not work for multiple characters, nor can the single character be bounded by double quotes ".

Labels

A label is a string that can be resolved at compile time to a specific numeric value. Labels can be composed of alphanumeric characters and the underscore _, and cannot start with a number. All labels must be distinct. A label cannot have the same name as any directive, and non-register labels cannot have the same name as a register label.

Label Types

Address Label

An address label represents a specific address in the byte sequence being assembled. A label does not generate byte code on its own, but can be used as an instruction argument to specify a specific address value. A label's address value is implied by its relative location among the lines to be assembled.

A label is represented by any case-sensitive alphanumeric character string that immediately precedes a colon :. Only one label is allowed per line. However, a label can be followed by either a directive or an instruction on the same line. For example, this is valid:

a_label: .byte $22        ; directive on same line as label

Constant Label

A constant is a special label that has an explicitly assigned numeric value. Constants can be placed anywhere in the assembly code, as a constant's value is set only by the assigned value and not by its location. Assignment uses the following syntax based on the = sign or the EQU token:

constant_var = 10204
constant_var EQU 10204

Constants cannot be assigned a numeric expression; they must be assigned an explicit numeric literal. Constant labels are case-sensitive.

Register Labels

A register label is defined in the instruction set configuration file. It is used to represent hardware registers in operands of instructions. Note that address and constant labels cannot use a string that has been declared a register label.

Label Scope

Both address labels and constant labels can be defined to be applicable only in a given scope. A scope defines to what extent a label is visible and usable by other lines of code. The allowed scopes are:

  • Global - The label is visible and usable by all lines of code.
  • File - The label is visible and usable by only those lines of code sourced from the same file that the label is defined in. A label is made to be in the file scope when its name is prefixed by a _ character.
  • Local - The label is visible and usable by only those lines in between the same two non-local labels within the same source file that the local label is defined in between. A label is made to be in the local scope when its name is prefixed with a . character.
    • If the line of code precedes the first non-local label in a source file or is in between a .org directive and a non-local label, then local labels cannot be defined.
    • A local scope label is only definable and usable between two non-local labels in the same source file, or between a non-local label and then an .org directive in the same source file, or between a non-local label and the end of file
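
The prefix convention above amounts to a simple classification of a label name. A minimal Python sketch, with the hypothetical function name label_scope (illustration only, not BespokeASM's implementation):

```python
def label_scope(name: str) -> str:
    """Classify a label by its prefix, following the scope rules listed above."""
    if name.startswith("."):
        return "local"   # visible only between the enclosing non-local labels
    if name.startswith("_"):
        return "file"    # visible only within the defining source file
    return "global"      # visible to all lines of code
```
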

Memory Zones

A named memory zone is a contiguous address range in the allowable address space of a given ISA that has an alphanumeric string as an identifier. The following additional conditions apply:

  • A named memory zone must be completely contained by the allowed memory space of the configured ISA.
  • Multiple named memory zones may overlap with each other
  • When byte code is assembled, multiple byte codes assigned to the same absolute memory address is a fatal error.
  • Named memory zones are a compile time construct, and are intended to only be a means to manage memory ranges and byte code memory locations in assembly code.
  • Memory zones have a start and end absolute memory address. Byte code assigned to that memory zone with an absolute address outside of the memory zone's range will be an error.
  • A memory zone's name cannot be also used for any label.

Global Memory Zone

By default, a memory zone named GLOBAL is defined to be the full range of memory addresses allowed by the instruction set configuration file. For example, if the ISA defines a 16-bit address type, then the GLOBAL memory zone will be addresses 0x0000 through 0xFFFF.

The GLOBAL memory zone can be redefined in the ISA configuration to be a subset of what is permitted by the memory address bit size.

Instructions

Instructions are converted into byte code. Each instruction is composed of a specific instruction mnemonic and an optional list of operands according to this format:

MNEMONIC [OPERAND1[, OPERAND2[...]]]
  • Instruction operands are separated by a comma
  • Instruction operands supported types are configured in the Instruction Set Configuration File.
  • Instruction matching is not case sensitive.

Addressing Modes

BespokeASM supports several addressing mode notations for instruction operands, though the precise meaning of each is defined by the instruction set configuration file and the hardware that the instruction set will run on. Explained here is the nominal application of each addressing mode notation.

Mode | Notation | Description | Decorator Placement | Hardware Expectations
Immediate | numeric_expression | A constant value to be used as an operand. The constant value is indicated by a numeric expression. | - | Values embedded in program byte code should be generally readable.
Indirect | [numeric_expression] | A value that resides at a memory address indicated by a constant value. The constant value memory address is indicated by a numeric expression. | - | Ability to set a memory address register or similar.
Deferred | [[numeric_expression]] | The numeric constant value indicated by a numeric expression represents an address that holds another address, and the value of interest resides at that second address. Basically, this is a doubly dereferenced memory address. Note the use of double square brackets in the notation. | - | Ability to follow a doubly dereferenced memory address.
Register | register_label | The value in a specified register. The register is indicated by a register label. | Adjacent to register label, e.g.: a++ | Hardware registers that are generally accessible.
Indexed Register | register_label + offset_operand | Indicates a value that is the combination of the register value and the offset operand value. The combination is nominally a sum (+ operator). | - | Ability to combine a register value with any configured offset operand source.
Indirect Register | [register_label + offset] | The specified register contains the memory address where the value resides. An offset can be provided, which is added to the value in the register to get the memory address where the desired value is. The register is indicated by a register label, and the offset is provided as a numeric expression that follows the register label with a + or - sign in between. | Adjacent to square brackets, e.g.: [sc+5]++ | Hardware registers that can set the memory address used to access memory devices. In order to support offsets, there should be the ability to produce a memory address by adding a value to the register value without necessarily changing the register value.
Indirect Indexed Register | [register_label + offset_operand] | Similar to Indirect Register, except that the offset can be set by any other addressing mode operand. When the configured offset operand is a numeric type, this behaves the same as Indirect Register except that the offset can only be added (+) to the register, and there is no bounds checking on the value. The true value of this addressing mode is when the offset operand is configured to be a Register, Indirect Register, or Indirect value. | Adjacent to square brackets, e.g.: [sc+i]++ | Similar hardware needs as Indirect Register, with the general ability to set the offset value from any configured offset operand source.
Relative Address | numeric_expression or {numeric_expression} | Generates a relative address offset, which is the difference between the expression value of this operand and the address value of the current instruction. The current instruction's address value can be either the program counter value before the instruction begins or the program counter value after all machine code for the instruction has been loaded. Useful for relative jumps or data moves. Notation can be configured. | - | Should be able to do offsets against the program counter value.
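
To make the Relative Address arithmetic concrete, here is a small Python sketch of the offset calculation for both choices of reference point. The function and parameter names are hypothetical, and the instruction size used in the usage note is an assumed value. This illustrates the description above rather than the assembler's implementation.

```python
def relative_offset(target: int, instruction_address: int,
                    instruction_size: int, from_instruction_end: bool) -> int:
    """Offset from the current instruction to `target`.

    The reference point is either the program counter value before the
    instruction begins, or the value after all of its machine code is loaded.
    """
    if from_instruction_end:
        reference = instruction_address + instruction_size
    else:
        reference = instruction_address
    return target - reference
```

For a 2-byte instruction at $0100 that targets $0110, the offset is 16 when measured from the start of the instruction and 14 when measured from its end. A target that precedes the reference point yields a negative offset.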

Decorators

Addressing mode notations that support decorators are indicated in the table above. A decorator can be used to modify an operand or even the instruction's action on the operand. The precise meaning depends on how the instruction set is configured. A prefix decorator typically indicates that the modification should occur before the instruction is executed, while a postfix decorator indicates that the modification occurs after the instruction is executed. BespokeASM will recognize an operand with a decorator as a wholly different operand than if the decorator wasn't present.

For example, consider the following two instructions:

mov [a],b
mov [a]+,b

An instruction set could be configured such that the first instruction means to move the value in register b into memory at the address indicated by register a (Indirect Register addressing mode), while the second instruction, with the decorator, could mean to move the value in register b into memory at the address indicated by register a and then increment the value in register a after the value copy has occurred.

Here are the decorators that BespokeASM supports. Their specific meaning is configured in the instruction set configuration.

Decorator Typical Meaning Prefix Postfix
+ Increment a value
- Decrement a value
++ Increment a value
-- Decrement a value
! customized
@ customized

Directives

Directives tell the assembler to do specific things when creating the byte code. Directives start with a period . or a hash #.

Bytecode Addressing

Memory Zone Scope

By default, code in any given source file is assembled into the GLOBAL memory zone. To set the current memory zone scope to something different, the following directive is used:

.memzone <memory zone name>

Note that the GLOBAL memory zone name can be used with this directive. Subsequent assembly code lines will be compiled into the indicated memory zone scope until the end of the current assembly file or another directive that changes the memory zone scope. Addresses assigned to the byte code will be per the code ordering.

Non-contiguous uses of a given memory zone scope will be compiled as if the assembly code in each use instance was concatenated together in the order processed by the assembler.

If a source file that is currently using a non-GLOBAL memory zone includes another source file, that included source file will be compiled into the GLOBAL memory zone scope per normal file processing as described above. When compilation returns to the original source file that included the additional source file, compilation will continue using the same memory zone scope that was active when the #include directive was processed. This means that source files must always declare any non-GLOBAL memory zone scope they wish to use, and such declarations only persist for the scope of that source file.

Relative Origin within a Memory Zone

A relative origin within a memory zone can be set with the .org directive:

.org <address offset value> "<memory zone name>"

Where <address offset value> is the positive offset from the start of the specified memory zone, and <memory zone name> is the optional name of a memory zone. The <memory zone name> value is enclosed in quotes to set it apart from the <address offset value>, especially if that value is given as an expression.

As an example, if a memory zone named "variables" is defined to be the range of 0x2000 through 0x2FFF, then:

.org 0x0100 "variables"

Would be the same as setting the current origin to an absolute value of 0x2100.

When using GLOBAL as the <memory zone name>, the <address offset value> will be interpreted as an offset from the start of the GLOBAL memory zone, as it would with any other named memory zone. If the GLOBAL memory zone has not been redefined, the net effect is the same as using .org with an absolute address. However, if the start address of the GLOBAL memory zone has been redefined in the ISA configuration file, then <address offset value> will be applied as an offset from the redefined start of GLOBAL.

The effective absolute address represented in this form of the .org directive is validated against the overall valid address range defined by the GLOBAL memory zone. If the absolute address is outside this range, then BespokeASM will emit an error.

Absolute Origin

Using the .org directive without specifying a <memory zone name> will cause the <address offset value> to be interpreted as an absolute address.

For example:

.org $3400

Will set the current address to $3400. This is an absolute address value and not an offset to the GLOBAL memory zone.

The address represented in this form of the .org directive is validated against the overall valid address range defined by the GLOBAL memory zone. If the address is outside this range, then BespokeASM will emit an error.
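
The two forms of .org can be summarized with a short Python sketch that resolves a directive to an absolute address and validates it against GLOBAL. The zone table, the function name resolve_org, and the use of ValueError are all assumptions made for illustration; only the "variables" range and the addresses come from the examples above.

```python
# Hypothetical zone table: name -> (start address, end address), both inclusive.
ZONES = {
    "GLOBAL": (0x0000, 0xFFFF),
    "variables": (0x2000, 0x2FFF),
}


def resolve_org(value: int, zone_name=None, zones=ZONES) -> int:
    """Return the absolute address selected by `.org value` or `.org value "zone"`."""
    if zone_name is None:
        # No memory zone named: the value is an absolute address.
        address = value
    else:
        # Memory zone named: the value is an offset from the zone's start.
        zone_start, _zone_end = zones[zone_name]
        address = zone_start + value
    # Both forms are validated against the range of the GLOBAL memory zone.
    global_start, global_end = zones["GLOBAL"]
    if not global_start <= address <= global_end:
        raise ValueError(f"address {address:#06x} is outside the GLOBAL memory zone")
    return address
```

With this table, .org 0x0100 "variables" resolves to 0x2100 and .org $3400 resolves to 0x3400. If GLOBAL were redefined to start at 0x1000, then .org 0x0100 "GLOBAL" would resolve to 0x1100, while .org 0x0100 without a zone name would be rejected as out of range.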

Bytecode

There are a few byte code generation directives supported:

Directive Description
.fill N, Y Fills the next N bytes with the byte value Y
.zero N Shorthand for .fill N, 0
.zerountil X Fills the next bytes up to and including address X with the value of 0. Will emit nothing if address X is less than the address location of this directive.
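
The byte counts these directives produce can be sketched in Python as follows. The function names are hypothetical, and masking the fill value to a single byte is an assumption. The .zerountil count follows from the "up to and including address X" wording above.

```python
def fill(count: int, value: int) -> bytes:
    """.fill N, Y: N bytes, each holding the byte value Y."""
    return bytes([value & 0xFF]) * count  # masking to one byte is an assumption


def zero(count: int) -> bytes:
    """.zero N: shorthand for .fill N, 0."""
    return fill(count, 0)


def zerountil(current_address: int, until_address: int) -> bytes:
    """.zerountil X: zeros from the directive's address up to and including X."""
    if until_address < current_address:
        return b""  # X is below the directive's address, so nothing is emitted
    return bytes(until_address - current_address + 1)
```

For instance, a .zerountil $1F located at address $10 would emit 16 zero bytes, covering $10 through $1F.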

Data

A data directive allows byte code to be set explicitly. Like an instruction, its relative position in the assembly code defines its memory address, but unlike an instruction, the byte code emitted is directly defined in the assembly code. When paired with a label, a data directive can be used to define variables and other memory blocks.

The data directives have several forms, each indicating how much data is being defined:

Directive Data Value Size Data Length Endian
.byte 1 byte Variable N/A
.2byte 2 bytes Variable Default
.4byte 4 bytes Variable Default
.8byte 8 bytes Variable Default
.cstr 1 byte Variable N/A
.asciiz 1 byte Variable N/A

The syntax of usage is simply the directive followed by the data values to be written. More than one value can be provided by a comma separated list of values or labels/constants. The value assembled into the byte code will be masked by the data value size of the directive.

The .byte, .cstr, and .asciiz directives can be used to define character strings delineated by a " or '. Quotes and apostrophes within the quoted string should be escaped. The data values generated will be the ASCII values for each character in the string. Python-style character escapes (e.g., \t, \n, \x21) can be used. The .cstr and .asciiz directives can be used only with strings and will append a configurable byte value to the end of the string. This terminating byte value defaults to zero (0), but can be configured to be a different value.

For multi-byte types (.2byte, .4byte, etc), the endian representation of each individual value uses the configured default endianness specified in the instruction set configuration file.

This example includes a label to be used to make the data's address usable elsewhere in the assembly code:

const_value = $BE

single_bytes:
    .byte $DE
    .byte $AD
    .byte const_value
    .byte $EF
byte_list:
    .byte $DE, 0xAD, const_value, $EF
str_with_no_terminating_null:
    .byte "It\'s a test string"
str_with_terminating_null:
    .cstr "It\'s a test string"

int16_value:
	.2byte $dead, $beef

int32_value:
	.4byte $deadbeef
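
The masking, endianness, and string-termination rules for data directives can be sketched in Python. The function names are hypothetical, and the "little" default is an arbitrary choice for this sketch, since the real default comes from the instruction set configuration file.

```python
def encode_values(values, size: int, endian: str = "little") -> bytes:
    """Encode each value as `size` bytes, masking to the directive's value size."""
    mask = (1 << (8 * size)) - 1
    out = bytearray()
    for value in values:
        # Bits beyond the directive's data value size are masked off.
        out += (value & mask).to_bytes(size, endian)
    return bytes(out)


def encode_cstr(text: str, terminator: int = 0) -> bytes:
    """.cstr / .asciiz: ASCII values of the string plus a terminating byte."""
    return text.encode("ascii") + bytes([terminator])
```

Under these assumptions, .2byte $dead, $beef with little endian yields the bytes AD DE EF BE, and .byte $1FF is masked to FF.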

Preprocessor

Include Other Files

Additional assembly files other than the target file indicated in the command invocation can be included in the compilation. This is done with the #include preprocessor directive. The specific format is:

#include "filename.asm"

Where filename.asm is the name of the file desired to be included. BespokeASM will search the include directories to find a file with the indicated filename. The include directory list includes the directory that contains the target file identified on command invocation, and any additional include directories identified by arguments to the command invocation.

When an assembly file is included by this directive, it is functionally equivalent to the contents of the included file being present where the #include directive is. If .org directives are used in the included file, care should be taken that the addresses of instructions do not collide between source files. BespokeASM will error if it detects that two or more instructions occupy the same address.

The inclusion of assembly files can be nested. However, BespokeASM will error if any given file ends up being included more than once.

Require Language Version

An assembly source file can require a check of the assembly language version, as identified by the identifier key of the general section of the assembly language configuration file being used for compilation. This is done using a #require preprocessor directive. The specific format is:

#require "language-id comparator version-string"

where:

  • language-id is the language name value in the identifier block of the general configuration section.
  • comparator is a comparison operator, such as >=, >, ==, etc. The most common comparison operator will be >=.
  • version-string is a semantic version string, e.g. 1.2.3

The version check is done at the moment the line with the #require preprocessor directive is processed. This means any given code file can have multiple #require checks. This is useful if you want to enforce a version range. For example:

#require "test-lang >= 0.5.0"
#require "test-lang < 1.0.0"

This would require that the configuration file being used for compilation be for the language with the name test-lang and be a version between 0.5.0 inclusive and 1.0.0 exclusive.
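
A version-range check of this kind can be sketched in Python by comparing versions as tuples of integers. This is an illustration only: the function names are hypothetical, the comparator set beyond >=, > and == is an assumption, and the sketch handles only plain numeric versions such as 1.2.3, not the pre-release or build tags of full semantic versioning.

```python
import operator

# Comparators assumed for this sketch; the text above names >=, >, == "etc.".
COMPARATORS = {
    ">=": operator.ge,
    ">": operator.gt,
    "==": operator.eq,
    "<=": operator.le,
    "<": operator.lt,
}


def parse_version(text: str) -> tuple:
    """Turn '1.2.3' into (1, 2, 3) so that comparison is numeric, not textual."""
    return tuple(int(part) for part in text.split("."))


def check_require(requirement: str, language_id: str, language_version: str) -> bool:
    """Evaluate a requirement such as 'test-lang >= 0.5.0' against the configured language."""
    name, comparator, version = requirement.split()
    if name != language_id:
        return False
    compare = COMPARATORS[comparator]
    return compare(parse_version(language_version), parse_version(version))
```

Comparing tuples rather than strings matters for versions like 0.10.0, which is numerically greater than 0.5.0 but sorts lower as text.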

Creating Memory Zones

A memory zone can be defined with the following directive:

#create_memzone <memory zone name> <start address> <end address>

Where <memory zone name> is an alphanumeric string with no spaces which will serve as the memory zone name, <start address> is the absolute address of the start of the memory zone, and <end address> is the absolute address of the end of the memory zone. Both <start address> and <end address> must be defined with integer literals.

Any defined memory zone must be fully contained in the GLOBAL memory zone. Defining multiple memory zones with the same name is an error.

Macros

C-like preprocessor macros can be defined with the #define directive. The syntax is:

#define <symbol> <value>

Where <symbol> is the symbol for the macro that can be used elsewhere in the code, and <value> is the replacement value for wherever that symbol is used. The <value> can be left empty, which is equivalent to assigning an empty string to be the replacement value for the <symbol>. If the replacement value is intended to be interpreted as a string, it should be quoted with either single or double quotes.

Later, when a defined <symbol> is used in code, BespokeASM will immediately replace the <symbol>'s text with the defined <value> for that <symbol>. If the <symbol> is not defined when that line of code is read, then no replacement occurs. Symbol replacement is done recursively, so if one preprocessor macro symbol's replacement value is a string that contains another preprocessor macro symbol, that second preprocessor macro symbol is then replaced. This continues until no further symbol replacement occurs. An error will be generated if a symbol replacement loop is detected (e.g., symbol A is replaced by symbol B, which itself is replaced by symbol A).

Note that parametric preprocessor macro symbols (e.g., FOO(x)) are not allowed. While this feature can be used to create simple code macros, complex and even parametric macros should be created with the Instruction Macros feature. Similarly, these preprocessor macros can be used to define constants in code, but the constant label feature is a better way to do that. The primary use case for preprocessor macros is to define symbols that can be used in compilation control.
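
The recursive replacement and loop detection described above can be sketched in Python. The function name expand_macros and the use of ValueError are hypothetical, and the sketch makes a simplifying assumption that symbols are whole identifier-like words. It also replaces symbols inside quoted strings, which the real preprocessor may not do.

```python
import re

# A symbol is assumed to be a whole identifier-like word.
SYMBOL = re.compile(r"\b[A-Za-z_][A-Za-z0-9_]*\b")


def expand_macros(line: str, macros: dict) -> str:
    """Recursively replace defined symbols in `line`, raising on a replacement loop."""

    def substitute(text: str, active: frozenset) -> str:
        def replace(match):
            symbol = match.group(0)
            if symbol not in macros:
                return symbol  # undefined symbols are left untouched
            if symbol in active:
                raise ValueError(f"replacement loop detected at symbol {symbol}")
            # Expand the replacement value too, remembering which symbols are in progress.
            return substitute(macros[symbol], active | {symbol})

        return SYMBOL.sub(replace, text)

    return substitute(line, frozenset())
```

With the hypothetical definitions A as "B + 1" and B as "2", the line "mov a, A" expands to "mov a, 2 + 1". Defining A as "B" and B as "A" raises an error instead of recursing forever.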

Compilation Control

Control over which lines of assembly get compiled can be done with the C-like #if, #elif, #else, #ifdef, #ifndef, and #endif preprocessor directives. Individual lines of assembly are bracketed by compilation control preprocessor directives, which control whether those lines of assembly will be compiled or not. Each can be used as follows:

  • #if <symbol-expression> <comparison> <symbol-expression> - Initiates a compilation control block by performing a comparison between two symbol expressions.
  • #if <symbol-expression> - Initiates a compilation control block by comparing a symbol expression to the implied comparison of != 0
  • #elif <symbol-expression> <comparison> <symbol-expression> - Optional. Must follow an #if or #elif compilation control directive in sequence. Creates a subordinate condition block by performing a comparison between two symbol expressions.
  • #elif <symbol-expression> - Optional. Must follow an #if or #elif compilation control directive in sequence. Creates a subordinate condition block by comparing a symbol expression to the implied comparison of != 0.
  • #ifdef <symbol> - Initiate a compilation control block by determining if a preprocessor macro symbol is defined
  • #ifndef <symbol> - Initiate a compilation control block by determining if a preprocessor macro symbol is not defined
  • #else - Optional. Creates a subordinate condition block that evaluates true only if all previous subordinate condition blocks in the overall compilation control block are false. Must follow a #if, #elif, #else, #ifdef, or #ifndef.
  • #endif - Terminates a compilation control block. Must come at the end of a compilation control block.

A <symbol-expression> is an expression that uses preprocessor macros, numeric values, and mathematical operators, but not code labels (address labels, constants, etc). Symbol expressions used in compilation control preprocessor directives must resolve to numeric or string values.

As an example, consider the following lines of code:

#define SYMBOL1 "test-string"
#define SYMBOL2 57

#if SYMBOL1 == "my string"
    mov a,1
#elif SYMBOL2 > 50
    mov a,2
#else
    mov a,3
#endif

In this example, only the mov a,2 line will get compiled.

Examples

Ben Eater SAP-1

The following example uses the instruction set for Ben Eater's SAP-1 Breadboard CPU.

; Count by Loop
;
; For the Ben Eater SAP-1 breadboard CPU
;

zero = 0              ; constant value for 0
one = 1               ; constant value for 1

start:
  ldi zero            ; load value of 0 into A
  out                 ; display

add_loop:
  add increment       ; add current value at 0xF to A
  jc increment_step   ; increment the step if overflow
  out                 ; display
  jmp add_loop        ; loop

increment_step:
  lda increment       ; load current increment value
  add one_value       ; add 1 to increment value
  jc restart_loops    ; if it overflows, just reset everything
  sta increment       ; save updated increment value
  jmp start           ; restart counting

restart_loops:
  ldi one             ; load the value of 1 into register A
  sta increment       ; reset the increment value to 1
  jmp start           ; restart counting

one_value:
  .byte 1             ; 1 value needed for incrementing the increment value

increment:
  .byte 1             ; storage for the current increment value

Recursion with Subroutines

Here is an example that employs an instruction set that enables subroutines (call, rts), a stack (push, pop), and indirect addressing modes. It uses 16-bit addressing and little endian. The example configuration file for this instruction set is here. It also assumes a memory map in which $0000 is the start of ROM and $8000 is the start of RAM.

;
; Variables
;

.org $8000           ; variables should be in RAM
n_value:
  .byte 5            ; N value to calculate factorial for

;
; Code
;

.org 0               ; code goes in ROM 
start:
  push [n_value]     ; push the value at n_value onto the stack
  call factorial     ; jump to the factorial subroutine
  out                ; factorial results are in A register. display it
  hlt                ; done

; factorial subroutine
;
; Input:
;   stack - function return pointer
;   stack+2 - The input N value to calculate factorial. A single 8-bit value
;
; Output:
;   A register - the results of the factorial calculation. A single 8-bit value
;
; Registers used: A
;
factorial:
  mov [sp+2],a      ; copy the N value to A register
  je .end,1         ; jump to .end if A is 1
  sub 1             ; subtract 1 from A to get (N-1)
  push a            ; put the n-1 value on the stack
  call factorial    ; recurse into factorial
  pop               ; remove the (N-1) value from stack
  push [sp+2]       ; push the N value on the stack
  push a            ; push the factorial(n-1) results on stack
  call multiply     ; call multiply subroutine
  pop               ; pop factorial(n-1) from stack
  pop               ; pop N-value from stack
.end:               ; local-scope label indicating the end of the subroutine
  rts               ; return from subroutine. Register A contains factorial(N)

; multiply subroutine
;
; Input:
;   stack - function return pointer
;   stack+2 - A single 8-bit value to multiply
;   stack+3 - A single 8-bit value to multiply
;
; Output:
;   A register - the results of the multiply calculation. A single 8-bit value
;
; Registers used: A, B, I
;
multiply:
  mov [sp+2],a     ; copy the multiplicand to A
  je .zero,0       ; jump to zero handler if multiplicand is 0
  mov a,b          ; copy multiplicand to B to set up for add loop
  mov [sp+3],i     ; copy multiplier to I
  dec i            ; decrement I for 0-based loop
  jc .zero         ; was multiplier zero? If so, carry was set on the dec so jump to .zero
.loop:             ; local scope label indicating the start of the summation loop
  jz .end          ; jump to .end if multiplier counter is now zero
  add b            ; add b to a
  dec i            ; decrement multiplier counter
  jmp .loop        ; restart addition loop
.zero:             ; local scope label indicating when a 0-multiplicand is handled
  mov a,0          ; set the return value to zero
.end:              ; local-scope label indicating the end of the subroutine
  rts              ; return from subroutine