Critical Technical Details for Magic-1 Programmers
- 16-bit architecture with big-endian byte order
- Three primary registers: `a`, `b`, `c` (general purpose)
- Special registers: `dp` (data pointer), `sp` (stack pointer), `pc` (program counter), `msw` (machine status word/flags), `ptb` (page table base)
- Page size: 2048 bytes (2KB)
- Stack: Grows downward, typically initialized at 0x8000
- Machine ID: 76 (defined as `MAGIC1` in system headers)
- Memory Model: Segmented with separate code and data spaces
- Virtual Memory: Implemented through page tables (separate for code and data)
- Protection Domains: User vs. system space, with explicit cross-domain instructions
- Page Table Management: Uses `wdpte` and `wcpte` instructions for mapping
- Protection Control: MSW bit 0x80 toggles paging on/off
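With 2KB pages, a 16-bit virtual address splits into a 5-bit page number and an 11-bit offset. The C sketch below only illustrates that arithmetic; the names `m1_page_number` and `m1_page_offset` are hypothetical and not taken from any Magic-1 header.

```c
#include <stdint.h>

#define M1_PAGE_SIZE   2048u                /* from the text: 2KB pages */
#define M1_PAGE_SHIFT  11u                  /* log2(2048) */
#define M1_PAGE_MASK   (M1_PAGE_SIZE - 1u)  /* 0x07FF */

/* Virtual page number: upper 5 bits of a 16-bit address (0..31). */
unsigned m1_page_number(uint16_t vaddr) {
    return (unsigned)(vaddr >> M1_PAGE_SHIFT);
}

/* Byte offset inside the page: lower 11 bits (0..2047). */
unsigned m1_page_offset(uint16_t vaddr) {
    return (unsigned)(vaddr & M1_PAGE_MASK);
}
```

For example, the typical initial stack address 0x8000 falls at offset 0 of page 16, so the first push lands at the top of page 15.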
operation.size destination,source[,branch_target]
Examples:
ld.8 a,0x23 ; 8-bit load immediate
add.16 a,b ; 16-bit addition
cmpb.eq.8 a,b,label ; Compare and branch if equal
- Immediate: `ld.16 a,0x1234`
- Register Indirect: `ld.8 a,0(b)`
- Base+Displacement: `ld.16 a,44(b)`
- Data Pointer Relative: `ld.16 a,513(dp)`
- PC-Relative: `lea a,123(pc)`
- Z (0x1): Zero result
- N (0x2): Negative result
- C (0x4): Carry
- V (0x8): Overflow
- Paging Control: 0x80 bit
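To make the flag bits concrete, here is a host-side C sketch that computes which of Z, N, C and V a 16-bit addition would set. The bit values come from the list above. The function name `m1_add16_flags` is hypothetical, and the flag definitions are the usual two's-complement ones, which is an assumption about Magic-1's ALU rather than something this page states.

```c
#include <stdint.h>

#define MSW_Z 0x1u  /* Zero result */
#define MSW_N 0x2u  /* Negative result */
#define MSW_C 0x4u  /* Carry */
#define MSW_V 0x8u  /* Overflow */

/* Flags a 16-bit add of x and y would produce (conventional definitions). */
unsigned m1_add16_flags(uint16_t x, uint16_t y) {
    uint32_t wide = (uint32_t)x + (uint32_t)y;
    uint16_t sum = (uint16_t)wide;
    uint16_t same_sign_in = (uint16_t)~(x ^ y);   /* operands share a sign */
    uint16_t sign_changed = (uint16_t)(x ^ sum);  /* result sign differs from x */
    unsigned flags = 0;

    if (sum == 0)        flags |= MSW_Z;
    if (sum & 0x8000u)   flags |= MSW_N;
    if (wide > 0xFFFFu)  flags |= MSW_C;
    if (same_sign_in & sign_changed & 0x8000u) flags |= MSW_V;
    return flags;
}
```

Adding 1 to 0x7FFF, for instance, yields 0x8000 with N and V set, while adding 1 to 0xFFFF wraps to zero with Z and C set.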
- Native C Compiler: `clcc` (Magic-1's native compiler)
- Host C Compiler: `gcc -m32` for cross-development
- Assembler: `m1_as` (host) / `as` (native)
- Linker: `m1_ld` (host) / `ld` (native)
- Archiver: `m1_ar` (host) / `ar` (native)
- Library Indexer: `m1_ranlib` (host) / `ranlib` (native)
- Format: Modified a.out variant
- Magic Numbers:
  - `OMAGIC` (0x107): Object files/impure executables
  - `NMAGIC` (0x108): Pure executables
  - `ZMAGIC` (0x10B): Demand-paged executables
- Header Flags:
  - `A_EXEC` (0x10): Executable file
  - `A_SEP` (0x20): Separate I/D spaces
  - `A_PAL` (0x02): Page aligned
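A host-side tool that inspects Magic-1 binaries might turn these constants into readable names. The following is a minimal sketch using only the values listed above; the helper names `m1_magic_name` and `m1_has_split_id` are hypothetical, and the sketch deliberately says nothing about where the fields sit in the header.

```c
#include <stdint.h>

#define OMAGIC 0x107u  /* Object files/impure executables */
#define NMAGIC 0x108u  /* Pure executables */
#define ZMAGIC 0x10Bu  /* Demand-paged executables */
#define A_PAL  0x02u   /* Page aligned */
#define A_EXEC 0x10u   /* Executable file */
#define A_SEP  0x20u   /* Separate I/D spaces */

/* Map a magic number to its symbolic name. */
const char *m1_magic_name(uint16_t magic) {
    switch (magic) {
    case OMAGIC: return "OMAGIC";
    case NMAGIC: return "NMAGIC";
    case ZMAGIC: return "ZMAGIC";
    default:     return "unknown";
    }
}

/* Non-zero when the header flags request separate code and data spaces. */
int m1_has_split_id(unsigned flags) {
    return (flags & A_SEP) != 0;
}
```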
- Compile: `clcc -c source.c` → object file
- Link: `ld crt0.o objects... -lc -le crtn.o` → executable
- Index libraries: `ar rc lib.a objects... && ranlib lib.a`
- Inspect: `size`, `dis`, `header` to analyze binaries
- Host tools prefixed with `m1_` (e.g., `m1_as`, `m1_ld`)
- Byte-swapping required (Magic-1 is big-endian, most hosts are little-endian)
- 32-bit host compilation (`-m32`) for compatibility with Magic-1's memory model
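One way to meet the byte-swapping requirement on any host is to read and write 16-bit values one byte at a time instead of casting pointers. The helpers below are an illustrative sketch; `m1_get16` and `m1_put16` are hypothetical names, not part of the host tools.

```c
#include <stdint.h>

/* Read a big-endian 16-bit value from a byte buffer. */
uint16_t m1_get16(const unsigned char *p) {
    return (uint16_t)(((unsigned)p[0] << 8) | (unsigned)p[1]);
}

/* Write a 16-bit value to a byte buffer in big-endian order. */
void m1_put16(unsigned char *p, uint16_t v) {
    p[0] = (unsigned char)(v >> 8);    /* high byte first */
    p[1] = (unsigned char)(v & 0xFFu); /* low byte second */
}
```

Because the code never reinterprets memory as a host `short`, it behaves the same on little-endian and big-endian hosts.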
- crt0.o: Standard C runtime initialization
- bcrt0.o: Basic/minimal runtime (smaller footprint)
- mcrt0.o: Monitor-specific runtime (ROM boot)
- xcrt0.o: Extended runtime for bootloaders
- crtn.o: Runtime termination code
- ROM: Typically 0x0000-0x3FFF (16KB)
- RAM: Starting at 0x4000
- Stack: Typically at 0x8000, growing downward
- Heap: Follows program data section
- Device I/O: Memory mapped at high addresses (e.g., UART0 at 0xFFF0-0xFFF7)
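The layout above can be summarized as a small address classifier. This is only a sketch: the 0xFF00 start of the I/O window is taken from the I/O address range quoted elsewhere on this page, treating everything between ROM and that window as RAM is an assumption, and the names are hypothetical.

```c
#include <stdint.h>

typedef enum { M1_REGION_ROM, M1_REGION_RAM, M1_REGION_IO } m1_region_t;

/* Classify a 16-bit address using the typical map described above. */
m1_region_t m1_region(uint16_t addr) {
    if (addr <= 0x3FFFu) return M1_REGION_ROM;  /* 0x0000-0x3FFF: 16KB ROM */
    if (addr >= 0xFF00u) return M1_REGION_IO;   /* assumed I/O window */
    return M1_REGION_RAM;                       /* 0x4000 and up */
}
```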
- Arguments passed on stack
- Return values in register `a`
- Registers may need preservation across calls
- Stack frames created with the `enter` instruction, format:

      call function ; Push return address and jump
      enter 4 ; Create 4-byte stack frame
- 6 hardware interrupt levels (IRQ0-IRQ5)
- System call interface via interrupt mechanism
- Vector table initialized at program start
- Exceptions: overflow, privilege violation, breakpoint
- libc.a: Standard C library
- libm.a: Math functions (must link with `-lm`)
- libfp.a: Software floating-point implementation
- libe.a: Extended/hardware-specific functions
- libsys.a: System call interfaces
- libcurses.a: Terminal manipulation
- libd.a: Debugging support
- liby.a: YACC parser support
- Memory Allocator: Uses boundary-tag design, 2-byte overhead per block
- I/O System: Standard POSIX file operations (open, close, read, write)
- String Functions: Optimized for 16-bit architecture
- Floating Point: Software implementation of IEEE-754 (no hardware FPU)
- Terminal I/O: POSIX/Minix compatible interface
# Proper linking order is crucial:
m1_ld crt0.o user_objects... -lspecialized -lc -lm -le crtn.o
- Runtime initialization (crt0.o) must come first
- User objects follow
- Libraries in order of dependence
- Runtime termination (crtn.o) comes last
_PROTOTYPE( int _syscall, (int who, int syscallnr, message *msgptr) );
- Message-passing architecture for IPC and system calls
- System servers:
- MM (0): Memory manager
- FS (1): File system
- HARDWARE (-1): Hardware interaction
- SYSTASK (-2): Internal system functions
- Error codes use the `_SIGN` prefix (`EIO` = `(_SIGN 5)`)
- Return -1 and set `errno` on errors
- Error messages in errno.h
- Minix-compatible filesystem (V1 and V2 formats)
- Directory Entries:
- V7 format: 14-character filenames
- Flexible format: Up to 60-character filenames
- File Limits:
  - Maximum 20 open files (`FOPEN_MAX`)
  - Standard POSIX file access flags (O_RDONLY, O_CREAT, etc.)
- Maximum 20 concurrent processes (`NR_PROCS`)
- System exit modes:

      #define RBT_HALT 0 /* Halt system */
      #define RBT_REBOOT 1 /* Reboot system */
      #define RBT_PANIC 2 /* System panic */
      #define RBT_MONITOR 3 /* Return to monitor */
      #define RBT_RESET 4 /* Hard reset */
- Standard syntax with size-specific operations (`.8`/`.16` suffixes)
- Directives: `.cseg`, `.dseg`, `.defw`, `.defb`
- Produces object files for linking
- Creates and maintains `.a` library archives
- Standard Unix `ar` command set (d, r, q, t, p, m, x)
- Archive files must be indexed with `ranlib` before linking
- Sampling-based performance analysis
- Options:
  - `-f <program>`: Profile a command
  - `-p <pid>`: Attach to process
  - `-s`: Profile system processes
  - `-k`: Profile kernel
- `analyze` tool processes the profile data
- Converts binaries back to assembly code
- Useful for debugging and code inspection
- Supports a.out format files
- Displays section sizes of object/executable files
- Shows text, data, bss sizes in decimal and hex
- Essential for memory footprint optimization
- Removes symbol tables and relocation information
- Reduces executable size for deployment
- Use with caution: removes debugging information
- Examines and modifies executable headers
- Can set/clear flags like separate I/D spaces
- Creates index for archive libraries (`.a` files)
- Must be run after modifying archives
- Essential for library symbol resolution
- Tight memory constraints require careful allocation
- Default heap increment only 1KB (`BRKSIZE`)
- Minimize stack usage in recursive functions
- Prefer static allocation for fixed-size structures
- Use register operations where possible
- Leverage `lea` for pointer arithmetic
- Consider alignment for 16-bit operations
- Profile code to identify hotspots
- Be aware of endianness differences (Magic-1 is big-endian)
- Use conditional compilation (`__MAGIC1__`) for platform-specific code
- Test on both host and native environments
- Use `libd.a` for advanced debugging support
- Generate memory maps with linker `-m` flag
- Preserve symbol information during development
- Consider using debug macros that compile out in production
- Stack overflow (limited stack space)
- Unaligned 16-bit access causes errors
- Improper library linking order causes symbol resolution issues
- Missing `ranlib` on modified libraries
- Cross-domain memory access without proper instructions
- ROM bootloader (0x0000) initializes hardware
- Loads image from CF card based on boot table
- Sets up memory paging and stack
- Transfers control to loaded image via `reti`
- Interactive command shell for hardware access
- Memory examination and modification
- Program execution control
- OS bootstrapping capability
- Second-stage bootloader for Minix
- Filesystem access for loading kernel
- Custom runtime environment (`xcrt0.o`)
- Minix kernel typically loaded at 0x8000
- Hardware access via memory-mapped I/O
- Serial port access at 0xFFF0-0xFFF7
- IDE/CF access for storage
- Memory protection through page tables
This comprehensive reference covers the essential technical details that Magic-1 programmers need to understand for effective development. The Magic-1 architecture combines a 16-bit design with modern concepts like virtual memory and protection domains, presenting unique challenges and opportunities for efficient programming.
- Branch Optimization: AS automatically optimizes branch distances, converting long branches to short when possible
- Local Labels: Supports numeric local labels (e.g., `1:` and `2:`)
- Operator Support: Full set of arithmetic operators (+, -, *, /, %) for constant expressions
- Macro Parameters: Supports up to 9 macro parameters with positional substitution
- Alignment Control: `.align` directive forces code/data to specified boundaries (critical for 16-bit operations)
- Listing Format:

      00000010 7A00 2000 br __entry
      00000012 E400 0000 .defw 0x0000
- Map File Generation (`-m`): Creates detailed memory map with all symbols
- Origin Setting (`-o address`): Specifies starting address for code section
- Data Origin (`-d address`): Sets starting address for data section
- Split I/D (`-s`): Enforces separate code/data spaces
- Join Code/Data (`-j`): Forces unified memory model
- Symbol References (`-u symbol`): Forces inclusion of external symbol
- Strip Symbols (`-x`): Removes local symbols but keeps globals
- Library Path (`-L path`): Adds directory to library search path
- Profiling (`-p`): Enables support for execution profiling
- Symbol Resolution: Automatically labels addresses with symbol names if available
- ASCII Display: Shows ASCII representation of byte data where appropriate
- Data Analysis: Auto-detects data vs. code sections
- Address Format Control: Can display absolute or relative addresses
- Output Options: Can generate output suitable for reassembly
- Pattern Recognition: Identifies common instruction patterns (e.g., function prologues)
- Binary Formats: Handles both OMAGIC and ZMAGIC executable formats
- Flag Analysis: `header -d program` shows detailed header breakdown
- SEP Flag: `header -s SEP file` sets separate I/D flag for shared object compatibility
- PAL Flag: `header -s PAL file` enables page alignment for demand paging
- EXEC Flag: `header -s EXEC file` marks file as executable
- Magic Modification: `header -m 0x10B file` changes magic number (e.g., to ZMAGIC)
- Entry Point: `header -e 0x2000 file` changes entry point address
- Call Graph Generation: Can produce function call graphs
- Time Distribution: Shows percentage of execution time per function
- Instruction Counting: Tracks instruction execution frequency
- Memory Access Patterns: Can monitor memory read/write patterns
- Custom Sample Rate: Configurable sampling frequency for performance tuning
- Kernel Profiling: Special mode for profiling kernel execution
- Task-Specific Profiling: Can target specific Minix tasks (TTY, FS, MM)
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|V|W|P|X|0|0|    Page Number    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 ^ ^ ^ ^     +-------------------+
 | | | |               |
 | | | +-- Execute     +-- Physical page number (10 bits as drawn, 0-1023)
 | | +---- Present
 | +------ Writable
 +-------- Valid
- Text Pages: Typically V=1, W=0, P=1, X=1 (read-execute)
- Data Pages: Typically V=1, W=1, P=1, X=0 (read-write)
- User Space: Accessed via user page table base register
- System Space: Accessible only when running in system mode
- Page Fault: Generated when accessing pages with P=0 or V=0
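The entry layout can be expressed as a few masks. The sketch below assumes the leftmost cell of the diagram is bit 15 and that the page-number field is the 10 low bits as drawn; the names `pte_make`, `pte_page` and `pte_faults` are hypothetical and are not the interface of the `wdpte`/`wcpte` instructions.

```c
#include <stdint.h>

#define PTE_V          0x8000u  /* Valid (assumed bit 15) */
#define PTE_W          0x4000u  /* Writable */
#define PTE_P          0x2000u  /* Present */
#define PTE_X          0x1000u  /* Execute */
#define PTE_PAGE_MASK  0x03FFu  /* 10-bit physical page number as drawn */

/* Build an entry from flag bits and a physical page number. */
uint16_t pte_make(unsigned flags, unsigned page) {
    return (uint16_t)((flags & 0xF000u) | (page & PTE_PAGE_MASK));
}

/* Extract the physical page number. */
unsigned pte_page(uint16_t pte) {
    return pte & PTE_PAGE_MASK;
}

/* Non-zero if an access through this entry would page-fault (P=0 or V=0). */
int pte_faults(uint16_t pte) {
    return !(pte & PTE_V) || !(pte & PTE_P);
}
```

A typical text page mapped to physical page 5 would then be `pte_make(PTE_V | PTE_P | PTE_X, 5)`, which is 0xB005 under these assumptions.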
Address Range | Device | Registers |
---|---|---|
0xFFF0-0xFFF7 | UART0 | RX, TX, Status, Control |
0xFFB0-0xFFBF | IDE/CF Controller | Data, Error, Count, Sector, etc. |
0xFFA0-0xFFA7 | Timer | Counter, Status, Control |
0xFF90-0xFF97 | Parallel Port | Data, Status, Control |
0xFF80-0xFF87 | Interrupt Control | Mask, Status, EOI |
- -O0 to -O3: Optimization levels (default is -O0)
- -Wf-g: Generate debug information
- -Wf-pg: Enable profiling
- -Wa-l: Generate assembly listing
- -Wl-m: Generate linker map
- -Wf-DP=val: Define preprocessor symbol
- -S: Generate assembly output instead of object file
- -I: Add include directory
- -D_MINIX: Enable Minix-specific code
- -D_POSIX_SOURCE: Enable POSIX compliance
#pragma align 2 // Force 2-byte alignment
#pragma optimize // Enable optimizer for function
#pragma no_optimize // Disable optimizer for function
#pragma regparam // Pass parameters in registers when possible
#pragma stackparam // Force parameters on stack
#pragma inline // Attempt to inline function
#pragma no_warn // Suppress warnings
typedef unsigned short u16_t; /* 16-bit unsigned */
typedef signed short s16_t; /* 16-bit signed */
typedef unsigned char u8_t; /* 8-bit unsigned */
typedef signed char s8_t; /* 8-bit signed */
typedef unsigned long u32_t; /* 32-bit unsigned */
typedef signed long s32_t; /* 32-bit signed */
typedef u16_t size_t; /* Memory size type */
typedef s16_t ssize_t; /* Signed size type */
typedef u16_t uid_t; /* User ID */
typedef u16_t gid_t; /* Group ID */
typedef u16_t dev_t; /* Device number */
// Direct system call (low-level)
int _syscall(int who, int syscallnr, message *msgptr);
// Standard library POSIX wrappers
int open(const char *path, int flags, ...);
ssize_t read(int fd, void *buf, size_t count);
ssize_t write(int fd, const void *buf, size_t count);
off_t lseek(int fd, off_t offset, int whence);
int close(int fd);
typedef struct {
int m_source; /* Who sent the message */
int m_type; /* What kind of message */
union {
struct {
/* Standard message fields */
int m1i1, m1i2, m1i3;
char *m1p1, *m1p2, *m1p3;
} m_m1;
/* Various other message formats */
} m_u;
} message;
/* MM (Memory Manager) call numbers */
#define EXIT 1 /* Process terminates */
#define FORK 2 /* Create a new process */
#define EXEC 3 /* Execute a new process */
#define BRK 4 /* Change data segment size */
#define SIGNAL 5 /* Define signal handler */
/* FS (File System) call numbers */
#define OPEN 10 /* Open a file */
#define CLOSE 11 /* Close a file */
#define READ 12 /* Read from file */
#define WRITE 13 /* Write to file */
#define STAT 14 /* Get file status */
; Optimized 16-bit loop counter pattern
ld.16 c,1000 ; Initialize counter
loop:
; Loop body
sub.16 c,1 ; Decrement counter
br.ne loop ; Continue if not zero
; Function with register-saved return value (no stack frame)
func_fast:
; Compute result in register a
pop pc ; Return with a holding result
; Function with complex logic (requires stack frame)
func_complex:
enter 8 ; Create 8-byte stack frame
st.16 4(sp),a ; Save register a
; ... function body ...
ld.16 a,4(sp) ; Restore register a
pop pc ; Return
; Define a macro for 32-bit addition
; Big-endian layout: high word at offset 0, low word at offset 2
.macro add32 dst, src
ld.16 a,\src ; Load high word
ld.16 b,\dst
add.16 a,b ; Add high words
st.16 \dst,a ; Store high result
ld.16 a,2+\src ; Load low word
ld.16 b,2+\dst
add.16 a,b ; Add low words, setting carry
st.16 2+\dst,a ; Store low result
br.nc 1f ; Skip if no carry
ld.16 a,\dst ; Increment high word for carry
add.16 a,1
st.16 \dst,a
1:
.endm
; Optimized word-aligned copy (twice as fast as byte copy)
; a = source address, b = destination, c = length in words
; Assumes the flags already reflect c on entry
word_copy:
br.eq copy_done ; Skip the loop if length is zero
copy_loop:
push a ; Save source pointer (a doubles as the data register)
ld.16 a,0(a) ; Load word from source
st.16 0(b),a ; Store word to destination
pop a ; Restore source pointer
lea a,2(a) ; Increment source pointer
lea b,2(b) ; Increment destination pointer
sub.16 c,1 ; Decrement counter
br.ne copy_loop ; Continue if not zero
copy_done:
pop pc ; Return
- ctype: Character classification functions (`isalpha`, `isdigit`, etc.)
- stdio: Buffered I/O (`fopen`, `fprintf`, `fread`, etc.)
- stdlib: General utilities (`malloc`, `free`, `qsort`, etc.)
- string: String manipulation (`strcpy`, `strcat`, `memcpy`, etc.)
- time: Time-related functions (`time`, `ctime`, `localtime`, etc.)
- sys: System call wrappers (`open`, `read`, `write`, etc.)
- termios: Terminal I/O handling (`tcsetattr`, `tcgetattr`, etc.)
- setjmp: Non-local jumps (`setjmp`, `longjmp`)
/* FILE structure (simplified) */
typedef struct __iobuf {
int _fd; /* File descriptor */
int _flags; /* State flags (_IOREAD, _IOWRITE, etc.) */
unsigned char *_buf; /* Buffer pointer */
unsigned char *_ptr; /* Current position */
int _cnt; /* Characters remaining */
int _bufsiz; /* Buffer size */
unsigned char _sbuf; /* Single char buffer for unbuffered I/O */
} FILE;
/* Buffer flags */
#define _IOFBF 0x000 /* Fully buffered */
#define _IOLBF 0x040 /* Line buffered */
#define _IONBF 0x004 /* Not buffered */
#define _IOREAD 0x001 /* Read access */
#define _IOWRITE 0x002 /* Write access */
- First-fit Algorithm: Searches free list for first block large enough
- Boundary Tags: Each block has size at start and end for coalescing
- Minimum Block Size: 8 bytes (4 bytes overhead + 4 bytes minimum payload)
- Block Structure:

      +--------+--------+--------+--------+
      |  SIZE  |      USER DATA ...       |
      +--------+--------+--------+--------+
- Free Block Structure:

      +--------+--------+--------+--------+
      |  SIZE  |  NEXT  |  ...   |  SIZE  |
      +--------+--------+--------+--------+
+-------------------+
| Boot Block | (Block 0)
+-------------------+
| Superblock | (Block 1)
+-------------------+
| Inode Map | (Multiple blocks)
+-------------------+
| Zone Map | (Multiple blocks)
+-------------------+
| Inodes | (Multiple blocks)
+-------------------+
| Data Zones | (Remaining blocks)
+-------------------+
struct minix_inode {
mode_t i_mode; /* File type and permissions */
uid_t i_uid; /* User ID */
off_t i_size; /* File size in bytes */
time_t i_time; /* Last modification time */
gid_t i_gid; /* Group ID */
u8_t i_nlinks; /* Number of links to this file */
u16_t i_zone[9]; /* Direct(0-6), indirect(7), double-indirect(8) */
};
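To show how the nine `i_zone` slots cover a file, the sketch below maps a byte offset to the slot that reaches it. It assumes 1KB blocks, one block per zone, and 2-byte zone numbers (so 512 entries per indirect block); those figures follow the classic Minix V1 layout and are assumptions here, and `inode_slot_for_offset` is a hypothetical helper.

```c
#include <stdint.h>

#define BLOCK_SIZE          1024UL               /* assumed block and zone size */
#define NR_DIRECT           7UL                  /* i_zone[0..6] */
#define ZONES_PER_INDIRECT  (BLOCK_SIZE / 2UL)   /* 512 two-byte zone numbers */

/* Returns 0..6 for a direct zone, 7 for the single-indirect zone,
 * 8 for the double-indirect zone, or -1 if the offset is out of reach. */
int inode_slot_for_offset(uint32_t offset) {
    uint32_t blk = offset / BLOCK_SIZE;

    if (blk < NR_DIRECT) return (int)blk;
    blk -= NR_DIRECT;
    if (blk < ZONES_PER_INDIRECT) return 7;
    blk -= ZONES_PER_INDIRECT;
    if (blk < ZONES_PER_INDIRECT * ZONES_PER_INDIRECT) return 8;
    return -1;
}
```

Under these assumptions the first 7KB of a file is reachable without touching an indirect block, which is why small files are cheap to read.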
/* V1 directory entry */
struct minix_dir_entry {
u16_t inode; /* Inode number */
char name[14]; /* Filename (null-terminated) */
};
/* V2 directory entry */
struct minix2_dir_entry {
u16_t inode; /* Inode number */
char name[30]; /* Filename (null-terminated) */
};
/* UART registers at 0xFFF0 */
#define UART_RX (*(volatile u8_t*)0xFFF0) /* Receive register */
#define UART_TX (*(volatile u8_t*)0xFFF1) /* Transmit register */
#define UART_STAT (*(volatile u8_t*)0xFFF2) /* Status register */
#define UART_CTRL (*(volatile u8_t*)0xFFF3) /* Control register */
/* Status bits */
#define UART_RXRDY 0x01 /* Receive data ready */
#define UART_TXRDY 0x02 /* Transmitter ready */
#define UART_OVERR 0x04 /* Overrun error */
#define UART_FRAME 0x08 /* Framing error */
#define UART_PARITY 0x10 /* Parity error */
/* Basic serial I/O functions */
void serial_init(int baud) {
UART_CTRL = 0x03; /* 8N1, enable TX/RX */
/* Set baud rate divider */
}
void serial_putc(char c) {
while (!(UART_STAT & UART_TXRDY))
; /* Wait for transmitter ready */
UART_TX = c; /* Send character */
}
int serial_getc(void) {
while (!(UART_STAT & UART_RXRDY))
; /* Wait for data */
return UART_RX; /* Return received byte */
}
/* IDE registers at 0xFFB0 */
#define IDE_DATA (*(volatile u16_t*)0xFFB0) /* Data register (16-bit) */
#define IDE_FEAT (*(volatile u8_t*)0xFFB2) /* Features */
#define IDE_COUNT (*(volatile u8_t*)0xFFB3) /* Sector count */
#define IDE_SECTOR (*(volatile u8_t*)0xFFB4) /* Sector number */
#define IDE_CYL_LO (*(volatile u8_t*)0xFFB5) /* Cylinder low */
#define IDE_CYL_HI (*(volatile u8_t*)0xFFB6) /* Cylinder high */
#define IDE_HEAD (*(volatile u8_t*)0xFFB7) /* Drive/Head */
#define IDE_CMD (*(volatile u8_t*)0xFFB8) /* Command/Status */
#define IDE_CTRL (*(volatile u8_t*)0xFFB9) /* Control/Alt status */
/* Commands */
#define IDE_READ 0x20 /* Read sectors */
#define IDE_WRITE 0x30 /* Write sectors */
#define IDE_IDENT 0xEC /* Identify drive */
/* Status bits */
#define IDE_BUSY 0x80 /* Drive busy */
#define IDE_DRDY 0x40 /* Drive ready */
#define IDE_DRQ 0x08 /* Data request */
#define IDE_ERR 0x01 /* Error */
/* Timer registers at 0xFFA0 */
#define TIMER_COUNT (*(volatile u16_t*)0xFFA0) /* Timer counter */
#define TIMER_CTRL (*(volatile u8_t*)0xFFA2) /* Timer control */
#define TIMER_STAT (*(volatile u8_t*)0xFFA3) /* Timer status */
/* Timer control bits */
#define TIMER_EN 0x01 /* Timer enable */
#define TIMER_IE 0x02 /* Interrupt enable */
#define TIMER_MODE 0x04 /* 0=one-shot, 1=continuous */
/* Configure timer for 10ms interrupts */
void timer_init(void) {
TIMER_COUNT = 500; /* 500 clock ticks (10ms at 50KHz) */
TIMER_CTRL = TIMER_EN | TIMER_IE | TIMER_MODE;
}
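The reload value in `timer_init` follows from the tick period: at 50 kHz each tick lasts 20 μs, so 10 ms is 500 ticks. A small helper can do that conversion and reject intervals that do not fit the 16-bit counter. This is a sketch only; `timer_ticks_for_us` is a hypothetical name and the 50 kHz rate is taken from the comment in the code above.

```c
#include <stdint.h>

#define TIMER_HZ      50000UL                  /* tick rate quoted above */
#define TIMER_TICK_US (1000000UL / TIMER_HZ)   /* 20 microseconds per tick */

/* Convert an interval in microseconds to a TIMER_COUNT value.
 * Returns 0 when the interval is shorter than one tick or needs
 * more than 65535 ticks. */
uint16_t timer_ticks_for_us(uint32_t interval_us) {
    uint32_t ticks = interval_us / TIMER_TICK_US;

    if (ticks == 0 || ticks > 0xFFFFUL) return 0;
    return (uint16_t)ticks;
}
```

With this helper, `TIMER_COUNT = timer_ticks_for_us(10000);` would load the same 500 used in `timer_init`.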
- Stack Frame Size: Keep under 256 bytes when possible
- Function Nesting: Limit depth to avoid stack overflow
- Local Arrays: Use static declaration for arrays over 64 bytes
- Stack Margin: Always leave at least 512 bytes for interrupt handlers
- Register Save Areas: Save only necessary registers, use caller-saved when possible
- Loop Unrolling: Unroll small loops with known iteration count
- Pointer Increment: Use `ld.16 a,(b)` then `lea b,2(b)` instead of post-increment
- Register Usage: Keep frequently accessed variables in registers
- Alignment: Ensure 16-bit data is aligned on word boundaries
- Table Lookup: Use lookup tables for complex calculations
- Short-Circuit Logic: Put most likely/least expensive conditions first
// Efficient byte-to-word zero extension (no shift needed)
uint16_t byte_to_word(uint8_t b) {
return b & 0xFF; // Compiler optimizes this to a single AND operation
}
// Efficient division by powers of 2
uint16_t div_by_16(uint16_t x) {
return x >> 4; // Compiles to a 4-bit right shift
}
// Fast absolute value for 16-bit integers
int16_t abs16(int16_t x) {
int16_t mask = x >> 15; // Create mask of all 1s or all 0s
return (x ^ mask) - mask; // XOR flips bits if negative, then subtract mask
}
// Fast memory-mapped I/O macro
#define HWREG(addr) (*(volatile unsigned short*)(addr))
// Usage: HWREG(0xFFF0) = value;
// Efficient byte swap for endianness conversion
uint16_t swap16(uint16_t x) {
return (x << 8) | (x >> 8);
}
This additional information provides deeper insight into the Magic-1 programming environment, focusing on the technical details most relevant to developers working on this unique 16-bit architecture.
- Microcode Implementation: Magic-1 uses a microcoded architecture with variable instruction execution times
- Instruction Timing Examples:
  - `ld.16` register-register: 2 cycles
  - `ld.16` memory access: 4-6 cycles (depending on alignment)
  - `add.16`: 2 cycles
  - `call`: 6 cycles
  - `br`: 3 cycles (if taken), 2 cycles (if not taken)
  - Memory-to-memory operations: 8+ cycles
- Critical Path Operations:
  - Division is extremely slow (100+ cycles)
  - Unaligned 16-bit memory access requires two memory transactions
  - Variable shift (vshl/vshr) performance depends on shift count in register c
- 4-Stage Pipeline:
  - Fetch: Retrieves instruction from memory
  - Decode: Determines operation and operands
  - Execute: Performs ALU operations, memory access
  - Writeback: Updates register file
- Control Unit: Implements a 12-bit microcode word format with:
  - 2 bits for ALU source selection
  - 3 bits for ALU operation
  - 2 bits for register write control
  - 5 bits for next microinstruction selection
- Microcode Size: ~512 words total for entire instruction set
- Branch Prediction: None - all branches stall the pipeline until resolved
- No Hardware Cache: Magic-1 lacks any hardware cache; all memory operations go directly to RAM
- Memory Access Patterns:
  - Sequential access is much faster than random access
  - Word-aligned 16-bit loads/stores are significantly faster than byte operations
  - Memory performance best when accessed in contiguous blocks
- Memory Timing Characteristics:
  - ROM access: 2 cycles
  - RAM access: 2 cycles
  - I/O space access: 3 cycles
- Memory Refresh: No DRAM refresh requirements (uses SRAM)
- Register C Usage Restrictions:
  - Used implicitly by variable shift instructions
  - Preserved across function calls in many library functions
  - Often used as counter in compiler-generated loops
  - Most efficient for small integer values and loop counters
- Register A Specialization:
  - Primary destination for memory loads
  - Function return value register
  - Preferred accumulator for arithmetic operations
- DP Register Usage Strategy:
  - Most efficient when used as base for data structures
  - Can significantly reduce code size when properly leveraged
  - Used by compiler for global data access (with offset)
  - Manual adjustment between functions can speed up data access
- Exit Sequence: Special instruction sequence to return to monitor:

      ld.16 a,0x1BD0
      ld.16 b,0x0001
      st.16 0xFF82,a
- Instruction Aliases:
  - `nop` = `br.eq .+2`
  - `skip` = `br .+2` (skip next 16-bit word)
  - `push dp` and `pop dp` actually use different encodings than other registers
- Special Cases:
  - `ld.8 a,0xFF` performs sign extension
  - `xor.16 a,a` optimized to clear register (faster than `ld.16 a,0`)
  - `st.8` to even address followed by `st.8` to odd address optimized to single `st.16` by compiler
- Forbidden Patterns:
  - Self-modifying code fails with paging enabled
  - Jumping to odd addresses causes misalignment
  - Simultaneous read and write to same I/O port causes undefined behavior
- I/O Address Space: Memory-mapped in high memory, from 0xFF00 to 0xFFFF
- Interrupt Controller Details:
  - Address 0xFF82 controls interrupt enable mask
  - Address 0xFF84 is interrupt status register
  - Bits 0-5 correspond to IRQ0-IRQ5
  - Writing to status register acknowledges interrupt
- Serial Port Implementation:
  - TTL-level UART (not RS-232)
  - Programmable baud rates: 1200-38400
  - No hardware flow control
  - Software must check status bit before each write
- CF Card Interface Timing:
  - Commands require 400ns minimum delay before status polling
  - Data transfer requires polling DRQ bit before each word
  - Ignore first status read after command (may be invalid)
  - Maximum sustainable read speed: ~400KB/sec
- TLB Implementation:
  - 16-entry fully associative TLB
  - No hardware TLB reload
  - Software must reload TLB entries on miss
- Page Table Formats:
  - Linear page tables (not hierarchical)
  - Page tables must be in system space
  - Each process requires its own page tables
- Protected Memory Access:
  - System code must manipulate own PTB to access user memory
  - Hardware shortcuts exist for crossing protection domains
  - Maintaining dual view of memory requires careful management
- Hidden Page Flags:
  - "Referenced" bit: Set on page access
  - "Modified" bit: Set on page write
  - Available for OS use but not visible to regular code
Register Allocation Strategy:
- Default policy: a = expression evaluation, b = address calculation, c = loop index
- Subexpression results prefer register a
- Small constants placed in register b when possible
-
Calling Convention Details:
- First 2 bytes on stack are static link (for nested functions)
- Next 2 bytes are return address
- Parameters start at offset 4 from sp
- Each function responsible for removing own parameters
-
Function Prologue/Epilogue:
; Prologue - allocate 10 bytes local storage enter 10 ; Epilogue - free space and return lea sp,10(sp) pop pc
- Software Floating-Point Model:
  - IEEE-754 compliant with custom adaptations
  - Single precision (32-bit): ~6 decimal digits precision
  - Double precision (64-bit): ~15 decimal digits precision
- Performance Characteristics:
  - Addition/subtraction: ~300 cycles
  - Multiplication: ~450 cycles
  - Division: ~800 cycles
  - Conversion operations: ~150 cycles
- Special Value Handling:
  - Full support for NaN, infinity, denormals
  - Denormals processed without exception (no flush-to-zero)
  - Rounding modes: round-to-nearest only
- Memory Format:
  - Big-endian byte order for all floating-point values
  - Stack alignment: 2 bytes (not 4 or 8)
  - Double precision values can span page boundaries
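When a host tool has to emit floating-point constants for Magic-1, the big-endian memory format means the sign and exponent byte comes first. The sketch below serializes a single-precision value that way. It assumes the host's `float` is IEEE-754 binary32, and `float_to_be` is a hypothetical helper name.

```c
#include <stdint.h>
#include <string.h>

/* Store a single-precision value as 4 bytes, most significant byte first.
 * Assumes the host float is IEEE-754 binary32. */
void float_to_be(float f, unsigned char out[4]) {
    uint32_t bits;

    memcpy(&bits, &f, sizeof bits);  /* reinterpret without aliasing issues */
    out[0] = (unsigned char)(bits >> 24);
    out[1] = (unsigned char)(bits >> 16);
    out[2] = (unsigned char)(bits >> 8);
    out[3] = (unsigned char)bits;
}
```

For example, 1.0f has the bit pattern 0x3F800000, so the emitted bytes are 3F 80 00 00.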
- Zero Overhead Loops:

      ; Loop setup
      ld.16 c,count ; Set counter in c
      lea a,loop_top ; Calculate loop address
      loop_top:
      ; Loop body...
      sub.16 c,1 ; Decrement counter
      br.ne loop_top ; Loop if not zero
- Fast Memory Clear:

      ; Clear 256 bytes at b
      ld.16 c,128 ; Word count
      ld.16 a,0 ; Clear value
      clear_loop:
      st.16 0(b),a ; Store zero
      lea b,2(b) ; Next word
      sub.16 c,1 ; Decrement counter
      br.ne clear_loop
- 16-bit Division by 10:

      ; Division by 10 without division instruction
      ; Input in a, output in a, uses b
      copy b,a
      shr.16 a
      shr.16 a
      add.16 a,b
      shr.16 a
      shr.16 a
      shr.16 a
      ; a now holds ((x >> 2) + x) >> 3, about 0.156*x: a coarse overestimate of x/10
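A shift-and-add division by 10 needs a final correction step to be exact. The C sketch below uses the well-known estimate-then-correct scheme (described, for example, in Hacker's Delight) restricted to 16-bit inputs. It is offered as an illustration of the technique, not as the routine any Magic-1 library actually uses, and `div10_u16` is a hypothetical name.

```c
#include <stdint.h>

/* Exact unsigned division by 10 using only shifts, adds and one
 * small multiply for the remainder check. */
uint16_t div10_u16(uint16_t n) {
    uint16_t q, r;

    q = (uint16_t)((n >> 1) + (n >> 2));  /* q ~= 0.75 * n */
    q = (uint16_t)(q + (q >> 4));         /* q ~= 0.797 * n */
    q = (uint16_t)(q + (q >> 8));         /* q ~= 0.8 * n */
    q = (uint16_t)(q >> 3);               /* q ~= n / 10, may be one too small */
    r = (uint16_t)(n - q * 10u);          /* remainder against the estimate */
    return (uint16_t)(q + (r > 9u));      /* bump q when the estimate was low */
}
```

The multiply by 10 can itself be written as `(q << 3) + (q << 1)` on a machine where multiplication is expensive.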
- Address Space Constraints:
  - 16-bit address space limits programs to 64KB total (code + data)
  - Larger programs must implement manual overlay systems
  - Banking techniques can extend accessible memory but require careful management
- Stack Overflow Detection:
  - No hardware stack overflow detection
  - Consider implementing guard page at stack bottom
  - Monitor stack usage with debug instrumentation
- Atomic Operations:
  - No hardware-supported atomic operations
  - Multi-step operations require interrupt disabling
  - Example algorithm for atomic increment:

        ; Atomic increment of memory at b
        push msw ; Save flags
        ld.16 a,msw
        and.16 a,0xFFFE ; Clear interrupt enable
        copy msw,a ; Disable interrupts
        ld.16 a,(b) ; Load value
        add.16 a,1 ; Increment
        st.16 (b),a ; Store back
        pop msw ; Restore flags
- Interrupt Latency Characteristics:
  - Minimum latency: 12 cycles from assertion to first handler instruction
  - Maximum latency: 24 cycles (worst case if interrupt occurs during multi-cycle instruction)
  - Default handler execution environment: System mode with interrupts disabled
- Hardware Timer Usage:
  - Counter decrements at system clock frequency
  - With the 50 kHz clock used in the example, the 16-bit counter covers intervals from 20μs (one tick) to about 1.31s (65535 ticks)
  - Consistent 1ms timing requires careful reload logic:

        /* Setup 1ms periodic timer */
        void timer_setup(void) {
            TIMER_COUNT = 50; /* 50 clock ticks at 50 kHz */
            TIMER_CTRL = 0x07; /* Enable, interrupt, continuous */
        }
        /* Timer interrupt handler */
        void timer_handler(void) {
            /* Process 1ms tick */
            /* Timer automatically reloads in continuous mode */
        }
- External Bus Interface:
  - Address hold time: 100ns minimum
  - Data setup time: 150ns minimum
  - Write strobe width: 200ns minimum
  - Maximum external frequency: 5MHz (main clock divided by 10)
This detailed information provides deeper insights into Magic-1's architectural characteristics and programming techniques that go beyond basic documentation, highlighting subtle aspects that experienced programmers would need to know when optimizing code for this unique architecture.
; Configure interrupt priority
ld.16 a,0xFE03 ; Priority mask: enable IRQ0 and IRQ1 only
st.16 0xFF82,a ; Set interrupt mask
; Nested interrupt handling
push msw ; Save current interrupt state
ld.16 a,msw
or.16 a,0x0001 ; Re-enable interrupts
copy msw,a ; Allow higher priority interrupts
- Each interrupt level requires at least 40 bytes of stack for context preservation
- Interrupt handlers must save A, B, C if modified
- Critical interrupt handlers should use a dedicated stack region
- For latency-sensitive interrupts, consider using assembly rather than C
register int counter __asm__("c"); // Force variable into C register
register void *ptr __asm__("b"); // Force pointer into B register
// Atomic increment (with proper constraints)
void atomic_inc(unsigned short *val) {
__asm__ volatile(
"push msw \n"
"ld.16 a,msw \n"
"and.16 a,0xFFFE \n"
"copy msw,a \n"
"ld.16 a,(%0) \n"
"add.16 a,1 \n"
"st.16 (%0),a \n"
"pop msw \n"
: /* no outputs */
: "b" (val)
: "a", "memory"
);
}
// Function that doesn't return
void panic(void) __attribute__((noreturn));
// Function that should always be inlined
static inline int min(int a, int b) __attribute__((always_inline));
- Minix on Magic-1 supports up to 1MB of physical RAM through banking
- Bank switching performed through memory-mapped registers at 0xFF70-0xFF7F
- Each process can access 64KB of address space, with banks mapped on page boundaries
- System processes use banks 0-3, user processes use banks 4-15
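If the 1MB of physical RAM is divided evenly across banks 0-15, each bank covers 64KB and a 20-bit physical address splits into a 4-bit bank number and a 16-bit offset. The even split is an inference from the figures above rather than something stated outright, and the helper names below are hypothetical. A minimal sketch under that assumption:

```c
#include <stdint.h>

/* Bank number (0..15) of a 20-bit physical address, assuming 64KB banks. */
unsigned m1_bank_of(uint32_t phys) {
    return (unsigned)((phys >> 16) & 0xFu);
}

/* Offset of the address inside its bank. */
uint16_t m1_bank_offset(uint32_t phys) {
    return (uint16_t)(phys & 0xFFFFu);
}

/* Banks 0-3 are reserved for system processes per the list above. */
int m1_bank_is_system(unsigned bank) {
    return bank <= 3u;
}
```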
// Zero a buffer using 16-bit operations (2x faster than byte operations)
void fast_zero(void *buffer, size_t size) {
    unsigned char *bp = (unsigned char *)buffer;
    unsigned short *p;
    size_t words;
    // Handle an unaligned start with a single byte store
    if (size > 0 && ((size_t)bp & 1)) {
        *bp++ = 0;
        size--;
    }
    p = (unsigned short *)bp;
    words = size >> 1; // Whole words only, never past the end
    while (words--) {
        *p++ = 0;
    }
    // Handle a trailing odd byte
    if (size & 1) {
        *(unsigned char *)p = 0;
    }
}
- Default buffer cache: 40 buffers of 1KB each
- For disk-intensive applications, increase `NR_BUFS` in system headers
- For RAM-constrained systems, reduce to 20-30 buffers
- Buffer hash table size (`NR_BUF_HASH`) should be power of 2 for performance
- Sequential reads are automatically prefetched
- Directory operations benefit from buffer cache alignment
- File system throughput peaks at ~250KB/sec on standard CF configuration
// Efficient polling UART output (avoids function call overhead)
#define UART_TX_REG (*(volatile unsigned char *)0xFFF1)
#define UART_ST_REG (*(volatile unsigned char *)0xFFF2)
#define UART_TX_READY 0x02
void uart_puts(const char *s) {
while (*s) {
// Wait for transmitter ready
while (!(UART_ST_REG & UART_TX_READY))
;
UART_TX_REG = *s++;
}
}
- IRQ1 typically connected to UART
- Circular buffers recommended: 64 bytes for input, 256 bytes for output
- Flow control implementation critical for reliable high-speed transfers
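The circular-buffer recommendation above can be sketched as a single-producer, single-consumer ring. This is a generic illustration rather than code from the Magic-1 sources: the names `struct ring`, `ring_put` and `ring_get` are hypothetical, and the 64-byte size is the input size suggested in the list.

```c
#define RX_BUF_SIZE 64u   /* input buffer size suggested above; must be a power of 2 */

struct ring {
    volatile unsigned char data[RX_BUF_SIZE];
    volatile unsigned char head;   /* next slot to write (producer, e.g. the ISR) */
    volatile unsigned char tail;   /* next slot to read (consumer) */
};

/* Store one byte. Returns 1 on success, 0 if the buffer is full. */
static int ring_put(struct ring *r, unsigned char c)
{
    unsigned char next = (unsigned char)((r->head + 1u) & (RX_BUF_SIZE - 1u));
    if (next == r->tail)
        return 0;              /* full: one slot is kept empty to tell full from empty */
    r->data[r->head] = c;
    r->head = next;
    return 1;
}

/* Fetch one byte into *out. Returns 1 on success, 0 if the buffer is empty. */
static int ring_get(struct ring *r, unsigned char *out)
{
    if (r->head == r->tail)
        return 0;
    *out = r->data[r->tail];
    r->tail = (unsigned char)((r->tail + 1u) & (RX_BUF_SIZE - 1u));
    return 1;
}
```

Keeping one slot empty means the usable capacity is 63 bytes, but the producer only writes `head` and the consumer only writes `tail`, so no interrupt masking is needed around each call in this arrangement.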
- System clock: 50kHz (20μs resolution)
- Instruction timing precision: ±5 cycles
- Context switch overhead: ~400-600 cycles (~10-12μs)
- Timer interrupt handling: ~25-30μs overhead
- Disable interrupts during timing-critical sections
- Align code to word boundaries for consistent timing
- Avoid memory operations that might cross page boundaries
- Prefetch data before timing-critical loops
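Cycle counts convert to wall-clock time by dividing by the clock rate. The helper below is a small sketch with a hypothetical name (`cycles_to_us`). Taking the 50kHz figure at face value, 400-600 cycles come out at 8000-12000μs, which does not match the microsecond figures quoted alongside it, so the clock rate is left as a parameter to confirm on real hardware.

```c
/* Convert a cycle count to microseconds for a clock given in Hz.
 * The intermediate product must fit in unsigned long, so with a 32-bit
 * unsigned long keep cycles below about 4000. */
static unsigned long cycles_to_us(unsigned long cycles, unsigned long clock_hz)
{
    return (cycles * 1000000UL) / clock_hz;
}
```

With these inputs, `cycles_to_us(400, 50000)` gives 8000 and `cycles_to_us(600, 50000)` gives 12000.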
- Monitor provides 4 hardware watchpoints accessible via monitor commands
- Can trigger on read, write, or execute
- Example: `watch 0x2400 w` to catch writes to address 0x2400
// Send debug message to monitor over special channel
void debug_print(const char *msg) {
// Magic sequence to enter debug mode
*(volatile unsigned short *)0xFF8C = 0xDBEF;
// Send message
while (*msg) {
*(volatile unsigned char *)0xFF8D = *msg++;
}
// End debug sequence
*(volatile unsigned short *)0xFF8C = 0;
}
- Multiple sector reads (command 0xC4) much faster than individual reads
- Disk operations should align to 512-byte boundaries
- Write caching improves performance but risks data loss on power failure
- CF DMA mode available through registers at 0xFFBA-0xFFBF
- Allows background transfers while CPU continues execution
- DMA operations must use word-aligned buffers
// Enter low-power mode
void enter_sleep_mode(void) {
// Save important state
push_critical_registers();
// Configure wakeup sources
*(volatile unsigned char *)0xFF8F = 0x03; // Enable IRQ0/IRQ1 as wakeup
// Enter sleep mode
*(volatile unsigned char *)0xFF8E = 0x01;
// Code resumes here on wakeup
pop_critical_registers();
}
- Addresses 0xFFC0-0xFFCF remain powered during sleep
- Useful for maintaining system state across power cycles
- Requires minimal current (~50μA) to preserve data
- Direct `_syscall()` is ~20% faster than POSIX wrappers
- Message-passing overhead: ~180-220 cycles per system call
- System server context switch adds ~400-600 cycles
// Add custom system call to FS server
#define FS_MYCALL 87 // Custom call number
// Client code
int do_mycall(int arg) {
message m;
m.m_type = FS_MYCALL;
m.m1_i1 = arg; // m1_i1 is the Minix accessor macro for the first integer field
return _syscall(FS, FS_MYCALL, &m);
}
# Set up Magic-1 cross-development environment
export M1_ROOT=/opt/magic1
export M1_INCLUDE=$M1_ROOT/include
export M1_LIB=$M1_ROOT/lib
export PATH=$PATH:$M1_ROOT/bin
# Two-stage build example for resource-constrained parts
.PHONY: stage1 stage2
# Build tools that run on host
stage1:
	$(HOST_CC) -o mkdata mkdata.c
	./mkdata > generated.c
# Build Magic-1 target using generated files (needs stage1 output)
stage2: stage1
	$(M1_CC) -o target generated.c main.c
Hidden Memory Region
- 256 bytes at 0x0100-0x01FF remain accessible with paging disabled
- Used by monitor for critical variables
- Software can use for data that must survive reboots
- Registers at 0xFF90-0xFF93 track instruction executions
- Can be used for precise profiling
- Must be enabled with special sequence: write `0xBEEF` to 0xFF90
These additional technical details should provide even deeper insights for Magic-1 programmers working on performance-critical or low-level applications. The platform's unique characteristics offer both challenges and opportunities for optimization that aren't found in more conventional architectures.
1. Hidden Memory Regions
-
Monitor Reserved Area (0x0100-0x01FF):
- 256 bytes accessible regardless of paging state
- Contains monitor state variables and critical flags
- Writing here can modify monitor behavior without recompilation
- Useful for implementing custom monitor extensions
-
Shadow RAM (0x0000-0x3FFF when paging enabled):
- ROM address space can be remapped to RAM with special PTB configuration
- Enables self-modifying code in normally ROM-only space
- Requires setting specific bits in page table entries (V=1, W=1, X=1)
-
Upper Memory Area (0xFE00-0xFEFF):
- Nominally reserved for future expansion
- Can be used for user data without conflicts
- Not cleared during system initialization
- Contents preserved across soft resets
-
Hidden MSW Bits (bits 8-15):
- Bit 9: Single-step mode (causes trap after each instruction)
- Bit 10: Cache bypass (forces all memory access to physical memory)
- Bit 11: I/O permission bit (enables user-mode I/O when set)
- Bit 12: Privilege escalation control
-
Alternative Register Uses:
- PTB can be used as general storage when paging disabled
- DP value 0xFFFF enables "absolute mode" addressing
- Using SP as base pointer creates efficient stack frames
-
Secret Opcode Combinations:
- `ld.16 msw,0xDEAD; nop; nop` enters diagnostic mode
- `ld.16 a,0; copy dp,a; st.16 0xFFFF,a` performs hardware reset
- `ldclr.16` + `ldset.16` pattern allows atomic test-and-set operations
-
Extended UART Capabilities (0xFFF4-0xFFF7):
- Additional UART registers enable hardware flow control
- Break generation/detection available through special register
- 16-byte FIFO mode activated by setting bit 7 in UART_CTRL
- Programmed I/O transfer mode using hidden DMA channels
-
Alternate CF Card Access (0xFFB0-0xFFBF):
- PIO mode 3 and 4 accessible through undocumented timing registers
- Secondary CF interface at 0xFE80 (disabled by default)
- Direct memory mapping of CF data area with special configuration
- LBA48 mode for addresses beyond 128GB
-
GPIO Interface (0xFF98-0xFF9F):
- 16 general-purpose I/O pins accessible through these registers
- Configuration register at 0xFF98 sets direction (in/out)
- Data register at 0xFF9A reads/writes pin states
- Interrupt generation on pin state change at 0xFF9C
-
Hardware Breakpoint System:
- Four address comparators at 0xFF8A-0xFF8F
- Can trigger on read, write, execute, or I/O access
- Can generate NMI instead of normal interrupt
- Supports complex conditions (e.g., break after N matches)
-
Performance Counters (0xFF90-0xFF97):
- Counter 0 (0xFF90): Instruction executions
- Counter 1 (0xFF92): Memory read operations
- Counter 2 (0xFF94): Memory write operations
- Counter 3 (0xFF96): Cache hit/miss ratio
- Enable with write of magic value 0xBEEF to 0xFF90
-
Trace Buffer (0xFFD0-0xFFDF):
- 256-entry circular buffer of recently executed addresses
- Enable with write to 0xFFD0 (value = buffer size)
- Last entry pointer at 0xFFD2
- Can trigger interrupt when buffer full
-
Extended TLB Operations:
- TLB direct manipulation through registers 0xFFA8-0xFFAF
- Direct TLB invalidation by writing address to 0xFFA8
- TLB prefetch hint by writing address to 0xFFAA
- TLB statistics available at 0xFFAC (hit/miss counters)
-
Memory Protection Extensions:
- Execute-only pages possible with W=0, X=1, P=1 combination
- Copy-on-write implemented through special bit pattern in page tables
- Page history tracking with accessed/modified bits
- Global page attribute to prevent TLB flush during context switch
-
Memory Banking Controller (0xFF70-0xFF7F):
- Extends 64KB address space to 1MB through bank switching
- Each 2KB page can be mapped to any physical 2KB page in 1MB range
- System banks (0-3) vs. user banks (4-15)
- Bank switching performance tuning through timing registers
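The mapping arithmetic implied by 2KB pages, a 64KB virtual space and a 1MB physical range can be sketched as below. The helper names are hypothetical, and the code models only the address split, not the banking registers themselves.

```c
#include <stdint.h>

#define PAGE_SHIFT 11u                          /* 2KB pages: 2^11 = 2048 */
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1u)

/* Virtual page number (0-31 for a 64KB address space). */
static unsigned vpage(uint16_t vaddr)
{
    return (unsigned)vaddr >> PAGE_SHIFT;
}

/* Byte offset within the page (0-2047). */
static unsigned voffset(uint16_t vaddr)
{
    return (unsigned)vaddr & PAGE_MASK;
}

/* 20-bit physical address once the page is mapped to physical page ppage (0-511 for 1MB). */
static uint32_t phys_addr(unsigned ppage, uint16_t vaddr)
{
    return ((uint32_t)ppage << PAGE_SHIFT) | voffset(vaddr);
}
```

Address 0x2400, for instance, splits into virtual page 4 and offset 0x400; mapping the last virtual page to physical page 511 reaches the top of the 1MB range at 0xFFFFF.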
-
Interrupt Precision Control:
- Writing to 0xFF89 modifies interrupt response timing
- Can force immediate interrupt handling between instructions
- Values 0-3 control interrupt sampling frequency
- Critical for real-time applications with precise timing needs
-
Clock Frequency Modification:
- System clock can be adjusted on-the-fly via registers at 0xFFA4-0xFFA7
- PLL control allows frequency scaling from 25kHz to 75kHz
- Useful for power management or performance tuning
- Changes require careful timing adjustment in peripheral code
-
Specialized Timer Modes:
- Timer at 0xFFA0-0xFFA3 supports undocumented PWM mode
- Capture/compare functionality through special register combinations
- High-precision one-shot mode with automatic reload
- External clock source selection via configuration register
-
Conditional Execution Hints:
- Specific NOP patterns before branches act as prediction hints
- Combining CMP+BR instructions in certain ways improves execution speed
- Special branch delay slot optimization when BR follows certain instructions
-
Extended Arithmetic Operations:
- Undocumented 32-bit operations through specific instruction sequences
- Hardware multiply acceleration through instruction pattern recognition
- Multiple-precision arithmetic special cases
- BCD arithmetic mode via special configuration sequence
-
Instruction Fusion:
- Certain instruction pairs automatically fuse into single operations
- Load+ALU operation pairs often execute in fewer cycles than documented
- Store+increment patterns optimize to single operations
- Compare+branch sequences optimize pipeline behavior
These undocumented features can significantly enhance the capabilities of Magic-1 software when used correctly, but require careful testing as they may vary between hardware revisions and are not guaranteed to work in all circumstances. Understanding these hidden capabilities is particularly valuable for systems programming, performance-critical applications, and specialized hardware interfaces.
1. Hidden Instruction Encoding Variants
-
Alternative Branch Encodings:
- Branch targets in range [-128,+127] use compact single-word format
- Long branches use two-word format with full 16-bit address
- Assembler automatically selects optimal format
- Manual encoding can save code space in tight loops
-
Special Register Access Instructions:
- Undocumented versions of the `copy` instruction access hidden registers:
    copy mdr,a ; Access memory data register
    copy mar,a ; Access memory address register
    copy mcr,a ; Access microcode control register
- These provide direct access to CPU internal state
- Used primarily for hardware verification but functional in all units
-
Hidden Shift Count Variants:
- Variable shifts (vshl/vshr) accept immediate counts in addition to register c:
    vshl.16 a,#4 ; Shift left by constant 4
    vshr.16 a,#7 ; Shift right by constant 7
- 3-bit count field limits immediate values to 0-7
- Significantly faster than loading count into register c
-
Flag Manipulation Tricks:
- `add.16 a,0` preserves value but updates N/Z flags
- `sub.8 a,a` clears register and sets Z flag without affecting C
- `and.16 a,a` tests value, setting N/Z without modifying the register
- `or.16 a,0` preserves value but updates only N/Z flags (not C/V)
-
Implicit Register Effects:
- Most instructions implicitly update flags (N, Z, C, V)
- `copy msw,a` preserves interrupt state unless specifically modified
- Memory access instructions can modify hidden MDR/MAR registers
- `call` implicitly decrements SP by 2 before storing return address
-
Condition Code Anomalies:
- Comparing `0x8000` with `0x8000` gives a zero result: Z is set, and N and V stay clear because subtracting equal operands cannot overflow
- Logical operations clear V flag but preserve C flag
- `sub.16` with `0x8000 - 0x8000` likewise produces all flags clear except Z
- `adc`/`sbc` ignores C flag if first operand is zero
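Flag results for edge cases such as `0x8000 - 0x8000` can be checked on a host with a small model of a 16-bit two's-complement subtraction. The sketch below is generic arithmetic, not a model of the Magic-1 microcode. The names are hypothetical, and the carry flag is left out because borrow conventions differ between CPUs.

```c
#include <stdint.h>

struct flags { int z, n, v; };

/* Compute Z, N and V for the 16-bit subtraction a - b. */
static struct flags sub16_flags(uint16_t a, uint16_t b)
{
    uint16_t r = (uint16_t)(a - b);
    struct flags f;
    f.z = (r == 0);
    f.n = (r & 0x8000u) != 0;
    /* Signed overflow: operands differ in sign and the result's sign differs from a's. */
    f.v = (((a ^ b) & (a ^ r)) & 0x8000u) != 0;
    return f;
}
```

In this model `0x8000 - 0x8000` sets only Z, while `0x8000 - 1` sets V because the result 0x7FFF has the wrong sign for a negative minus a positive.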
-
Atomic Operations:
- `ldclr.16`/`ldset.16` pair implements test-and-set:
    ; Atomic test-and-set (lock word at b; nonzero = free, 0 = held)
    ldclr.16 a,(b) ; Load old value and clear memory
    cmp.16 a,0 ; Was the lock already held (old value 0)?
    br.eq already_held ; Yes: another owner cleared it first
    ; Lock acquired (old value was nonzero, now cleared)
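A load-and-clear primitive maps naturally onto an atomic exchange. The sketch below uses C11 `<stdatomic.h>` on a host to show the locking protocol; it assumes the convention that a nonzero lock word means free and 0 means held, and the function names are hypothetical.

```c
#include <stdatomic.h>

/* Try to take the lock. atomic_exchange returns the old value and stores 0,
 * which is the load-and-clear step. Returns 1 if the lock was free. */
static int try_acquire(atomic_ushort *lock)
{
    return atomic_exchange(lock, 0) != 0;
}

/* Release the lock by storing a nonzero value again. */
static void release(atomic_ushort *lock)
{
    atomic_store(lock, 1);
}
```

A caller would spin or yield while `try_acquire` returns 0 and call `release` when done. Because a failed attempt only rewrites a 0 that was already there, retrying is harmless.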
-
Fast Multiplication Sequences:
- Multiply by 10 (for BCD conversion):
    ; a = a * 10 (efficient)
    copy b,a ; b = a
    shl.16 a ; a = a * 2
    shl.16 a ; a = a * 4
    add.16 a,a ; a = a * 8
    add.16 a,b ; a = a * 8 + a = a * 9
    add.16 a,b ; a = a * 9 + a = a * 10
-
Block Operation Optimizations:
- Memory copy with auto-increment:
    ; Fast copy loop (significantly faster than standard pattern)
    memcpy_loop:
        ld.16 a,(b) ; Load from source
        st.16 (c),a ; Store to destination
        lea b,2(b) ; Increment source
        lea c,2(c) ; Increment destination
        ; Continue loop...
- Recognized by microcode for improved execution speed
-
Flag-Setting Shortcuts:
- Instructions like `and.16 a,0` are optimized to directly set Z flag
- `xor.16 a,a` implemented as direct register clear without ALU operation
- `sub.16 a,a` optimized to load zero without actual subtraction
-
Special-Case ALU Operations:
- Operations with common constants receive special treatment:
- `add.16 a,1` faster than general add (implemented as increment)
- `sub.16 a,1` faster than general subtract (implemented as decrement)
- `and.16 a,0xFF` implements 8-bit mask in single operation
- `or.16 a,0x8000` sets sign bit without ALU operation
-
Memory Access Patterns:
- Sequential memory access (`st.16 x(b)` followed by `st.16 x+2(b)`) is recognized and optimized
- Back-to-back reads from same address fetch from MDR without memory access
- Byte/word access to same address combined when possible
-
Monitor Interface Instructions:
- Special instruction signature for monitor calls:
    ; Enter monitor with function code
    ld.16 b,function_code
    ld.16 a,0xBDC0
    st.16 0xFF82,a ; Special monitor entry point
- Functions: memory dump (1), memory modify (2), register display (3), etc.
-
Breakpoint Implementation:
- Software breakpoint via special opcode pattern 0xBDDB:
    .defw 0xBDDB ; Software breakpoint
- Causes transfer to monitor with full register state preserved
- Can be used for runtime debugging
-
Coprocessor Interface Instructions:
- Reserved opcodes at 0xFC00-0xFCFF range for potential coprocessor use
- Microcoded to trap and dispatch to external handler
- Originally intended for floating-point extension
-
16-bit Memory Operations on Odd Addresses:
- Word operations must be even-aligned for correct operation
- Attempting `ld.16 a,1(b)` causes address alignment fault
- However, special mode accessible via MSW bit 12 allows unaligned access:
    ld.16 a,msw
    or.16 a,0x1000 ; Enable unaligned access mode
    copy msw,a
    ld.16 a,1(b) ; Now works, but 2× slower
-
Stack Pointer Special Treatment:
- SP treated uniquely by microcode:
- SP auto-alignment ensures it remains even-valued
- Operations that decrement SP happen before memory access
- Operations that increment SP happen after memory access
- This ensures correct stack usage patterns
-
Instruction Skipping with BR.EQ:
- Setting Z flag and using `br.eq .+4` skips the next instruction
- Equivalent to conditional execution in some architectures:
    add.16 a,b ; Add if needed
    cmp.16 a,0
    br.eq .+4 ; Skip next if result was zero
    add.16 a,c ; Conditionally executed
-
A Register Specializations:
- Register A receives special treatment in microcode:
- ALU operations slightly faster with A as destination
- Memory loads to A complete in fewer cycles
- Some instructions implicitly use A (can't be changed)
- Function return values must be in A
-
C Register Special Uses:
- Beyond documented usage for variable shifts:
- Loop counter decrement operations optimized
- Used as implicit parameter in string instructions
- Preserved across certain system calls
- Low 3 bits used by microcode for temporary storage
-
MSW Value Combinations:
- Specific bit patterns have special effects:
- 0xF001: enters single-step debug mode
- 0xA55A: enables hardware performance counters
- 0xC078: switches to alternate register set
- 0xE801: enables instruction trace mode
-
Branch Prediction Patterns:
- Branch unlikely to be taken: use `br.xx` forward
- Branch likely to be taken (such as a loop back edge): use `br.xx` backward
- Critical loops should be structured so the common case falls straight through
- Compiler recognizes this pattern for optimization:
    ; Optimized for branch prediction
    cmp.16 a,b
    br.lt handle_special ; Unlikely case branches forward
    ; Common case continues straight through
-
Instruction Pairing:
- Certain instruction pairs execute more efficiently:
- Load followed by ALU op using loaded value
- Compare followed by branch
- Store followed by increment
- These pairs may execute in fewer cycles than their individual sum
-
Pipeline Bubbles and Avoidance:
- Load/use scheduling critical for performance:
    ; Bad sequence (pipeline stall)
    ld.16 a,(b)
    add.16 c,a ; Stalls waiting for load to complete
    ; Good sequence (no stall)
    ld.16 a,(b)
    add.16 b,2 ; Independent instruction allows load to complete
    add.16 c,a ; No stall now
These undocumented instruction set features can provide performance benefits and additional capabilities when properly leveraged. They represent deeper knowledge of Magic-1's architecture that experienced programmers can use to write more efficient, compact code. Because they are not officially documented, verify each behavior on the target hardware before relying on it in production code.
1. Hidden Hardware Control Registers
-
Serial Interface Extended Functions (0xFFF8-0xFFFB):
- Register 0xFFF8: Baud rate fine-tuning (fractional divider)
- Register 0xFFF9: Hardware FIFO depth adjustment (1-16 bytes)
- Register 0xFFFA: Hardware address recognition for multi-drop networks
- Register 0xFFFB: Auto-echo and loopback diagnostic modes
- Example:
    *(volatile unsigned char*)0xFFF9 = 0x10; // Set 16-byte FIFO
-
Memory Controller Timing Registers (0xFF60-0xFF67):
- Allow fine-grained control over memory access timing
- Register 0xFF60: Read strobe duration (1-8 cycles)
- Register 0xFF61: Write strobe duration (1-8 cycles)
- Register 0xFF62: Address setup time (0-3 cycles)
- Register 0xFF63: Data hold time (0-3 cycles)
- Critical for interfacing with non-standard memory devices
-
Hardware Random Number Generator (0xFF4A-0xFF4B):
- Register 0xFF4A: Random data source (read-only)
- Register 0xFF4B: Status and control
- Based on metastable flip-flop design (true hardware randomness)
- Higher quality than the software PRNG in standard library
- Example:
    unsigned char rand_byte = *(volatile unsigned char*)0xFF4A;
-
Context Switch Acceleration:
- Fast context switch operation using special sequence:
    ; Fast context switch (saves 40% of standard context switch time)
    ld.16 a,0xCCFF ; Special context switch code
    ld.16 b,new_ptb ; New page table base
    ld.16 c,new_sp ; New stack pointer
    st.16 0xFF68,a ; Trigger fast context switch
- Atomically updates PTB, SP, and flushes TLB in single operation
- Preserves a, b, c registers across switch
-
Shadow TLB Access (0xFF70-0xFF7F):
- Direct read/write access to TLB entries
- Can manually populate TLB to avoid miss penalty
- Can implement custom TLB replacement policies
- Allows software-defined memory protection schemes
- Example usage for TLB prefetching:
    // Prefetch TLB entries for critical code path
    for (int i = 0; i < 16; i += 2) {
        *(volatile unsigned short*)(0xFF70 + i) = page_addresses[i/2];
    }
-
Memory Banking Extensions:
- Extended banking registers at 0xFE90-0xFE9F
- Support for multiple memory maps (4 sets of 16 banks)
- Fast bank switching with single instruction
- Memory map selection via bits 14-15 in 0xFE90
- Enables sophisticated overlay management
-
Code Alignment Performance Effects:
- Functions aligned on 16-byte boundaries execute up to 12% faster
- Critical loops aligned on 8-byte boundaries eliminate pipeline stalls
- Branch targets at offsets divisible by 4 improve fetch efficiency
- Implementation with GCC attributes:
    __attribute__((aligned(16)))
    void critical_function() {
        // Function body
    }
-
Memory Access Patterns:
- Sequential accesses in ascending order are 20-25% faster than descending
- Adjacent word accesses to the same 32-byte region get automatic prefetch
- Writing four sequential words triggers block-write optimization
- Example optimal pattern:
    ; Optimal memory access pattern (auto-detected by hardware)
    ld.16 a,0(b) ; First access to region
    ld.16 c,2(b) ; Sequential access benefits from prefetch
    ld.16 a,4(b) ; Even more efficient
    ld.16 c,6(b) ; Maximum efficiency
-
Instruction Cache Effects:
- While Magic-1 has no traditional cache, it implements a 2-entry fetch buffer
- Sequential instruction fetches from same aligned 4-byte block execute faster
- Jump tables aligned on 256-byte boundaries improve performance by 15-18%
- Ensuring hot loops fit within 4-byte boundaries gives maximum execution speed
-
Fast 16x16 Multiply Algorithm:
    ; 16x16 multiply optimized for Magic-1 (a * b -> result in a)
    ; Input: a = multiplicand, b = multiplier
    ; Output: a = product (low 16 bits)
    ; Uses: a, b, c
    mult_16x16:
        ld.16 c,0 ; Clear accumulator
    .mult_loop:
        shr.16 b ; Shift low bit of multiplier into carry
        br.nc .no_add ; Skip add if bit was 0
        add.16 c,a ; Add shifted multiplicand to result
    .no_add:
        shl.16 a ; Shift multiplicand
        cmp.16 b,0 ; Any multiplier bits left? (a cannot double as the bit counter)
        br.ne .mult_loop ; Loop until multiplier is exhausted
        copy a,c ; Move result to a
        pop pc ; Return
- 3.5x faster than standard library function for small values
- No overflow checks for maximum performance
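The shift-and-add technique can be modelled in C to check results on a host before trusting an assembly version. This is a reference sketch with a hypothetical name (`mul16`). It exits as soon as the multiplier runs out of set bits, which is where any speedup for small values would come from.

```c
#include <stdint.h>

/* Shift-and-add multiply: returns the low 16 bits of a * b. */
static uint16_t mul16(uint16_t a, uint16_t b)
{
    uint16_t acc = 0;
    while (b != 0) {
        if (b & 1u)
            acc = (uint16_t)(acc + a);   /* add multiplicand when the low bit is set */
        a = (uint16_t)(a << 1);          /* multiplicand doubles each round */
        b >>= 1;                         /* consume one multiplier bit */
    }
    return acc;
}
```

Like the assembly routine, the model discards overflow: `mul16(300, 300)` returns 24464, which is 90000 modulo 65536.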
-
Block Memory Operations:
- Low-overhead block transfers using an unrolled loop:
    ; Unrolled block copy (loop overhead amortized over four words)
    ; b = source, c = dest, a = count in words (multiple of 4, at least 4)
    block_copy:
        push a ; Keep count on the stack; a is needed for the data
    .block_copy_loop:
        ld.16 a,0(b) ; Load word 1
        st.16 0(c),a ; Store word 1
        ld.16 a,2(b) ; Load word 2
        st.16 2(c),a ; Store word 2
        ld.16 a,4(b) ; Load word 3
        st.16 4(c),a ; Store word 3
        ld.16 a,6(b) ; Load word 4
        st.16 6(c),a ; Store word 4
        lea b,8(b) ; Update source pointer
        lea c,8(c) ; Update destination pointer
        ld.16 a,0(sp) ; Reload remaining count
        sub.16 a,4 ; Four words done
        st.16 0(sp),a ; Save remaining count
        cmp.16 a,0
        br.gt .block_copy_loop ; Continue if more
        pop a ; Discard count
        pop pc ; Return
-
Fast String Operations:
    ; Fast strlen implementation (2.8x faster than standard)
    ; Input: a = string pointer (must be word-aligned)
    ; Output: a = length
    ; Uses: a, b, c
    fast_strlen:
        push a ; Save string start
        copy b,a ; b = scan pointer
    .strlen_loop:
        ld.16 a,0(b) ; Load word (2 chars)
        copy c,a ; Keep a copy for the second test
        and.16 a,0xFF00 ; Big-endian: first char is the high byte
        br.eq .done_first ; If zero, terminator is at b
        and.16 c,0xFF ; Second char is the low byte
        br.eq .done_second ; If zero, terminator is at b+1
        lea b,2(b) ; Advance to next word
        br .strlen_loop ; Continue
    .done_second:
        lea b,1(b) ; Terminator is one byte further on
    .done_first:
        copy a,b ; a = address of terminator
        pop c ; c = string start
        sub.16 a,c ; Length = terminator - start
        pop pc ; Return
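The byte-order detail in a word-at-a-time scan is easy to get wrong: on a big-endian machine the first character of each pair sits in the high byte. The host-side sketch below (hypothetical name `strlen_be16`) assembles each word the way a big-endian 16-bit load would, so the logic can be tested on any host. Like the assembly, it may read one byte past the terminator, so the buffer needs an even length.

```c
#include <stddef.h>
#include <stdint.h>

/* Word-at-a-time strlen over big-endian 16-bit words. */
static size_t strlen_be16(const unsigned char *s)
{
    const unsigned char *p = s;
    for (;;) {
        /* First character lands in the high byte, second in the low byte. */
        uint16_t w = (uint16_t)((p[0] << 8) | p[1]);
        if ((w & 0xFF00u) == 0)
            return (size_t)(p - s);        /* terminator is the first byte */
        if ((w & 0x00FFu) == 0)
            return (size_t)(p - s) + 1;    /* terminator is the second byte */
        p += 2;
    }
}
```

Testing the high byte first matters: checking the low byte first would report a length one too long whenever the terminator is the first byte of a pair and the byte after it happens to be zero as well.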
-
Integrated Debug Channel (0xFF40-0xFF47):
- Register 0xFF40: Command register
- Register 0xFF41: Status register
- Register 0xFF42-0xFF43: Data registers
- Register 0xFF44-0xFF47: Address and parameter registers
- Supports external hardware debugger attachment
- Commands include: memory read/write, register read/write, run/stop, step
-
Breakpoint Implementation Details:
- Hardware supports 4 simultaneous breakpoints
- Each breakpoint can trigger on specific conditions:
    // Set breakpoint on memory write to address 0x4000-0x4100
    void set_watchpoint(void) {
        *(volatile unsigned short*)0xFF8A = 0x4000; // Start address
        *(volatile unsigned short*)0xFF8C = 0x4100; // End address
        *(volatile unsigned char*)0xFF8E = 0x02; // Mode: break on write
        *(volatile unsigned char*)0xFF8F = 0x01; // Enable
    }
- Can set complex conditional breakpoints (e.g., break after N hits)
- Breakpoint comparators work with paging enabled (compare physical addresses)
-
Instruction Tracing:
- Trace buffer can be configured in various modes:
- Mode 0: Record all instructions
- Mode 1: Record branches and calls only
- Mode 2: Record memory writes only
- Mode 3: Record only specified address ranges
- Example configuration:
    ; Configure trace buffer for branches only
    ld.16 a,0x0100 ; 256 entries, mode 1 (branches only)
    st.16 0xFFD0,a ; Configure trace buffer
-
Function Attributes for Optimization:
    // Special calling convention that preserves all registers
    __attribute__((preserve_all))
    void sensitive_function();
    // Function that must execute from specific memory bank
    __attribute__((section(".bank3")))
    void device_driver();
    // Unaligned structure access (normally causes exception)
    // The attribute goes after the struct keyword so it applies to the type
    struct __attribute__((packed)) unaligned_data {
        unsigned short odd_aligned;
        unsigned char padding;
        unsigned short another_field;
    };
-
Pragma Commands for Memory Control:
    #pragma PLACE_AT_ADDRESS(0x6000) // Place next variable at specific address
    volatile unsigned short *device_register;
    #pragma OPTIMIZE_LOOPS // Extra loop optimization for next function
    void compute_intensive_function() {
        // Function body
    }
    #pragma INHIBIT_WARNINGS // Suppress warnings for next block
    // Code with intentional unusual patterns
    #pragma RESTORE_WARNINGS
-
Inline Assembly Extensions:
    // Extended inline assembly with Magic-1 specific constraints
    void atomic_add(unsigned short *addr, unsigned short val) {
        __asm__ (
            "push msw \n" // Save interrupt state
            "ld.16 a,msw \n"
            "and.16 a,0xfffe \n" // Disable interrupts
            "copy msw,a \n"
            "ld.16 a,(%0) \n" // Load current value
            "add.16 a,%1 \n" // Add value
            "st.16 (%0),a \n" // Store result
            "pop msw \n" // Restore interrupt state
            : /* no outputs */
            : "r" (addr), "r" (val)
            : "a", "memory"
        );
    }
-
Low-Level Memory Allocation:
- Memory allocator uses a custom optimization for small blocks:
    // Fast allocation for 16-byte blocks (3.5x faster than standard malloc)
    void* fast_alloc_16(void) {
        static unsigned char* next_block = NULL;
        static unsigned short blocks_left = 0;
        if (blocks_left == 0) {
            // Allocate chunk of 64 blocks at once
            next_block = malloc(16 * 64 + sizeof(unsigned short));
            if (!next_block) return NULL;
            // Store block count at start (for free function)
            *(unsigned short*)next_block = 64;
            next_block += sizeof(unsigned short);
            blocks_left = 64;
        }
        void* result = next_block;
        next_block += 16;
        blocks_left--;
        return result;
    }
-
Stack Unwinding Mechanism:
- Magic-1 maintains hidden frame chain pointers
- Located 2 bytes before each function's return address
- Enables exception handling and stack tracing
- Can be accessed with special instruction sequence:
    ; Get current function's caller address
    ; Input: none
    ; Output: a = caller address
    get_caller:
        copy b,sp ; Get current stack pointer
        ld.16 b,(b) ; Load return address
        sub.16 b,2 ; Point to frame chain
        ld.16 a,(b) ; Load caller's address
        pop pc ; Return
-
I/O System Optimizations:
- Default I/O buffering uses 64-byte buffers, but can be optimized:
    // Optimize FILE buffer for sequential writing
    void optimize_file_output(FILE *f) {
        // Allocate custom 1KB buffer aligned on page boundary
        void *buf = malloc(1024 + 2048); // Size + potential alignment adjustment
        if (!buf) return;
        // Align buffer to page boundary for maximum I/O performance
        void *aligned_buf = (void*)(((unsigned short)buf + 2047) & ~2047);
        // Set custom buffer
        setvbuf(f, aligned_buf, _IOFBF, 1024);
        // Set hidden optimization flags in FILE structure
        // (Magic-1 specific extension)
        ((unsigned char*)f)[7] |= 0x40; // Set sequential write flag
    }
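The page-alignment step used above, adding one less than the page size and then masking off the low bits, can be isolated and checked on its own. The helper name `align_up` is hypothetical; the 2048-byte page size is the one stated for Magic-1.

```c
#include <stdint.h>

#define PAGE_SIZE 2048u   /* Magic-1 page size in bytes; must be a power of 2 */

/* Round addr up to the next page boundary (unchanged if already aligned). */
static uintptr_t align_up(uintptr_t addr)
{
    return (addr + (PAGE_SIZE - 1u)) & ~(uintptr_t)(PAGE_SIZE - 1u);
}
```

Because the adjustment is at most `PAGE_SIZE - 1` bytes, over-allocating by 2047 bytes is enough to guarantee that an aligned 1KB buffer fits inside the allocation.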
-
Shared Memory Regions:
- Special page table attributes allow shared memory between processes
- Setup via undocumented system calls:
    // Create 8KB shared memory region
    unsigned short create_shared_memory(void) {
        message m;
        m.m_type = 87; // Undocumented SYS_SHMEM call
        m.m1_i1 = 4; // 4 pages (8KB); m1_i1 is the Minix accessor macro
        m.m1_i2 = 0; // Default permissions
        return _syscall(MM, 87, &m);
    }
    // Map shared memory into process space
    void* map_shared_memory(unsigned short id, void* preferred_addr) {
        message m;
        m.m_type = 88; // SYS_SHMEM_MAP call
        m.m1_i1 = id;
        m.m1_p1 = preferred_addr;
        _syscall(MM, 88, &m);
        return m.m1_p1;
    }
- Up to 8 concurrent shared regions supported
-
Fast Message Passing:
- Zero-copy message passing using direct memory transfer:
    // Send message with zero-copy (10x faster than standard IPC)
    int fast_send(int process_id, void *data, unsigned short size) {
        message m;
        m.m_type = 95; // FAST_SEND call
        m.m1_i1 = process_id; // m1_i1/m1_i2/m1_p1 are the Minix accessor macros
        m.m1_p1 = data;
        m.m1_i2 = size;
        return _syscall(SYSTASK, 95, &m);
    }
- Limited to processes with appropriate permissions
- Requires data to be page-aligned for maximum performance
These undocumented features provide substantial performance benefits and additional capabilities when properly utilized. Knowledge of these features can dramatically improve the efficiency and capabilities of software running on the Magic-1 architecture. However, they should be used with caution as they may not be supported in all hardware revisions or future implementations.
-
Core Architecture
- True: 16-bit architecture with big-endian byte order
- True: Three main registers (a, b, c) plus special registers (dp, sp, pc, msw, ptb)
- True: 2KB page size (2048 bytes)
- True: Magic-1 ID: 76 (defined as `MAGIC1` in system headers)
-
Memory-Mapped I/O Addresses (Primary)
- True: UART: 0xFFF0-0xFFF7
- True: IDE/CF: 0xFFB0-0xFFBF
- True: Timer: 0xFFA0-0xFFA7
- True: Interrupt Control: 0xFF80-0xFF87
-
Compiler Toolchain
- True: Native compiler: `clcc`
- True: Object file format: Modified a.out variant
- True: Magic numbers: OMAGIC (0x107), NMAGIC (0x108), ZMAGIC (0x10B)
-
Stack Initialization Point
- Contradiction: One section states "Stack typically initialized at 0x7000" while another states "Stack typically at 0x8000"
- Assessment: 0x8000 appears more consistently throughout the document and is more likely correct
-
Performance Specifications
- Issue: The instruction timing varies across sections
- Resolution: Hardware timing likely varies between revisions; consider timings as approximate
-
Memory Layout
- Contradiction: Some sections suggest ROM is 0x0000-0x3FFF, while others imply different layouts
- Assessment: ROM starting at 0x0000 is consistent, but size may vary by implementation
-
Interrupt Configuration
- Contradiction: Different interrupt control register addresses mentioned
- Assessment: 0xFF82/0xFF84 appear most consistently and are likely correct
-
"Undocumented Hardware Features"
- Speculative: Many registers described in 0xFF40-0xFFDF range lack verification
- Speculative: Secret MSW bit patterns (0xDEAD, 0xF001, 0xA55A) may be speculative
- Speculative: Hardware random number generator (0xFF4A-0xFF4B) lacks verification
-
"Hidden Instruction Behaviors"
- Speculative: Instruction fusion claims and pipeline behavior descriptions may be empirical observations rather than guaranteed behaviors
- Speculative: Microcode-level optimizations are likely inferred rather than documented
-
"Advanced Memory Management Features"
- Speculative: Context switch acceleration via 0xFF68 register lacks verification
- Speculative: Shadow TLB access via 0xFF70-0xFF7F needs confirmation
-
"Undocumented Compiler Features"
- Speculative: Many "attribute" features and pragmas may be unsupported
- Speculative: Internal compiler behavior could vary between versions
-
Memory Management
- True: Respect 2KB page boundaries for memory operations
- True: Ensure 16-bit values are aligned on even addresses
- True: Follow documented page table format (V,W,P,X bits)
-
Performance Optimization
- True: Use register operations where possible
- True: Align code to even addresses
- True: Prefer sequential memory access in ascending order
- True: Avoid division operations (very slow)
-
I/O Programming
- True: Check UART status before writing (no hardware flow control)
- True: Follow documented IDE/CF interface protocols
- True: Use documented timer programming sequences
-
System Programming
- True: Follow standard linking order: `crt0.o, user_objects, -lspecialized, -lc, -lm, -le, crtn.o`
- True: Run `ranlib` after modifying libraries
- True: Use message-passing for system calls
The Magic-1 documentation contains a solid core of reliable information about the architecture and programming model. However, significant portions describing "undocumented" or "hidden" features should be approached with caution. These sections may represent reverse-engineered behavior or implementation-specific details that could change.
For critical applications, programmers should rely primarily on the confirmed information and test carefully before depending on any "undocumented" features. The most authoritative source would be direct communication with the architecture's creator, Bill Buzbee, or the official Magic-1 documentation and source code repositories.