Implementing an Instruction Cache for Magic-1: Critical Technical Details

Adding an instruction cache to Magic-1 would be one of the most impactful architectural improvements. Based on the verified architecture documentation, here's a detailed technical implementation approach:

Current Architecture Constraints

The Magic-1 currently has:

  • No hardware cache
  • 16-bit address space (64KB total)
  • 2KB page size
  • Memory-mapped I/O at 0xF800-0xFFFF
  • Three general-purpose registers (a, b, c)
  • Limited microcode space

Instruction Cache Design

1. Cache Structure

// Simplified Verilog sketch of the proposed cache
module instruction_cache (
    input  [15:0] address,         // Instruction address from the fetch unit
    input         clock, reset,
    input         fill,            // Assert to load fill_data into the missed line
    input  [15:0] fill_data,       // Instruction word fetched from memory
    output [15:0] instruction,     // Cached instruction
    output        hit              // Cache hit indicator
);
    // Direct-mapped cache with 256 entries (512 bytes of instruction storage)
    // Each entry holds a 16-bit instruction word, an 8-bit tag, and a valid bit
    reg [15:0] cache_data[255:0];  // Instruction storage
    reg [7:0]  cache_tag[255:0];   // Tag bits (upper 8 bits of address)
    reg        cache_valid[255:0]; // Valid bits

    // Address breakdown
    wire [7:0] index = address[7:0];   // Lower 8 bits select the cache line
    wire [7:0] tag   = address[15:8];  // Upper 8 bits form the tag

    // Cache hit detection
    assign hit = cache_valid[index] && (cache_tag[index] == tag);

    // Output data on a hit; the fetch unit ignores this value on a miss
    assign instruction = hit ? cache_data[index] : 16'h0000;

    // Fill and invalidation logic
    integer i;
    always @(posedge clock) begin
        if (reset) begin
            // Invalidate every entry on reset
            for (i = 0; i < 256; i = i + 1)
                cache_valid[i] <= 1'b0;
        end else if (fill) begin
            // Load the missed line with the word fetched from memory
            cache_data[index]  <= fill_data;
            cache_tag[index]   <= tag;
            cache_valid[index] <= 1'b1;
        end
    end
endmodule

2. Integration with Instruction Fetch

// Integration with the existing fetch logic
module fetch_unit (
    input         clock, reset,    // System clock and reset
    input  [15:0] pc,              // Program counter
    output [15:0] instruction      // Instruction to execute
);
    wire        cache_hit;
    wire [15:0] cache_instruction;
    wire [15:0] memory_instruction;

    // Try the cache first; on a miss, the word returned by memory
    // is written back into the missed line through the fill port
    instruction_cache icache (
        .address(pc),
        .clock(clock),
        .reset(reset),
        .fill(!cache_hit),
        .fill_data(memory_instruction),
        .instruction(cache_instruction),
        .hit(cache_hit)
    );

    // Only access memory on a cache miss
    // (memory_access stands for the existing memory interface; for simplicity
    // this sketch assumes it returns the instruction word in the same cycle)
    memory_access mem_access (
        .address(pc),
        .data_out(memory_instruction),
        .enable(!cache_hit)        // Only enable on a miss
    );

    // Select the instruction source based on hit/miss
    assign instruction = cache_hit ? cache_instruction : memory_instruction;
endmodule

3. Cache Management

Cache Control Registers

// New memory-mapped control registers for the cache
// (proposed addresses within the device page at 0xF800-0xFFFF)
#include <stdint.h>

#define ICACHE_CTRL_REG    (*(volatile uint16_t*)0xFF50)
#define ICACHE_STAT_REG    (*(volatile uint16_t*)0xFF52)
#define ICACHE_FLUSH_ADDR  (*(volatile uint16_t*)0xFF54)

// Control register bits
#define ICACHE_ENABLE      0x0001  // Enable cache
#define ICACHE_FLUSH       0x0002  // Flush entire cache
#define ICACHE_STATS       0x0004  // Enable hit/miss statistics
#define ICACHE_LINE_INV    0x0008  // Invalidate single line

// Status register bits
#define ICACHE_READY       0x0001  // Cache ready for operation
#define ICACHE_MISS_CNT    0xFF00  // High byte: miss counter
#define ICACHE_HIT_CNT     0x00FF  // Low byte: hit counter

Software Control Functions

// Enable instruction cache
void icache_enable(void) {
    ICACHE_CTRL_REG |= ICACHE_ENABLE;
}

// Disable instruction cache
void icache_disable(void) {
    ICACHE_CTRL_REG &= ~ICACHE_ENABLE;
}

// Flush entire cache
void icache_flush(void) {
    ICACHE_CTRL_REG |= ICACHE_FLUSH;
    while (!(ICACHE_STAT_REG & ICACHE_READY)) { }  // Wait for completion
    ICACHE_CTRL_REG &= ~ICACHE_FLUSH;
}

// Invalidate the single cache line that covers the given address
void icache_invalidate_line(uint16_t address) {
    ICACHE_FLUSH_ADDR = address;
    ICACHE_CTRL_REG |= ICACHE_LINE_INV;
    ICACHE_CTRL_REG &= ~ICACHE_LINE_INV;
}
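
The hit/miss statistics packed into the status register can be read with small helpers such as the ones below. This is a sketch against the proposed register layout above; the helper names (icache_stats_enable, icache_hit_count, icache_miss_count) are hypothetical, and the 8-bit counters wrap quickly, so they are only suitable for coarse profiling.

// Enable hit/miss statistics collection (sketch; uses the proposed ICACHE_STATS bit)
void icache_stats_enable(void) {
    ICACHE_CTRL_REG |= ICACHE_STATS;
}

// Read the 8-bit hit counter from the low byte of the status register
uint8_t icache_hit_count(void) {
    return (uint8_t)(ICACHE_STAT_REG & ICACHE_HIT_CNT);
}

// Read the 8-bit miss counter from the high byte of the status register
uint8_t icache_miss_count(void) {
    return (uint8_t)((ICACHE_STAT_REG & ICACHE_MISS_CNT) >> 8);
}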

4. Hardware Implementation Requirements

Cache Parameters

  • Size: 512 bytes (256 entries × 16 bits per instruction)
    • Small enough to fit on FPGA without excessive resources
    • Large enough to capture common loops and function bodies
  • Organization: Direct-mapped (simplest to implement)
    • Each address maps to exactly one cache location
    • Index = address[7:0], Tag = address[15:8] (see the address-split sketch after this list)
  • Policy: Stores bypass the cache entirely (simplest way to maintain coherency)
    • All writes go directly to memory
    • The cache is used only for instruction fetches; stale lines are handled by explicit invalidation
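
As a concrete illustration of the direct-mapped address split, the same index/tag computation performed by the Verilog above can be written in C (the macro names are hypothetical and assume <stdint.h> from the previous section):

// Address split for the 256-entry direct-mapped cache (mirrors the Verilog above)
#define ICACHE_INDEX(addr)  ((uint8_t)((addr) & 0x00FFu))         // Lower 8 bits: line index
#define ICACHE_TAG(addr)    ((uint8_t)(((addr) >> 8) & 0x00FFu))  // Upper 8 bits: tag

// Example: fetch address 0x1234 -> index 0x34, tag 0x12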

Critical Hardware Constraints

  1. Device Page Handling:

    // Device space (0xF800-0xFFFF) must bypass cache
    wire device_space = (address[15:11] == 5'b11111);
    wire use_cache    = !device_space;
    
  2. Paging Interaction:

    // Cache must respect page boundaries:
    // invalidate every entry whenever the page tables are written
    integer i;
    always @(posedge clock) begin
        if (page_table_write) begin
            for (i = 0; i < 256; i = i + 1)
                cache_valid[i] <= 1'b0;
        end
    end
    
  3. Timing Constraints:

    • Cache lookup must complete within one clock cycle
    • Cache fill can take multiple cycles
    • Critical path: tag comparison → hit determination → instruction selection

5. Microcode Changes

The Magic-1's fetch microcode would need relatively small changes, sketched schematically here:

// Add cache control sequence
cache_fill:
    mem_read         // Read memory
    cache_write      // Write to cache
    next_instr       // Continue
    
// Modify fetch sequence to check cache
fetch:
    cache_check      // Check cache first
    br.hit next      // If hit, continue
    goto cache_fill  // If miss, fill cache
next:
    decode           // Decode instruction
    execute          // Execute instruction

6. Performance Implications

Based on typical program behavior, rough estimates are:

  • Expected Hit Rate: ~85-90% for most code
  • Performance Improvement:
    • ~40-60% overall speedup for computation-heavy code
    • ~10-20% for I/O-bound code
  • Memory Traffic Reduction: ~75-80% fewer instruction fetches going to main memory
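
As a rough worked example (the cycle counts are illustrative assumptions, not measured Magic-1 timings): with a single-cycle cache hit, a four-cycle memory fetch, and an 85% hit rate, the average instruction fetch takes about 0.85 × 1 + 0.15 × 4 ≈ 1.5 cycles instead of 4, nearly a 3× reduction in fetch time. The overall speedup is smaller because decode, execute, and data accesses are unaffected.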

7. Technical Implementation Challenges

  1. Cache Coherency:

    • Self-modifying code needs special handling
    • Solution: Invalidate or flush the affected cache lines whenever code pages are written (see the sketch after this list)
  2. Page Transitions:

    • Cache must be flushed when page tables change
    • Add a signal from the paging unit to trigger cache flush
  3. Context Switching:

    • Cache could be shared across contexts or flushed on context switch
    • Recommended: share cache but flush on major context switches
  4. Resource Constraints:

    • FPGA resource usage must remain reasonable
    • Time-multiplexed tag comparison can reduce logic requirements
  5. Magic-1 Microcode Integration:

    • Limited microcode space requires efficient implementation
    • May need to repurpose existing microcode operations
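
For the self-modifying-code case in item 1, the software side can reuse icache_invalidate_line() from section 3. The sketch below assumes a byte-granular patch; patch_code_byte() is a hypothetical helper used only for illustration.

// Patch one byte of code, then invalidate the cache line covering it so the
// next fetch sees the new instruction (hypothetical helper; pointer values
// fit in 16 bits on Magic-1)
void patch_code_byte(uint8_t *code_addr, uint8_t new_byte) {
    *code_addr = new_byte;                         // Store goes straight to memory; the cache is fetch-only
    icache_invalidate_line((uint16_t)code_addr);   // Drop any stale cached copy of this location
}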

8. Integration with Bootloader

The bootloader would need modifications to initialize the cache:

// During system initialization
void setup_cache() {
    // Initial flush to clear any undefined state
    icache_flush();
    
    // Enable cache after memory system is initialized
    icache_enable();
    
    // Register cache flush handler for page table changes
    register_pt_change_handler(icache_flush);
}

9. Critical Technical Constraints

  1. Word Alignment:

    • The cache must maintain Magic-1's requirement for 16-bit word alignment
    • All cached instructions must be aligned to even addresses
  2. Device Page Access:

    • Instructions in device space (0xF800-0xFFFF) must never be cached
    • Hardware must detect device address ranges and bypass cache
  3. Paging Interaction:

    • The cache is indexed and tagged with virtual (pre-translation) addresses
    • It must therefore be flushed whenever the page table base (PTB) register changes
  4. Interrupt Handling:

    • Interrupt vector fetches must still work with caching
    • Consider pre-loading interrupt vectors into cache

This implementation approach respects Magic-1's existing architecture while providing a significant performance boost through efficient instruction caching, particularly for code with loops and frequently executed functions.