Proposed architecture - aifoundry-org/erbium GitHub Wiki

Introduction

Erbium is the first community-driven tapeout merging ET RTL with MRAM, creating a basis for the next-generation ET SoC design. It can be deployed either in a traditional configuration, with the host CPU accessing Erbium as an intelligent RAM (replacing SRAM and Flash) via xSPI, or as a self-hosted array of microcontrollers (with or without a host CPU). Because the second kind of deployment is largely self-contained, it is also a promising testbed for the SoC1.5 (AKA Crookshanks) architecture we are considering.

From the commercial point of view, the goal of Erbium is not to disrupt the microcontroller/NPU market (although that would be nice -- hi there Arduino folks!), but rather to let us test various hypotheses in the software and hardware domains quickly enough to make sure we're ready for much bigger tapeouts.

Erbium Hardware Features

Processor(s)
- Full ET Neighborhood of 8 Minion cores running at 1GHz
- 1 TOPS compute power
  - these should be the peak ops per cycle per Minion (per-lane figures in parentheses; each Minion has 8 lanes):
    - fp32: 16 ops (per lane: 1 TFMA block does 2 ops)
    - fp16: 32 ops (per lane: 1 TFMA block does 4 ops)
    - int8: 128 ops (per lane: 2 TIMA blocks do 8 ops each, 16 ops total)
- 32KB shared instruction cache (2 L0 caches x 4 minions, 1 L1 cache)
- interrupt controller (1 GPIO pin to the PLIC)
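
The per-lane figures above can be cross-checked against the 1 TOPS headline with a short calculation. This is only a sanity-check sketch: the constants are taken from the list above, while the function and variable names are illustrative and not part of any Erbium tooling.

```python
# Peak-throughput sanity check for one ET Neighborhood.
# Constants come from the feature list; all names here are illustrative.

MINIONS = 8                # Minion cores in the Neighborhood
LANES_PER_MINION = 8       # lanes per Minion
CLOCK_HZ = 1_000_000_000   # 1GHz

# Peak ops per cycle per lane, by data type.
OPS_PER_LANE = {
    "fp32": 2,    # 1 TFMA block, 2 ops
    "fp16": 4,    # 1 TFMA block, 4 ops
    "int8": 16,   # 2 TIMA blocks, 8 ops each
}


def ops_per_cycle_per_minion(dtype: str) -> int:
    """Peak ops one Minion can issue per cycle for the given data type."""
    return LANES_PER_MINION * OPS_PER_LANE[dtype]


def peak_ops_per_second(dtype: str) -> int:
    """Peak ops per second across the whole Neighborhood."""
    return MINIONS * ops_per_cycle_per_minion(dtype) * CLOCK_HZ


if __name__ == "__main__":
    for dtype in OPS_PER_LANE:
        print(f"{dtype}: {ops_per_cycle_per_minion(dtype)} ops/cycle/Minion, "
              f"{peak_ops_per_second(dtype) / 1e12:.3f} TOPS peak")
```

Under these numbers the 1 TOPS figure corresponds to the int8 peak (8 Minions x 128 ops x 1GHz); the fp16 and fp32 peaks are a quarter and an eighth of that.
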
"CIM" features
- RNG
Memory
- 16MB MRAM configured 64 bits wide (79 bits total per word: 64 data bits, 14 bits for 2-bit ECC, and 1 bit for column redundancy)
  - Internal configuration is as follows:
    - 4 banks (each bank is 4MB)
      - 4 stripes per bank
        - 2 instances per stripe
        - 4 blocks per instance
        - 2 planes per block
        - 512 wordlines per plane (plus 12 wordlines for row redundancy and 1 OTP wordline)
        - 16 words per wordline (plus 79 bits for reference)
        - each word is 64 bits
  - Read access < 3ns
  - Write cycle ~20ns
    - 3ns preRead, 14ns write, 3ns verify (currently planned first-pass write fail rate: 200ppm)
    - If verify fails, only the failing bits are rewritten, with up to 8 retries
    - While a write is active, the busy signal is high. When the write completes, the busy signal goes low
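
As a sanity check, the MRAM geometry above should multiply out to 4MB per bank and 16MB in total, and the three write phases should add up to the ~20ns write cycle. The sketch below does that arithmetic. The names are illustrative, and the retry cost model (each retry repeating only the write and verify phases) is an assumption that the list above does not specify.

```python
# MRAM geometry and write-timing sanity check. All names are illustrative.

BANKS = 4
STRIPES_PER_BANK = 4
INSTANCES_PER_STRIPE = 2
BLOCKS_PER_INSTANCE = 4
PLANES_PER_BLOCK = 2
WORDLINES_PER_PLANE = 512    # excludes 12 row-redundancy wordlines and 1 OTP wordline
WORDS_PER_WORDLINE = 16      # excludes the 79 reference bits
DATA_BITS_PER_WORD = 64
ECC_BITS = 14
COLUMN_REDUNDANCY_BITS = 1

PRE_READ_NS, WRITE_NS, VERIFY_NS = 3, 14, 3
MAX_RETRIES = 8


def words_per_bank() -> int:
    """Number of 64-bit data words in one bank."""
    return (STRIPES_PER_BANK * INSTANCES_PER_STRIPE * BLOCKS_PER_INSTANCE
            * PLANES_PER_BLOCK * WORDLINES_PER_PLANE * WORDS_PER_WORDLINE)


def data_bytes_per_bank() -> int:
    """Usable data capacity of one bank in bytes."""
    return words_per_bank() * DATA_BITS_PER_WORD // 8


def total_data_bytes() -> int:
    """Usable data capacity of the whole MRAM in bytes."""
    return BANKS * data_bytes_per_bank()


def physical_bits_per_word() -> int:
    """Data bits plus ECC and column-redundancy bits."""
    return DATA_BITS_PER_WORD + ECC_BITS + COLUMN_REDUNDANCY_BITS


def write_latency_ns(retries: int) -> int:
    """Write latency for a given retry count.

    ASSUMPTION: a retry repeats only the write and verify phases;
    the preRead happens once. The source does not state this.
    """
    if not 0 <= retries <= MAX_RETRIES:
        raise ValueError(f"retries must be between 0 and {MAX_RETRIES}")
    return PRE_READ_NS + (1 + retries) * (WRITE_NS + VERIFY_NS)


if __name__ == "__main__":
    print(f"per bank: {data_bytes_per_bank() // 2**20} MiB")
    print(f"total:    {total_data_bytes() // 2**20} MiB")
    print(f"word:     {physical_bits_per_word()} bits")
    print(f"write:    {write_latency_ns(0)} ns first pass, "
          f"{write_latency_ns(MAX_RETRIES)} ns with {MAX_RETRIES} retries (assumed model)")
```

With a first-pass fail rate of 200ppm, the retry path should be rare, so the first-pass latency dominates in practice; the worst-case figure matters mainly for sizing how long the host may see the busy signal held high.
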
I/O
- Hyperbus to the Host
- I2C/SPI to the Neighborhood
- GPIOs
  - Interrupt
Power
- Budget: 0.5W
- IO supply for Erbium is 1.8V
- Requires 1.8V and 0.8V for operation
Die size estimate 4.2mm x 4.2mm
TSMC 16nm fabrication
Package: 64 pin, 9x9 QFN
Evaluation board?
1. Raspberry Pi based?
2. NXP micro based?
3. Maybe small FPGA based?
Still to work out:
- Clocks
- Timers

Platform roadmap to consider for Erbium

- As Ying pointed out, "Most of the et-soc-1 verification was done using Asm or C code, hence we needed a toolchain. On top of that, we used a cosim monitor that kept track of all the arch state changes and compared them against BEMU (golden simulator)". We need to start capturing this workflow to demonstrate our open source software/hardware co-design.
- sysemu becomes our test vehicle for quickly prototyping software layers on top of Erbium. This is one more open source software/hardware co-design proof point that we need to achieve in the community. It includes simulating not just the Neighborhood compute side, but also the behaviour of the software APIs for things like MRAM and I/O.
- As Gianluca pointed out, we may actually prototype "to have a host runtime that behaves like the service processor and sending ops through hyperbus, they have a mailbox, so we could even run the machine+master+worker minion. Where the master minion would be only one, and would also have the control of mram etc (and may end up compute on idle, possibly)".
- Neighborhood RTL is expected to be extracted and FPGA'able (potentially the entire Erbium could be FPGA'able).
- Based on the Erbium work and the item above, we expect Neighborhood RTL to be the first Open Source RTL that drops on the AIFoundry side, following up on our Open Source RTL commitment.
- We have started some work on exploiting CIM ideas on the software side for a quantized KV cache. This is currently not supported by any emulation or hardware simulation framework (see below, though).
- An interesting prototyping idea is to slice the existing SOC1 into Neighborhood-sized compute units and load software that way.
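
The cosim workflow Ying describes (record every architectural state change and compare it against a golden simulator) can be sketched as a lockstep trace comparison. This is a minimal illustration, not the actual et-soc-1 monitor: `StateChange`, `Mismatch`, and `first_mismatch` are hypothetical names, and the trace fields are assumptions rather than the BEMU or sysemu interface.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class StateChange:
    """One architectural state update (hypothetical trace record)."""
    hart: int    # hardware thread that retired the instruction
    pc: int      # program counter of the retiring instruction
    reg: str     # destination that changed, e.g. "x5"
    value: int   # value written


@dataclass(frozen=True)
class Mismatch:
    """First point where the design under test and the golden model diverge."""
    index: int
    dut: Optional[StateChange]
    golden: Optional[StateChange]


def first_mismatch(dut: Iterable[StateChange],
                   golden: Iterable[StateChange]) -> Optional[Mismatch]:
    """Walk both traces in lockstep and return the first divergence, if any.

    A trace that ends early is reported as a mismatch against None.
    """
    dut_it, golden_it = iter(dut), iter(golden)
    index = 0
    while True:
        d = next(dut_it, None)
        g = next(golden_it, None)
        if d is None and g is None:
            return None          # both traces ended and agreed throughout
        if d != g:
            return Mismatch(index, d, g)
        index += 1


if __name__ == "__main__":
    golden_trace = [StateChange(0, 0x1000, "x5", 1), StateChange(0, 0x1004, "x6", 2)]
    rtl_trace = [StateChange(0, 0x1000, "x5", 1), StateChange(0, 0x1004, "x6", 3)]
    print(first_mismatch(rtl_trace, golden_trace))
```

Stopping at the first divergence keeps the report close to the root cause, since later mismatches are usually consequences of the first one. The same comparison could in principle run against traces from RTL simulation, an FPGA build, or sysemu, provided they emit a common record format.
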

System roadmap to consider for Erbium

We need to start figuring out the following things that would support the Erbium design:
- inference frameworks; current candidates to look at:
  - emlearn, which may be too tiny and restrictive but sets a good baseline and integrates with RTOSes (Zephyr in this particular case)
  - tinygrad (generating offload code a la what they do for WebGPU support)
  - can ggml be run on small targets?
- small but non-trivial models to run on Erbium; current candidates:
  - YOLO (which one?)
  - Depth Anything
  - the collection of models from emlearn (could be too trivial, but we need to decide)
- system integration utilities we need to develop to be successful here:
  - CLI for Erbium profile for sysemu/SoC1?