Architecture #3 - tong-ece-cmu/Unnamed-Simulator GitHub Wiki

Architecture #3

Architecture #3 is a four-stage pipelined RISC-V RV32I CPU written in SystemVerilog, with data forwarding to minimize stalls. It has been redesigned from the ground up to accommodate a more complex memory hierarchy that includes a 4KiB cache. Because it is a pipelined CPU, the entire pipeline needs to freeze during a memory read. This is a big disadvantage compared to a dynamically scheduled architecture with reservation stations. However, the pipelined architecture should take less floor space on the silicon die.

Table of Contents

Brief Cycle Outline

At the first clock edge, the instruction becomes available to the Instruction Decode module.

In the period after, check whether the register file needs to be read, and read it if so.

At the second clock edge, the operands and immediate field are available; execution starts.

In the period after, decode the immediate and execute to produce the result.

At the third clock edge, the execution results are available and the cache module starts.

In the period after, the cache module compares the tag and returns the data, reading DRAM if needed. On a DRAM read, the pipeline freezes and the fourth clock edge is delayed.

At the fourth clock edge, the cache load/store result is ready, or the execution result if the instruction is not a LOAD/STORE.

In the period after, write the result to the register file.
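The outline above can be condensed into a simple timing model. The sketch below is a behavioral illustration, not part of the RTL; the function name and the 10-cycle miss penalty are assumptions chosen for the example.

```python
# Minimal behavioral sketch of the four-stage pipeline timing.
# A memory access that misses in the cache freezes the whole pipeline
# for the duration of the DRAM access; everything else flows one stage
# per cycle.

STAGES = ["DECODE", "EXECUTE", "CACHE", "WRITEBACK"]

def cycles_for_program(instructions, miss_penalty=10):
    """Count total cycles, freezing the pipeline on each cache miss.

    `instructions` is a list of dicts like {"is_load": bool, "hit": bool}.
    With no misses, the pipeline finishes N instructions in
    N + len(STAGES) - 1 cycles; each miss adds `miss_penalty` freeze
    cycles on top.
    """
    cycles = len(instructions) + len(STAGES) - 1
    for inst in instructions:
        if inst.get("is_load") and not inst.get("hit", True):
            cycles += miss_penalty  # whole pipeline frozen during DRAM read
    return cycles

prog = [
    {"is_load": False},
    {"is_load": True, "hit": True},
    {"is_load": True, "hit": False},  # miss: pipeline freezes
]
print(cycles_for_program(prog))  # 3 + 3 + 10 = 16
```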

Main Module

This is where the parts of the architecture are assembled together. All the modules are instantiated with implicit port connections, which makes the code significantly cleaner and simpler compared to the previous two architectures. The main module also generates the clock and reset signals for testing, removing the need for a separate test-bench file.

Instruction Decode Module

This is where most of the control signals are generated. It generates the read and address signals for the register file, as well as the signals for data forwarding.

The read signal for the register file is generated by checking the opcode field of the instruction. The address is assigned directly from the address field in the instruction, because for every instruction that needs to read a register, the address fields sit at the same bit positions.

The data forwarding signals are likewise generated by checking the instruction.
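As an illustration of why the register addresses can be assigned directly, here is a software sketch of the field extraction. The bit positions are fixed by the RV32I encoding (rs1 and rs2 occupy the same bits in every format that uses them); the function itself is ours, not taken from the RTL.

```python
# Behavioral sketch of the decode module's field extraction.
# In RV32I, opcode, rd, rs1, and rs2 always sit at fixed bit
# positions, so the register file addresses can be wired straight
# from the instruction word.

def decode_fields(inst):
    """Extract opcode and register address fields from a 32-bit word."""
    return {
        "opcode": inst & 0x7F,          # bits [6:0]
        "rd":     (inst >> 7) & 0x1F,   # bits [11:7]
        "rs1":    (inst >> 15) & 0x1F,  # bits [19:15]
        "rs2":    (inst >> 20) & 0x1F,  # bits [24:20]
    }

# "add x3, x1, x2" encodes as 0x002081B3
fields = decode_fields(0x002081B3)
print(fields)  # {'opcode': 51, 'rd': 3, 'rs1': 1, 'rs2': 2}
```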

DRAM Module

The DRAM module models the delay of off-chip memory. It receives its commands from the Cache module. After the third clock edge, the cache hardware detects whether there is a cache miss; on a miss, it notifies the memory to start reading and delays the fourth clock edge. After the DRAM finishes fetching the data, it notifies the cache, and the cache then notifies the rest of the architecture to resume processing.

This DRAM module uses an 8-bit wide data line. Different vendors use different data widths: Samsung DDR3 memory chips use an 8-bit data line, while other DDR3 parts use 16-bit. We are mostly interested in the delay in this model, so the 8-bit data width is chosen for its simplicity.

Each cycle it receives a command from the cache, which can be IDLE, READ, or WRITE. After receiving a READ or WRITE command, the dram_ready signal drops low and the module enters its working state: the state machine starts up. The state machine models both the latency of the memory and the transfer of a cache data block.

A parameter sets the LATENCY of the DRAM. The state machine waits for LATENCY clock cycles, then starts transferring data. Each cache data block is 32 bytes, so it takes 32 cycles for the DRAM to transfer all the data over the 8-bit line.
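The delay model above can be sketched as a small state machine. The state names and the LATENCY value below are illustrative; only the LATENCY parameter's existence and the 32-byte block size come from the design.

```python
# Behavioral sketch of the DRAM delay model: after a READ request,
# wait LATENCY cycles, then stream the 32-byte block one byte per
# cycle over the 8-bit data line. dram_ready stays low the whole time.

LATENCY = 10      # illustrative value for the parameter
BLOCK_BYTES = 32  # one cache data block

def dram_read_cycles(latency=LATENCY, block_bytes=BLOCK_BYTES):
    """Total cycles dram_ready stays low for one block read."""
    cycles = 0
    state = "WAIT"
    remaining = latency
    while state != "DONE":
        cycles += 1
        if state == "WAIT":
            remaining -= 1
            if remaining == 0:
                state, remaining = "TRANSFER", block_bytes
        elif state == "TRANSFER":
            remaining -= 1  # one byte moved per cycle on the 8-bit bus
            if remaining == 0:
                state = "DONE"
    return cycles

print(dram_read_cycles())  # 10 + 32 = 42
```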

The data from the DRAM goes straight into the cache's data flip-flops, with no flip-flops in between. It's unclear how realistic this is; a real chip may have buffers in between to amplify the signal, or may simply use some huge transistors to drive the wire. It's design dependent. In the end, it all comes down to some number of cycles of LATENCY, which is what we model here.

The DRAM currently has 64 bytes of storage space, which is tiny. Changing it to something bigger is trivial. More importantly, we need a bigger program, much bigger than a few lines of assembly code, to exercise all of that memory.

Cache Module

The cache module is responsible for generating the address signals for the DRAM and keeping track of the cached data blocks.

Address Fields

(Figure: breakdown of the address into tag, index, and block offset fields.)
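As a sketch of how the address can break down, the code below assumes a direct-mapped organization with 32-bit addresses; the associativity is our assumption, while the 4 KiB cache size and 32-byte block size come from this wiki.

```python
# Sketch of the tag/index/offset split for a direct-mapped 4 KiB
# cache with 32-byte blocks (associativity is an assumption here).

CACHE_BYTES = 4 * 1024
BLOCK_BYTES = 32
NUM_LINES = CACHE_BYTES // BLOCK_BYTES       # 128 lines

OFFSET_BITS = BLOCK_BYTES.bit_length() - 1   # 5 bits of block offset
INDEX_BITS = NUM_LINES.bit_length() - 1      # 7 bits of line index

def split_address(addr):
    """Break a 32-bit address into (tag, index, offset) fields."""
    offset = addr & (BLOCK_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_LINES - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

print(split_address(0x1F64))  # (1, 123, 4)
```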

Write Policy

There are four cases to think about: a LOAD instruction with a cache hit, a LOAD instruction with a cache miss, a STORE instruction with a cache hit, and a STORE instruction with a cache miss.

PLAN A

  • LOAD instruction, cache hit: just read from the cache.

  • LOAD instruction, cache miss: write back the old cache block if valid, tell the DRAM to read, store the DRAM data in the cache, and present the data to the CPU.

  • STORE instruction, cache hit: just put the data in the cache.

  • STORE instruction, cache miss: write back the old cache block if valid, tell the DRAM to read, store the DRAM data in the cache, then overwrite it with the new data.

PLAN B

  • LOAD instruction, cache hit: just read from the cache.

  • LOAD instruction, cache miss: tell the DRAM to read, store the DRAM data in the cache, and present the data to the CPU.

  • STORE instruction, cache hit: write the whole block with the new data to the DRAM.

  • STORE instruction, cache miss: tell the DRAM to read, store the DRAM data in the cache, and write the block with the new data back to the DRAM.

So PLAN A costs 0, 2, 0, 2 DRAM accesses and PLAN B costs 0, 1, 1, 2 for the four cases above. PLAN B is optimized for reading, while PLAN A spreads the delay evenly between loads and stores. It's possible to devise a plan that optimizes for writing, but that would require much larger changes to the hardware structure.
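The trade-off can be made concrete with a small cost model using the access counts from the text; the case names and the example event mix below are illustrative.

```python
# Compare the DRAM access counts of PLAN A and PLAN B using the
# 0/2/0/2 and 0/1/1/2 figures above (accesses per cache event).

PLAN_A = {"load_hit": 0, "load_miss": 2, "store_hit": 0, "store_miss": 2}
PLAN_B = {"load_hit": 0, "load_miss": 1, "store_hit": 1, "store_miss": 2}

def dram_accesses(plan, counts):
    """Total DRAM accesses for a given mix of cache events."""
    return sum(plan[case] * n for case, n in counts.items())

# A read-heavy mix: many loads, few stores.
mix = {"load_hit": 80, "load_miss": 10, "store_hit": 8, "store_miss": 2}
print(dram_accesses(PLAN_A, mix))  # 10*2 + 2*2 = 24
print(dram_accesses(PLAN_B, mix))  # 10*1 + 8*1 + 2*2 = 22
```

As expected, PLAN B wins on read-heavy mixes, while a store-heavy mix shifts the balance toward PLAN A.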

We implemented PLAN A; it is more balanced and gives the programmer a more predictable delay.