The RISC-V Reader Notes

Chap 1. Why RISC-V?

1.2 Modular vs. Incremental ISAs

At the core is a base ISA, called RV32I, which runs a full software stack. RV32I is frozen and will never change, which gives compiler writers, operating system developers, and assembly language programmers a stable target.

1.3 ISA Design 101

  1. cost (US dollar coin icon)
  2. simplicity (wheel)
  3. performance (speedometer)
  4. isolation of architecture from implementation (detached halves of a circle)
  5. room for growth (accordion)
  6. program size (opposing arrows compressing line)
  7. ease of programming / compiling / linking (children’s blocks for “as easy as ABC”).

  • macro-fusion: High-end processors can gain performance by combining simple instructions together without burdening all lower-end implementations with a larger, more complicated ISA.

  • It’s useful for an ISA to support position independent code (PIC), because it supports dynamic linking, since shared library code can reside at different addresses in different programs. PC-relative branches and data addressing are a boon to PIC.

Chap 2. RV32I: RISC-V Base Integer ISA

2.3 RV32I Registers

  • Dedicating a register to zero is a surprisingly large factor in simplifying the RISC-V ISA.
  • The PC is one of ARM-32’s 16 registers, which means that any instruction that changes a register may, as a side effect, also be a branch instruction. The PC as a register complicates hardware branch prediction, whose accuracy is vital for good pipelined performance, since every instruction might be a branch instead of the 10–20% of instructions executed in programs for typical ISAs. It also means one less general-purpose register.

2.4 RV32I Integer Computation

  • First, there are no byte or half-word integer computation operations. The operations are always the full register width. Memory accesses take orders of magnitude more energy than arithmetic operations, so narrow data accesses can save significant energy, but narrow operations do not.
  • Nor does RV32I include multiply and divide; they comprise the optional RV32M extension.
  • RV32I also omits rotate instructions and detection of integer arithmetic overflow.

2.5 RV32I Loads and Stores

  • The only addressing mode for loads and stores is adding a sign-extended 12-bit immediate to a register, called displacement addressing mode in x86-32.
  • Unlike x86-32, RISC-V has no special stack instructions. By using one of the 31 registers as the stack pointer, the standard addressing mode gets most of the benefits of push and pop instructions without the added ISA complexity.
  • Unlike MIPS-32, RISC-V rejected delayed load.
  • While ARM-32 and MIPS-32 require data to be aligned naturally to data-sized boundaries in memory, RISC-V does not.

2.6 RV32I Conditional Branch

The branch addressing mode multiplies the 12-bit immediate by 2, sign-extends it, and then adds it to the PC. PC-relative addressing helps with position independent code and thereby reduces the work of the linker and loader.
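The target computation described above can be sketched in Python. This is a minimal model, not hardware-accurate decoding: the helper names `sign_extend` and `branch_target` are hypothetical, and the 12-bit immediate is assumed to be already extracted from the instruction word.

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of `value` as a two's-complement number."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def branch_target(pc: int, imm12: int) -> int:
    """Target of a taken branch: PC + 2 * sign-extended 12-bit immediate."""
    offset = sign_extend(imm12, 12) * 2
    return (pc + offset) & 0xFFFFFFFF  # wrap to 32 bits


# A field value of 4 jumps 8 bytes forward; 0xFFF (-1) jumps 2 bytes back.
forward = branch_target(0x1000, 4)
backward = branch_target(0x1000, 0xFFF)
```

Because the immediate is doubled, a 12-bit field reaches roughly 4 KiB in either direction from the branch.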

  • RISC-V excluded the infamous delayed branch of MIPS-32, Oracle SPARC, and others.
  • It also avoided the condition codes of ARM-32 and x86-32 for conditional branches. They add extra state that is implicitly set by most instructions, which needlessly complicates the dependence calculation for out-of-order execution.
  • Finally, it omitted the loop instructions of the x86-32: loop, loope, loopz, loopne, loopnz.

Elaboration: Reading the PC

  • The current PC can be obtained by setting the U-immediate field of auipc to 0.
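A small Python model shows why a zero immediate yields the PC. It assumes the usual auipc semantics (the 20-bit U-immediate is shifted left by 12 and added to the PC); the function name `auipc` is just a label for this sketch.

```python
def auipc(pc: int, imm20: int) -> int:
    """Value written to rd by auipc: PC + (20-bit immediate << 12), modulo 2**32."""
    return (pc + ((imm20 & 0xFFFFF) << 12)) & 0xFFFFFFFF


# With a zero immediate the result is simply the address of the auipc itself.
current_pc = auipc(0x80000010, 0)
```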

Elaboration: Software checking of overflow

  • unsigned
add t0, t1, t2
bltu t0, t1, overflow
  • signed
add t0, t1, t2
slti t3, t2, 0        # t3 = (t2<0)
slt t4, t0, t1        # t4 = (t1+t2<t1)
bne t3, t4, overflow  # overflow if (t2<0) && (t1+t2>=t1)
                      # || (t2>=0) && (t1+t2<t1)
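The two checks can be mirrored in Python on 32-bit values to see that they agree with exact arithmetic. This is a sketch: registers are modeled as unsigned integers in the range 0 to 2**32 - 1, and the helper names are hypothetical.

```python
MASK = 0xFFFFFFFF


def to_signed(x: int) -> int:
    """Reinterpret a 32-bit pattern as a signed two's-complement value."""
    return x - (1 << 32) if x & 0x80000000 else x


def add32(a: int, b: int) -> int:
    """32-bit wrapping add, like the add instruction."""
    return (a + b) & MASK


def unsigned_overflow(t1: int, t2: int) -> bool:
    t0 = add32(t1, t2)
    return t0 < t1  # bltu t0, t1, overflow


def signed_overflow(t1: int, t2: int) -> bool:
    t0 = add32(t1, t2)
    t3 = to_signed(t2) < 0              # slti t3, t2, 0
    t4 = to_signed(t0) < to_signed(t1)  # slt  t4, t0, t1
    return t3 != t4                     # bne  t3, t4, overflow
```

The signed check works because adding a negative number should make the sum smaller and adding a non-negative number should not; overflow is exactly the case where that expectation fails.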

2.7 RV32I Unconditional Jump

The single jump and link instruction (jal) serves dual functions:

  1. To support procedure calls, it saves the address of the next instruction PC+4 into the destination register, normally the return address register ra.
  2. To support unconditional jumps, we use the zero register (x0) instead of ra as the destination register, as x0 can’t be changed.
  • RV32I shunned intricate procedure call instructions, such as the enter and leave instructions of the x86-32, or register windows as found in the Intel Itanium, Oracle SPARC, and Cadence Tensilica.
  • Register windows: accelerated function call by having many more registers than 32. A new function would get a new set or window of 32 registers on a call. To pass arguments, the windows overlapped, meaning some registers were in two adjacent windows.
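The dual role of jal can be illustrated with a toy register file in which x0 is hardwired to zero. This is a sketch only: the class `Hart` and its methods are hypothetical, and the `jalr` method is included on the assumption that ret expands to `jalr x0, 0(ra)`.

```python
RA = 1  # x1 is the return address register ra


class Hart:
    """Toy model: 32 integer registers plus a PC."""

    def __init__(self):
        self.x = [0] * 32
        self.pc = 0

    def write_reg(self, rd: int, value: int) -> None:
        if rd != 0:  # writes to x0 are discarded
            self.x[rd] = value & 0xFFFFFFFF

    def jal(self, rd: int, offset: int) -> None:
        self.write_reg(rd, self.pc + 4)          # link: address of next instruction
        self.pc = (self.pc + offset) & 0xFFFFFFFF

    def jalr(self, rd: int, rs1: int, offset: int) -> None:
        target = (self.x[rs1] + offset) & 0xFFFFFFFE  # low bit cleared
        self.write_reg(rd, self.pc + 4)
        self.pc = target


h = Hart()
h.pc = 0x100
h.jal(RA, 0x40)    # call: ra = 0x104, pc = 0x140
h.jalr(0, RA, 0)   # return: pc = 0x104, nothing is linked
h.jal(0, -4)       # plain jump: pc = 0x100, x0 stays zero
```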

2.8 RV32I Miscellaneous

  • RISC-V uses memory mapped I/O instead of the in, ins, insb, insw and out, outs, outsb, outsw instructions of the x86-32.
  • It supports strings using byte loads and stores instead of the 16 special string instructions of the x86-32: rep, movs, cmps, scas, lods, ....

Chap 3. RISC-V Assembly Language

3.2 Calling convention

  1. Place the arguments where the function can access them.
  2. Jump to the function (using RV32I’s jal).
  3. Acquire local storage resources the function needs, saving registers as required.
  4. Perform the desired task of the function.
  5. Place the function result value where the calling program can access it, restore any registers, and release any local storage resources.
  6. Since a function can be called from several points in a program, return control to the point of origin (using ret).
  • The calling convention designates some registers as temporary registers, which are not guaranteed to be preserved across a function call, and others as saved registers, which are.
  • Functions that avoid calling other functions are called leaf functions. When a leaf function has only a few arguments and local variables, we can keep everything in registers without “spilling” any to memory.

3.3 Assembly

3.4 Linker

Elaboration: Linker relaxation

3.5 Static vs. Dynamic Linking

Chap 4. RV32M: Multiply and Divide

Chap 5. RV32F and RV32D: Single- and Double-Precision Floating Point

5.2 Floating-Point Registers

Elaboration: RV32FD allows the rounding mode to be set per instruction. (static rounding)

5.3 Floating-Point Loads, Stores, and Arithmetic

Instead of floating-point branch instructions, RV32F and RV32D supply comparison instructions that set an integer register to 1 or 0 based on comparison of two floating-point registers: feq.s, feq.d, flt.s, flt.d, fle.s, fle.d.

Chap 6. RV32A: Atomic Instructions

6.1 Introduction

RV32A has two types of atomic operations for synchronization:

  • atomic memory operations (AMO)
  • load reserved / store conditional

The AMO instructions atomically perform an operation on an operand in memory and set the destination register to the original memory value. Atomic means there can be no interrupt between the read and the write of memory, nor can other processors modify the memory value between the memory read and write of the AMO instruction.

Load reserved reads a word from memory, writes it to the destination register, and records a reservation on that word in memory.

Store conditional stores a word at the address in a source register provided there exists a load reservation on that memory address. It writes zero to the destination register if the store succeeded, or a nonzero error code otherwise.

# Compare-and-swap (CAS) memory word M[a0] using lr/sc.
# Expected old value in a1; desired new value in a2.
0: 100526af lr.w a3,(a0)       # Load old value
4: 06b69e63 bne a3,a1,80       # Old value equals a1?
8: 18c526af sc.w a3,a2,(a0)    # Swap in new value if so
c: fe069ae3 bnez a3,0          # Retry if store failed
... code following successful CAS goes here ...
80:                            # Unsuccessful CAS.
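The retry loop in the listing above can be modeled in Python to show why the store conditional may fail. This is a simplified sketch: the `Memory` class, the one-reservation-per-hart bookkeeping, and the `interfere` callback (which stands in for another hart writing between lr and sc) are all assumptions made for illustration.

```python
MASK = 0xFFFFFFFF


class Memory:
    """Toy word-addressed memory with at most one load reservation per hart."""

    def __init__(self):
        self.words = {}
        self.reservations = {}  # hart id -> reserved address

    def store(self, addr, value):
        self.words[addr] = value & MASK
        # A write to a word invalidates every reservation held on it.
        self.reservations = {h: a for h, a in self.reservations.items() if a != addr}

    def lr_w(self, hart, addr):
        self.reservations[hart] = addr
        return self.words.get(addr, 0)

    def sc_w(self, hart, addr, value):
        if self.reservations.get(hart) != addr:
            return 1  # nonzero: store failed
        self.store(addr, value)
        return 0      # zero: store succeeded


def compare_and_swap(mem, hart, addr, expected, desired, interfere=None):
    """Return (succeeded, attempts), mirroring the lr/bne/sc/bnez loop."""
    attempts = 0
    while True:
        attempts += 1
        old = mem.lr_w(hart, addr)                 # lr.w a3,(a0)
        if old != expected:                        # bne  a3,a1,80
            return False, attempts
        if interfere is not None:                  # another hart sneaks in once
            interfere()
            interfere = None
        if mem.sc_w(hart, addr, desired) == 0:     # sc.w a3,a2,(a0)
            return True, attempts                  # else bnez a3,0 retries
```

Passing an `interfere` callback that writes to the same word makes the first sc fail, so the loop goes around a second time.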

The rationale for also having AMO instructions is that they scale better to large multiprocessor systems than load reserved and store conditional. They can also be used to implement reduction operations efficiently. AMOs are useful as well for communicating with I/O devices, because they perform a read and a write in a single atomic bus transaction. This atomicity can both simplify device drivers and improve I/O performance.

# Critical section guarded by test-and-set spinlock using an AMO.
0: 00100293 li t0,1                    # Initialize lock value
4: 0c55232f amoswap.w.aq t1,t0,(a0)    # Attempt to acquire lock
8: fe031ee3 bnez t1,4                  # Retry if unsuccessful
... critical section goes here ...
20: 0a05202f amoswap.w.rl x0,x0,(a0)   # Release lock.
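The lock protocol above reduces to "swap in 1, and you own the lock only if the old value was 0". A single-threaded Python model makes that visible; the function names are hypothetical, and the aq/rl ordering bits are not modeled.

```python
def amoswap(mem: dict, addr: int, new_value: int) -> int:
    """Swap new_value into mem[addr] and return the old value (modeled as one step)."""
    old = mem.get(addr, 0)
    mem[addr] = new_value
    return old


def try_acquire(mem: dict, lock_addr: int) -> bool:
    """One pass of the spin loop: li t0,1 ; amoswap.w.aq t1,t0,(a0)."""
    return amoswap(mem, lock_addr, 1) == 0  # old value 0 means the lock was free


def release(mem: dict, lock_addr: int) -> None:
    """amoswap.w.rl x0,x0,(a0): store zero and discard the old value."""
    amoswap(mem, lock_addr, 0)


mem = {}
LOCK = 0x2000                        # hypothetical lock address
first = try_acquire(mem, LOCK)       # hart A gets the lock
second = try_acquire(mem, LOCK)      # hart B sees 1 and would keep spinning
release(mem, LOCK)
third = try_acquire(mem, LOCK)       # hart B succeeds after the release
```

A failed attempt writes 1 over an existing 1, so spinning harts never disturb the lock state.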

Elaboration: Memory consistency models

RISC-V has a relaxed memory consistency model, so other threads may view some memory accesses out of order. An atomic operation with the aq bit (acquire bit) set guarantees that other threads will see the AMO in-order with subsequent memory accesses. If the rl bit (release bit) is set, other threads will see the atomic operation in-order with previous memory accesses.

Chap 7. RV32C: Compressed Instructions

RV32C takes a novel approach: every short instruction must map to one single standard 32-bit RISC-V instruction. Moreover, only the assembler and linker are aware of the 16-bit instructions, and it is up to them to replace a wide instruction with its narrow cousin.
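As an illustration of the assembler's job, the check below decides whether an `addi rd, rs1, imm` could be emitted in the narrow c.addi form. It is a sketch that assumes the c.addi constraints are: same source and destination register, destination not x0, and a nonzero immediate that fits in 6 signed bits. The function name `fits_c_addi` is hypothetical, and other compressed forms that an addi might map to are not considered.

```python
def fits_c_addi(rd: int, rs1: int, imm: int) -> bool:
    """True if `addi rd, rs1, imm` satisfies the assumed c.addi constraints."""
    return rd == rs1 and rd != 0 and imm != 0 and -32 <= imm <= 31


# addi a0, a0, 1 can shrink to 16 bits; addi a0, a1, 1 cannot use this form.
narrow = fits_c_addi(10, 10, 1)
wide = fits_c_addi(10, 11, 1)
```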

Chap 8. RV32V: Vector

8.3 Vector Registers and Dynamic Typing

RV32V takes the novel approach of associating the data type and length with the vector registers rather than with the instruction opcodes.

8.4 Vector Loads and Stores

  • dense arrays
    • single-dimension arrays: vld, vst
    • multi-dimension arrays (strided data transfers): vlds, vsts
  • sparse arrays (indexed data transfers / gather and scatter): vldx, vstx
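The three access patterns can be sketched over a flat Python list. The functions borrow the instruction names from the list above but model only the addressing: memory is element-indexed rather than byte-addressed, and the parameter `vl` (number of elements to transfer) is an assumption of this sketch.

```python
def vld(mem, base, vl):
    """Unit-stride load: vl consecutive elements starting at base."""
    return [mem[base + i] for i in range(vl)]


def vlds(mem, base, stride, vl):
    """Strided load: every stride-th element, e.g. one column of a row-major matrix."""
    return [mem[base + i * stride] for i in range(vl)]


def vldx(mem, base, index, vl):
    """Indexed load (gather): element i comes from base + index[i]."""
    return [mem[base + index[i]] for i in range(vl)]


matrix = list(range(12))                     # 3 rows x 4 columns, row-major
row1 = vld(matrix, 4, 4)                     # second row
col1 = vlds(matrix, 1, 4, 3)                 # second column, stride = row length
diagonal = vldx(matrix, 0, [0, 5, 10], 3)    # gather along the diagonal
```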

8.10 Concluding Remarks

  • SIMD vs. vector:
    • dynamic instruction count
    • SIMD violates the ISA design principle of isolating the architecture from implementation

Chap 9. RV64: 64-bit Address Instructions

Chap 10. RV32/64 Privileged Architecture

10.2 Machine Mode for Simple Embedded Systems

Machine mode, abbreviated as M-mode, is the most privileged mode that a RISC-V hart (hardware thread) can execute in. Harts running in M-mode have full access to memory, I/O, and low-level system features necessary to boot and configure the system. As such, it is the only privilege mode that all standard RISC-V processors implement; indeed, simple RISC-V microcontrollers support only M-mode.

  • Hart is a contraction of hardware thread. Software threads are time-multiplexed on harts. Most processor cores have only one hart.

The most important feature of machine mode is the ability to intercept and handle exceptions: unusual runtime events.

  1. Synchronous exceptions arise as a result of instruction execution, as when accessing an invalid memory address or executing an instruction with an invalid opcode.
  2. Interrupts are external events that are asynchronous with the instruction stream, like a mouse button click.

Synchronous exceptions:

  1. Access fault exceptions arise when a physical memory address doesn’t support the access type—for example, attempting to store to a ROM.
  2. Breakpoint exceptions arise from executing an ebreak instruction, or when an address or datum matches a debug trigger.
  3. Environment call exceptions arise from executing an ecall instruction.
  4. Illegal instruction exceptions result from decoding an invalid opcode.
  5. Misaligned address exceptions occur when the effective address isn’t divisible by the access size—for example, amoadd.w with an address of 0x12.
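The alignment rule in the last item is a single divisibility test, shown here in Python with the example from the text. The function name is hypothetical.

```python
def is_misaligned(addr: int, access_size: int) -> bool:
    """True if addr is not a multiple of the access size in bytes."""
    return addr % access_size != 0


# amoadd.w accesses 4 bytes; 0x12 is 18, which is not divisible by 4.
word_at_0x12 = is_misaligned(0x12, 4)
```

The same address would be fine for a 2-byte access, since 18 is divisible by 2.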

There are three standard sources of interrupts:

  1. software. Software interrupts are triggered by storing to a memory-mapped register and are generally used by one hart to interrupt another hart, a mechanism other architectures refer to as an interprocessor interrupt.
  2. timer. Timer interrupts are raised when the real-time counter mtime reaches or exceeds a hart’s time comparator, a memory-mapped register named mtimecmp.
  3. external

10.3 Machine-Mode Exception Handling

Eight control and status registers (CSRs):

  1. mtvec, Machine Trap Vector, holds the address the processor jumps to when an exception occurs.
  2. mepc, Machine Exception PC, points to the instruction where the exception occurred.
  3. mcause, Machine Exception Cause, indicates which exception occurred.
  4. mie, Machine Interrupt Enable, lists which interrupts the processor can take and which it must ignore.
  5. mip, Machine Interrupt Pending, lists the interrupts currently pending.
  6. mtval, Machine Trap Value, holds additional trap information: the faulting address for address exceptions, the instruction itself for illegal instruction exceptions, and zero for other exceptions.
  7. mscratch, Machine Scratch, holds one word of data for temporary storage.
  8. mstatus, Machine Status, holds the global interrupt enable, along with a plethora of other state.

When a hart takes an exception, the hardware atomically undergoes several state transitions:

  1. The PC of the exceptional instruction is preserved in mepc, and the PC is set to mtvec. (For synchronous exceptions, mepc points to the instruction that caused the exception; for interrupts, it points where execution should resume after the interrupt is handled.)
  2. mcause is set to the exception cause, and mtval is set to the faulting address or some other exception-specific word of information.
  3. Interrupts are disabled by setting MIE=0 in the mstatus CSR, and the previous value of MIE is preserved in MPIE.
  4. The pre-exception privilege mode is preserved in the MPP field of mstatus, and the privilege mode is changed to M. (If the processor only implements M-mode, this step is effectively skipped.)
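The four state transitions can be written out as a small Python model. It is a sketch under stated assumptions: the `TrapState` class is hypothetical, the mstatus fields MIE, MPIE, and MPP are stored as separate attributes, mtvec is treated as a plain address, and the cause value 2 is assumed to be the illegal instruction code.

```python
from dataclasses import dataclass

U_MODE, M_MODE = 0, 3  # assumed privilege-level encodings


@dataclass
class TrapState:
    pc: int = 0
    priv: int = M_MODE
    mtvec: int = 0
    mepc: int = 0
    mcause: int = 0
    mtval: int = 0
    mie: bool = False   # mstatus.MIE
    mpie: bool = False  # mstatus.MPIE
    mpp: int = U_MODE   # mstatus.MPP


def take_trap(h: TrapState, cause: int, tval: int = 0) -> None:
    """Apply the four steps in order; hardware does them atomically."""
    h.mepc = h.pc          # 1. remember where the exception happened...
    h.pc = h.mtvec         #    ...and jump to the handler
    h.mcause = cause       # 2. record the cause and extra information
    h.mtval = tval
    h.mpie = h.mie         # 3. save the old interrupt enable, then disable
    h.mie = False
    h.mpp = h.priv         # 4. save the old privilege mode, then enter M-mode
    h.priv = M_MODE


hart = TrapState(pc=0x80000040, priv=U_MODE, mtvec=0x80000000, mie=True)
take_trap(hart, cause=2, tval=0xFFFFFFFF)  # illegal instruction, tval = its bits
```

Saving MIE and the privilege mode before overwriting them is what lets the handler later restore the interrupted context.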

  • Elaboration: wfi works whether or not interrupts are globally enabled.

10.4 User Mode and Process Isolation in Embedded Systems

An additional privilege mode, User mode (U-mode), denies access to M-mode features, generating an illegal instruction exception when attempting to use an M-mode instruction or CSR.

Untrusted code must also be restricted to access only its own memory. Processors that implement M and U modes have a feature called Physical Memory Protection (PMP), which allows M-mode to specify which memory addresses U-mode can access. When a processor in U-mode attempts to fetch an instruction, or execute a load or store, the address is compared against all of the PMP address registers. If the address is greater than or equal to PMP address i, but less than PMP address i+1, then PMPi+1’s configuration register decides whether that access may proceed; otherwise, it raises an access exception.
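The range comparison described above can be sketched as follows. This is a deliberately simplified model: `pmpaddr` holds plain byte addresses in ascending order, each entry of `pmpcfg` is a string of permitted access types drawn from "r", "w", and "x", and the function name is hypothetical. Real PMP registers have encoding details and other matching modes that are not modeled here.

```python
def pmp_check(addr: int, access: str, pmpaddr: list, pmpcfg: list) -> bool:
    """Return True if a U-mode access of the given type may proceed.

    If pmpaddr[i] <= addr < pmpaddr[i + 1], then pmpcfg[i + 1] decides.
    An address that matches no range is denied (an access exception).
    """
    for i in range(len(pmpaddr) - 1):
        if pmpaddr[i] <= addr < pmpaddr[i + 1]:
            return access in pmpcfg[i + 1]
    return False


# Hypothetical layout: code in [0x0000, 0x8000), data in [0x8000, 0x10000).
pmpaddr = [0x0000, 0x8000, 0x10000]
pmpcfg = ["", "rx", "rw"]

fetch_ok = pmp_check(0x0100, "x", pmpaddr, pmpcfg)
code_write = pmp_check(0x0100, "w", pmpaddr, pmpcfg)
data_write = pmp_check(0x9000, "w", pmpaddr, pmpcfg)
outside = pmp_check(0x20000, "r", pmpaddr, pmpcfg)
```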

10.5 Supervisor Mode for Modern Operating Systems

The PMP scheme has several drawbacks that limit its use in general-purpose computing.

  • Since PMP supports only a fixed number of memory regions, it doesn’t scale to complex applications.
  • Since these regions must be contiguous in physical memory, the system can suffer from memory fragmentation.
  • Finally, PMP doesn’t efficiently support paging to secondary storage.

More sophisticated RISC-V processors handle these problems the same way as nearly all general-purpose architectures: using page-based virtual memory. This feature forms the core of supervisor mode (S-mode), an optional privilege mode designed to support modern Unix-like operating systems, such as Linux, FreeBSD, and Windows.

RISC-V provides an exception delegation mechanism, by which interrupts and synchronous exceptions can be delegated to S-mode selectively, bypassing M-mode software altogether.


Why not unconditionally delegate interrupts to S-mode? One reason is virtualization: if M-mode wants to virtualize a device for S-mode, its interrupts should go to M-mode, not S-mode.


If a hart takes an exception and it is delegated to S-mode, the hardware atomically undergoes several similar state transitions, using S-mode CSRs instead of M-mode ones:

  • The PC of the exceptional instruction is preserved in sepc, and the PC is set to stvec.
  • scause is set to the exception cause, and stval is set to the faulting address or some other exception-specific word of information.
  • Interrupts are disabled by setting SIE=0 in the sstatus CSR, and the previous value of SIE is preserved in SPIE.
  • The pre-exception privilege mode is preserved in the SPP field of sstatus, and the privilege mode is changed to S.

10.6 Page-Based Virtual Memory

page-table entry (PTE)

  • The U bit indicates whether this page is a user page. If U=0, U-mode cannot access this page, but S-mode can. If U=1, U-mode can access this page, but S-mode cannot.
  • The G bit indicates this mapping exists in all virtual-address spaces, information the hardware can use to improve address-translation performance. It is typically only used for pages that belong to the operating system.
  • The A bit indicates whether the page has been accessed since the last time the A bit was cleared.
  • The D bit indicates whether the page has been dirtied (i.e., written) since the last time the D bit was cleared.
  • The PPN field holds a physical page number, which is part of a physical address.
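A PTE can be unpacked with a few shifts and masks. The sketch below assumes the Sv32 layout: flag bits V, R, W, X, U, G, A, D in bits 0 through 7, two software-reserved bits, and a 22-bit PPN starting at bit 10. The V, R, W, and X bits are not described in the list above, and the function name `decode_pte` is hypothetical.

```python
FLAG_BITS = {"V": 0, "R": 1, "W": 2, "X": 3, "U": 4, "G": 5, "A": 6, "D": 7}


def decode_pte(pte: int) -> dict:
    """Split a 32-bit Sv32 page-table entry into named fields."""
    fields = {name: bool((pte >> bit) & 1) for name, bit in FLAG_BITS.items()}
    fields["PPN"] = (pte >> 10) & 0x3FFFFF
    return fields


# A valid, readable and writable user page that has been accessed and dirtied.
entry = decode_pte((0x12345 << 10) | 0b11010111)
```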

The OS relies on the A and D bits to decide which pages to swap to secondary storage. Periodically clearing the A bits helps the OS approximate which pages have been least recently used. The D bit indicates a page is even more expensive to swap out, because it must be written back to secondary storage.


An S-mode CSR, satp (Supervisor Address Translation and Protection), controls the paging system.

  1. The MODE field enables paging and selects the page-table depth
  2. The ASID (Address Space Identifier) field is optional and can be used to reduce the cost of context switches.
  3. The PPN field holds the physical address of the root page table, divided by the 4 KiB page size. Typically, M-mode software will write zero to satp before entering S-mode for the first time, disabling paging, then S-mode software will write it again after setting up the page tables.

Diagram of the Sv32 address-translation process.
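In place of the diagram, the two-level Sv32 walk can be sketched in Python. The sketch assumes the Sv32 layout (10-bit VPN[1], 10-bit VPN[0], 12-bit page offset, 4-byte PTEs, MODE in bit 31 of satp, 22-bit PPN in its low bits). Permission checks and the A and D bit updates are left out, and `translate`, `PageFault`, and the `read_word` callback are hypothetical names.

```python
PAGE_SIZE = 4096
PTE_V, PTE_R, PTE_X = 1 << 0, 1 << 1, 1 << 3


class PageFault(Exception):
    pass


def translate(va: int, satp: int, read_word) -> int:
    """Translate a 32-bit virtual address; read_word(pa) returns a 32-bit PTE."""
    if (satp >> 31) == 0:
        return va                                  # MODE = 0: paging disabled
    vpn = [(va >> 12) & 0x3FF, (va >> 22) & 0x3FF]
    table = (satp & 0x3FFFFF) * PAGE_SIZE          # root page table address
    for level in (1, 0):
        pte = read_word(table + vpn[level] * 4)
        if not pte & PTE_V:
            raise PageFault(hex(va))
        ppn = pte >> 10
        if pte & (PTE_R | PTE_X):                  # leaf entry found
            if level == 1:                         # 4 MiB superpage
                if ppn & 0x3FF:
                    raise PageFault(hex(va))       # misaligned superpage
                return (ppn << 12) | (va & 0x3FFFFF)
            return (ppn << 12) | (va & 0xFFF)
        table = ppn * PAGE_SIZE                    # pointer to next level
    raise PageFault(hex(va))


# Toy page tables held in a dict: root table at 0x1000, second level at 0x2000.
mem = {
    0x1004: (2 << 10) | PTE_V,                     # root[1] -> table at PPN 2
    0x200C: (0x80 << 10) | PTE_V | PTE_R,          # level-0 entry 3 -> PPN 0x80
    0x1008: (0x400 << 10) | PTE_V | PTE_R | PTE_X, # root[2]: superpage leaf
}
satp = (1 << 31) | 1                               # paging on, root PPN = 1
pa = translate(0x00403ABC, satp, lambda a: mem.get(a, 0))
```

The resulting physical address can be up to 34 bits wide, because the 22-bit PPN is joined with a 12-bit offset.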

Elaboration: Address-translation cache coherence in multiprocessors

sfence.vma only affects the address-translation hardware for the hart that executed the instruction. When a hart changes a page table that another hart is using, the first hart must use an interprocessor interrupt to inform the second hart that it should execute an sfence.vma instruction. This procedure is often referred to as a TLB shootdown.