The RISC V Reader Notes - yszheda/wiki GitHub Wiki

Chap 1. Why RISC-V?

1.2 Modular vs. Incremental ISAs

At the core is a base ISA, called RV32I, which runs a full software stack. RV32I is frozen and will never change, which gives compiler writers, operating system developers, and assembly language programmers a stable target.

1.3 ISA Design 101

Seven measures for judging an ISA, each with its icon from the book in parentheses:

cost (US dollar coin icon)
simplicity (wheel)
performance (speedometer)
isolation of architecture from implementation (detached halves of a circle)
room for growth (accordion)
program size (opposing arrows compressing line)
ease of programming / compiling / linking (children’s blocks for “as easy as ABC”)

macro-fusion: High-end processors can gain performance by combining simple instructions together without burdening all lower-end implementations with a larger, more complicated ISA.
It’s useful for an ISA to support position-independent code (PIC), which enables dynamic linking: shared library code can reside at different addresses in different programs. PC-relative branches and data addressing are a boon to PIC.

Chap 2. RV32I: RISC-V Base Integer ISA

2.2 RV32I Instruction formats

there are only six formats and all instructions are 32 bits long => simplifies instruction decoding.
RISC-V instructions offer three register operands, rather than having one field shared for source and destination, as with x86-32. When an operation naturally has three distinct operands but the ISA provides only a two-operand instruction, the compiler or assembly language programmer must use an extra move instruction to preserve the destination operand.
in RISC-V the specifiers of the registers to be read and written are always in the same location in all instructions, which means the register accesses can begin before decoding the instruction.
the immediate fields in these formats are always sign extended, and the sign bit is always in the most significant bit of the instruction. This decision means sign extension of the immediate, which may also be on a critical timing path, can proceed before decoding the instruction.
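
The fixed field positions can be illustrated with a small Python sketch. It is not from the book; the helper names are made up for these notes, and the bit positions are those of the standard RV32I formats. Every field is pulled out by position alone, without looking at the opcode first, which is what lets hardware start register reads and sign extension before decoding finishes.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign_bit = 1 << (bits - 1)
    return (value & (sign_bit - 1)) - (value & sign_bit)

def decode_fields(insn):
    """Extract fields by fixed position (a given format may not use all of them)."""
    return {
        "opcode": insn & 0x7F,                   # bits 6:0
        "rd":     (insn >> 7) & 0x1F,            # bits 11:7
        "rs1":    (insn >> 15) & 0x1F,           # bits 19:15
        "rs2":    (insn >> 20) & 0x1F,           # bits 24:20
        "i_imm":  sign_extend(insn >> 20, 12),   # bits 31:20, sign is bit 31
    }

# 0x00100293 is the encoding of `li t0,1` (addi x5, x0, 1).
fields = decode_fields(0x00100293)
```

Here `fields["rd"]` is 5 (t0), `fields["rs1"]` is 0 (x0), and `fields["i_imm"]` is 1.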

2.3 RV32I Registers

What’s Different?

Dedicating a register to zero is a surprisingly large factor in simplifying the RISC-V ISA.
The PC is one of ARM-32’s 16 registers, which means that any instruction that changes a register may also as a side effect be a branch instruction. The PC as a register complicates hardware branch prediction, whose accuracy is vital for good pipelined performance, since every instruction might be a branch instead of 10–20% of instructions executed in programs for typical ISAs. It also means one less general-purpose register.

2.4 RV32I Integer Computation

What’s Different?

First, there are no byte or half-word integer computation operations. The operations are always the full register width. Memory accesses take orders of magnitude more energy than arithmetic operations, so narrow data accesses can save significant energy, but narrow operations do not.
Nor does RV32I include multiply and divide; they comprise the optional RV32M extension.
RV32I also omits rotate instructions and detection of integer arithmetic overflow.

2.5 RV32I Loads and Stores

What’s Different?

The only addressing mode for loads and stores is adding a sign-extended 12-bit immediate to a register, called displacement addressing mode in x86-32.
Unlike x86-32, RISC-V has no special stack instructions. By using one of the 31 registers as the stack pointer, the standard addressing mode gets most of the benefits of push and pop instructions without the added ISA complexity.
Unlike MIPS-32, RISC-V rejected delayed load.
While ARM-32 and MIPS-32 require data to be aligned naturally to data-sized boundaries in memory, RISC-V does not.

2.6 RV32I Conditional Branch

the branch addressing mode multiplies the 12-bit immediate by 2, sign-extends it, and then adds it to the PC. PC-relative addressing helps with position independent code and thereby reduces the work of the linker and loader.
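
As a rough illustration, the Python sketch below reassembles the branch immediate from an instruction word and adds it to the PC. The function names are invented for these notes, and the scattered bit positions are those of the standard B format, which these notes do not spell out. The low bit of the offset is always 0, which is the "multiply by 2".

```python
def branch_offset(insn):
    """Reassemble the B-format immediate (a multiple of 2) and sign-extend it."""
    imm = ((((insn >> 31) & 0x1) << 12)     # imm[12]: the sign, instruction bit 31
           | (((insn >> 7) & 0x1) << 11)    # imm[11]
           | (((insn >> 25) & 0x3F) << 5)   # imm[10:5]
           | (((insn >> 8) & 0xF) << 1))    # imm[4:1]; imm[0] is implicitly 0
    return imm - (1 << 13) if imm & (1 << 12) else imm

def branch_target(pc, insn):
    """PC-relative target, wrapped to 32 bits."""
    return (pc + branch_offset(insn)) & 0xFFFFFFFF

# `bnez a3,0` at address 0xc (encoding fe069ae3) branches back by 12 bytes.
target = branch_target(0xC, 0xFE069AE3)
```

The encodings used here are taken from the lr/sc and spinlock listings in Chap 6 of these notes.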

What’s Different?

RISC-V excluded the infamous delayed branch of MIPS-32, Oracle SPARC, and others.
It also avoided the condition codes of ARM-32 and x86-32 for conditional branches. They add extra state that is implicitly set by most instructions, which needlessly complicates the dependence calculation for out-of-order execution.
Finally, it omitted the loop instructions of the x86-32: loop, loope, loopz, loopne, loopnz.

Elaboration: Reading the PC

The current PC can be obtained by setting the U-immediate field of auipc to 0.

Elaboration: Software checking of overflow

unsigned

add t0, t1, t2
bltu t0, t1, overflow

signed

add t0, t1, t2
slti t3, t2, 0        # t3 = (t2<0)
slt t4, t0, t1        # t4 = (t1+t2<t1)
bne t3, t4, overflow  # overflow if (t2<0) && (t1+t2>=t1)
                      # || (t2>=0) && (t1+t2<t1)
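
The two checks can be modeled in Python to see why they work. This is only a sketch: Python integers do not wrap, so the 32-bit wraparound of `add` is emulated with a mask, and the function names are made up for these notes.

```python
MASK32 = 0xFFFFFFFF

def to_signed(x):
    """Reinterpret the low 32 bits of x as a signed two's-complement value."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def unsigned_add_overflows(t1, t2):
    t0 = (t1 + t2) & MASK32          # add t0, t1, t2 (wraps at 32 bits)
    return t0 < t1                   # bltu t0, t1, overflow

def signed_add_overflows(t1, t2):
    t0 = to_signed(t1 + t2)          # add t0, t1, t2
    t3 = t2 < 0                      # slti t3, t2, 0
    t4 = t0 < t1                     # slt  t4, t0, t1
    return t3 != t4                  # bne  t3, t4, overflow
```

For example, `signed_add_overflows(2**31 - 1, 1)` is true because the wrapped sum is negative, while `signed_add_overflows(5, -3)` is false.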

2.7 RV32I Unconditional Jump

The single jump and link instruction (jal) serves dual functions:

To support procedure calls, it saves the address of the next instruction PC+4 into the destination register, normally the return address register ra.
To support unconditional jumps, we use the zero register (x0) instead of ra as the destination register, as x0 can’t be changed.
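
A minimal Python model of the two uses of jal, assuming a hypothetical `RegFile` class in which writes to x0 are simply discarded:

```python
MASK32 = 0xFFFFFFFF

class RegFile:
    """32 integer registers; x0 always reads as zero."""
    def __init__(self):
        self.x = [0] * 32

    def write(self, rd, value):
        if rd != 0:                      # writes to x0 are discarded
            self.x[rd] = value & MASK32

def jal(regs, pc, rd, offset):
    """Jump and link: rd = pc + 4, then return the new PC."""
    regs.write(rd, pc + 4)
    return (pc + offset) & MASK32

regs = RegFile()
call_target = jal(regs, 0x100, 1, 0x20)    # jal ra, +0x20: a procedure call
jump_target = jal(regs, 0x200, 0, -0x10)   # jal x0, -0x10: a plain jump
```

After the call, x1 (ra) holds 0x104; after the plain jump, x0 is still 0, so the link value is dropped.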

RV32I shunned intricate procedure call instructions, such as the enter and leave instructions of the x86-32, or register windows as found in the Intel Itanium, Oracle SPARC, and Cadence Tensilica.
Register windows accelerated function calls by providing many more than 32 registers. A new function would get a new set, or window, of 32 registers on a call. To pass arguments, the windows overlapped, meaning some registers were in two adjacent windows.

2.8 RV32I Miscellaneous

RISC-V uses memory mapped I/O instead of the in, ins, insb, insw and out, outs, outsb, outsw instructions of the x86-32.
It supports strings using byte loads and stores instead of the 16 special string instructions of the x86-32 (rep, movs, cmps, scas, lods, ...).

Chap 3. RISC-V Assembly Language

3.2 Calling convention

Place the arguments where the function can access them.
Jump to the function (using RV32I’s jal).
Acquire local storage resources the function needs, saving registers as required.
Perform the desired task of the function.
Place the function result value where the calling program can access it, restore any registers, and release any local storage resources.
Since a function can be called from several points in a program, return control to the point of origin (using ret).

The calling convention designates some registers as temporary registers, which are not guaranteed to be preserved across a function call, and others as saved registers, which are.
Functions that avoid calling other functions are called leaf functions. When a leaf function has only a few arguments and local variables, we can keep everything in registers without “spilling” any to memory.

3.3 Assembly

3.4 Linker

Elaboration: Linker relaxation

3.5 Static vs. Dynamic Linking

Chap 4. RV32M: Multiply and Divide

4.1 Introduction

What’s Different?

ARM-32 long had multiply but no divide instruction.
MIPS-32 uses special registers (HI and LO) as the sole destination registers for multiply and divide instructions. =>
- it takes an extra move instruction to use the result of the multiply or divide
- The HI and LO registers also increase the architectural state, making it slightly slower to switch between tasks.

Elaboration: `mulh` and `mulhu` can check for overflow in multiplication.

unsigned multiplication: the result of mulhu is 0 => no overflow
signed multiplication: all bits in the result of mulh match the sign bit of the result of mul => no overflow

Elaboration: It’s also easy to check for divide by zero. (`beqz` before the divide)

Elaboration: `mulhsu` is useful for multi-word signed multiplication.

Chap 5. RV32F and RV32D: Single- and Double-Precision Floating Point

5.2 Floating-Point Registers

Unlike x0 in RV32I, register f0 is not hardwired to 0 but is an alterable register like all the other 31 f registers.

What’s Different?

RV32FD uses 32 separate f registers. The single-precision registers occupy the rightmost (low-order) half of each of the 32 double-precision registers.
ARM-32 / MIPS-32: 32 single-precision floating-point registers but only 16 double-precision registers. They both map two single-precision registers into the left and right 32-bit halves of a double-precision register.
x86-32 floating-point arithmetic didn’t have any registers, but used a stack instead (80-bit stack entries).
RV32FD and MIPS-32 can move data directly between floating-point and integer registers.

Elaboration: RV32FD allows the rounding mode to be set per instruction. (static rounding)

5.3 Floating-Point Loads, Stores, and Arithmetic

Instead of floating-point branch instructions, RV32F and RV32D supply comparison instructions that set an integer register to 1 or 0 based on comparison of two floating-point registers: feq.s, feq.d, flt.s, flt.d, fle.s, fle.d.

# Exit if f1 < f2
flt.s x5, f1, f2 # x5 = 1 if f1 < f2; otherwise x5 = 0
bne x5, x0, Exit # if x5 != 0, jump to Exit

5.5 Miscellaneous Floating-Point Instructions

sign-injection instructions (they copy everything from rs1 except the sign bit)

Float sign inject (fsgnj.s, fsgnj.d): the result’s sign bit is rs2’s sign bit.
Float sign inject negative (fsgnjn.s, fsgnjn.d): the result’s sign bit is the opposite of rs2’s sign bit.
Float sign inject exclusive-or (fsgnjx.s, fsgnjx.d): the sign bit is the XOR of the sign bits of rs1 and rs2.

sign-injection instructions provide three popular floating-point pseudoinstructions

Copy floating-point register:

fmv.s rd,rs is really fsgnj.s rd,rs,rs
fmv.d rd,rs is really fsgnj.d rd,rs,rs

Negate:

fneg.s rd,rs maps to fsgnjn.s rd,rs,rs
fneg.d rd,rs maps to fsgnjn.d rd,rs,rs

Absolute value (since 0 XOR 0 = 0 and 1 XOR 1 = 0):

fabs.s rd,rs becomes fsgnjx.s rd,rs,rs
fabs.d rd,rs becomes fsgnjx.d rd,rs,rs
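
The three sign-injection operations and the pseudoinstructions built on them can be sketched in Python by manipulating the raw bits of a single-precision value. The function names below mirror the mnemonics but are otherwise an illustration written for these notes, using the standard `struct` module to get at the IEEE 754 bit pattern.

```python
import struct

SIGN = 0x80000000   # sign bit of a 32-bit float
REST = 0x7FFFFFFF   # exponent and fraction bits

def to_bits(f):
    return struct.unpack("<I", struct.pack("<f", f))[0]

def from_bits(b):
    return struct.unpack("<f", struct.pack("<I", b))[0]

def fsgnj_s(rs1, rs2):    # sign taken from rs2
    return from_bits((to_bits(rs1) & REST) | (to_bits(rs2) & SIGN))

def fsgnjn_s(rs1, rs2):   # sign is the opposite of rs2's sign
    return from_bits((to_bits(rs1) & REST) | (~to_bits(rs2) & SIGN))

def fsgnjx_s(rs1, rs2):   # sign is rs1's sign XOR rs2's sign
    return from_bits(to_bits(rs1) ^ (to_bits(rs2) & SIGN))

def fmv_s(rs):  return fsgnj_s(rs, rs)    # copy
def fneg_s(rs): return fsgnjn_s(rs, rs)   # negate
def fabs_s(rs): return fsgnjx_s(rs, rs)   # absolute value
```

With both sources equal, the XOR of the two sign bits is always 0, which is why `fabs_s(-2.0)` gives 2.0.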

classify (`fclass.s`, `fclass.d`)

Chap 6. RV32A: Atomic Instructions

6.1 Introduction

RV32A has two types of atomic operations for synchronization:

atomic memory operations (AMO)
load reserved / store conditional

AMOs and LR/SC require naturally aligned memory addresses because it is onerous for hardware to guarantee atomicity across cache-line boundaries.

The AMO instructions atomically perform an operation on an operand in memory and set the destination register to the original memory value. Atomic means there can be no interrupt between the read and the write of memory, nor could other processors modify the memory value between the memory read and write of the AMO instruction.

Load reserved reads a word from memory, writes it to the destination register, and records a reservation on that word in memory.

Store conditional stores a word at the address in a source register provided there exists a load reservation on that memory address. It writes zero to the destination register if the store succeeded, or a nonzero error code otherwise.

# Compare-and-swap (CAS) memory word M[a0] using lr/sc.
# Expected old value in a1; desired new value in a2.
0: 100526af lr.w a3,(a0)       # Load old value
4: 06b69e63 bne a3,a1,80       # Old value equals a1?
8: 18c526af sc.w a3,a2,(a0)    # Swap in new value if so
c: fe069ae3 bnez a3,0          # Retry if store failed
... code following successful CAS goes here ...
80:                            # Unsuccessful CAS.
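
The retry loop above can be mimicked with a toy Python model. This is a deliberate simplification: the `Memory` class, with a single reservation and a `store_from_other_hart` method that stands in for interference, is hypothetical and far simpler than real hardware.

```python
class Memory:
    """Word memory with one load reservation (models a single hart)."""
    def __init__(self):
        self.words = {}
        self.reservation = None

    def lr_w(self, addr):
        self.reservation = addr              # record the reservation
        return self.words.get(addr, 0)

    def sc_w(self, addr, value):
        ok = self.reservation == addr
        self.reservation = None              # reservation is consumed either way
        if ok:
            self.words[addr] = value
        return 0 if ok else 1                # 0 = success, nonzero = failure

    def store_from_other_hart(self, addr, value):
        self.words[addr] = value
        if self.reservation == addr:
            self.reservation = None          # interference kills the reservation

def compare_and_swap(mem, addr, expected, new):
    while True:
        old = mem.lr_w(addr)                 # lr.w a3,(a0)
        if old != expected:                  # bne a3,a1,80
            return False
        if mem.sc_w(addr, new) == 0:         # sc.w a3,a2,(a0)
            return True                      # otherwise retry (bnez a3,0)

mem = Memory()
mem.words[0x40] = 7
swapped = compare_and_swap(mem, 0x40, 7, 9)
```

If another hart stores to the word between the `lr_w` and the `sc_w`, the store conditional reports failure and the loop tries again.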

The rationale for also having AMO instructions is that they scale better to large multiprocessor systems than load reserved and store conditional. They can also be used to implement reduction operations efficiently. AMOs are useful as well for communicating with I/O devices, because they perform a read and a write in a single atomic bus transaction. This atomicity can both simplify device drivers and improve I/O performance.

# Critical section guarded by test-and-set spinlock using an AMO.
0: 00100293 li t0,1                    # Initialize lock value
4: 0c55232f amoswap.w.aq t1,t0,(a0)    # Attempt to acquire lock
8: fe031ee3 bnez t1,4                  # Retry if unsuccessful
... critical section goes here ...
20: 0a05202f amoswap.w.rl x0,x0,(a0)   # Release lock.
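
A Python sketch of the same test-and-set lock. The `AtomicWord` class is made up for these notes; it uses a `threading.Lock` only to stand in for the atomicity that the hardware gives amoswap, and it does not model the aq/rl ordering bits.

```python
import threading

class AtomicWord:
    """A memory word whose swap is atomic, like amoswap.w."""
    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()   # stand-in for hardware atomicity

    def amoswap(self, new):
        with self._guard:
            old = self._value
            self._value = new
            return old                   # AMOs return the original memory value

def try_acquire(lock):
    return lock.amoswap(1) == 0          # we hold the lock only if it was 0 before

def acquire(lock):
    while not try_acquire(lock):         # bnez t1,4: retry if unsuccessful
        pass

def release(lock):
    lock.amoswap(0)                      # amoswap.w.rl x0,x0,(a0)

lock = AtomicWord()
first = try_acquire(lock)
second = try_acquire(lock)
release(lock)
third = try_acquire(lock)
```

Only the first and third attempts succeed: the second one swaps in 1 but reads back 1, so it knows someone else holds the lock.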

Elaboration: Memory consistency models

RISC-V has a relaxed memory consistency model, so other threads may view some memory accesses out of order. An atomic operation with the aq bit (acquire bit) set guarantees that other threads will see the AMO in-order with subsequent memory accesses. If the rl bit (release bit) is set, other threads will see the atomic operation in-order with previous memory accesses.

Chap 7. RV32C: Compressed Instructions

RV32C takes a novel approach: every short instruction must map to one single standard 32-bit RISC-V instruction. Moreover, only the assembler and linker are aware of the 16-bit instructions, and it is up to them to replace a wide instruction with its narrow cousin.

Chap 8. RV32V: Vector

8.1 Introduction

The size of the vector registers is determined by the implementation, rather than baked into the opcode, as with SIMD.
separating the vector length and maximum operations per clock cycle from the instruction encoding is the crux of the vector architecture: the vector microarchitect can flexibly design the data-parallel hardware without affecting the programmer, and the programmer can take advantage of longer vectors without rewriting the code.
vector architectures have many fewer instructions than SIMD architectures.
vector architectures have well-established compiler technology, unlike SIMD.
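
To make the first two points concrete, here is a Python sketch of a vector-length-agnostic loop. The parameter `max_vl` is a stand-in for whatever vector length the implementation provides; it is an assumption of this sketch, not an RV32V name. The same code runs unchanged whether the hardware handles 4 or 64 elements per pass.

```python
def daxpy(a, x, y, max_vl):
    """y = a*x + y in chunks of at most max_vl elements; returns the pass count."""
    n = len(x)
    i = 0
    passes = 0
    while i < n:
        vl = min(max_vl, n - i)      # vector length used for this pass
        y[i:i + vl] = [a * xi + yi for xi, yi in zip(x[i:i + vl], y[i:i + vl])]
        i += vl
        passes += 1
    return passes

x = [float(i) for i in range(10)]
y_small = [1.0] * 10
y_large = [1.0] * 10
passes_small = daxpy(2.0, x, y_small, max_vl=4)    # short vector registers
passes_large = daxpy(2.0, x, y_large, max_vl=64)   # long vector registers
```

Both runs produce the same result; only the number of passes differs (3 versus 1). With SIMD, the width would be part of each opcode and the loop would need rewriting.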

8.3 Vector Registers and Dynamic Typing

RV32V takes the novel approach of associating the data type and length with the vector registers rather than with the instruction opcodes.

Advantages of dynamic register typing:

slash the number of vector instructions
programs can disable unused vector registers.

Elaboration: RV32V can switch context quickly.

8.4 Vector Loads and Stores

dense arrays
- single-dimension arrays: vld, vst
- multi-dimension arrays (strided data transfers): vlds, vsts
sparse arrays (indexed data transfers / gather and scatter): vldx, vstx
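
The three access patterns can be sketched in Python over a flat list. The function names are invented for these notes, and addresses count elements rather than bytes to keep the example short.

```python
def load_unit_stride(mem, base, vl):
    """Like vld: consecutive elements."""
    return [mem[base + i] for i in range(vl)]

def load_strided(mem, base, stride, vl):
    """Like vlds: every stride-th element, e.g. one column of a matrix."""
    return [mem[base + i * stride] for i in range(vl)]

def load_indexed(mem, base, indices):
    """Like vldx: gather elements at arbitrary offsets."""
    return [mem[base + j] for j in indices]

# A 3x4 matrix stored row by row: element (r, c) is at index 4*r + c.
mem = list(range(100, 112))
row1 = load_unit_stride(mem, 4, 4)
col2 = load_strided(mem, 2, 4, 3)
picked = load_indexed(mem, 0, [0, 5, 11])
```

Here `row1` is [104, 105, 106, 107], `col2` is [102, 106, 110], and `picked` is [100, 105, 111]. The stores (vst, vsts, vstx) follow the same patterns in the other direction.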

8.6 Conditional Execution of Vector Operations

RV32V provides 8 vector predicate registers (vpi) to act as vector masks.
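
A minimal Python sketch of a masked vector add. It assumes that elements whose mask bit is 0 keep their old destination value; that choice is an assumption of this sketch, not something these notes specify.

```python
def masked_add(dest, a, b, mask):
    """Elementwise a + b where the mask is set; elsewhere dest is left unchanged."""
    return [x + y if m else d for d, x, y, m in zip(dest, a, b, mask)]

result = masked_add([0, 0, 0, 0], [1, 2, 3, 4], [10, 20, 30, 40], [1, 0, 1, 0])
```

This is how an if statement inside a loop can still be vectorized: the comparison sets the predicate register, and the guarded operations run under that mask.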

8.10 Concluding Remarks

SIMD vs. vector:
- dynamic instruction count
- SIMD violates the ISA design principle of isolating the architecture from implementation

Chap 9. RV64: 64-bit Address Instructions

Chap 10. RV32/64 Privileged Architecture

10.2 Machine Mode for Simple Embedded Systems

Machine mode, abbreviated as M-mode, is the most privileged mode that a RISC-V hart (hardware thread) can execute in. Harts running in M-mode have full access to memory, I/O, and low-level system features necessary to boot and configure the system. As such, it is the only privilege mode that all standard RISC-V processors implement; indeed, simple RISC-V microcontrollers support only M-mode.

Hart is a contraction of hardware thread. Software threads are time-multiplexed on harts. Most processor cores have only one hart.

The most important feature of machine mode is the ability to intercept and handle exceptions: unusual runtime events.

Synchronous exceptions arise as a result of instruction execution, as when accessing an invalid memory address or executing an instruction with an invalid opcode.
Interrupts are external events that are asynchronous with the instruction stream, like a mouse button click.

Synchronous exceptions:

Access fault exceptions arise when a physical memory address doesn’t support the access type—for example, attempting to store to a ROM.
Breakpoint exceptions arise from executing an ebreak instruction, or when an address or datum matches a debug trigger.
Environment call exceptions arise from executing an ecall instruction.
Illegal instruction exceptions result from decoding an invalid opcode.
Misaligned address exceptions occur when the effective address isn’t divisible by the access size—for example, amoadd.w with an address of 0x12.

There are three standard sources of interrupts:

software. Software interrupts are triggered by storing to a memory-mapped register and are generally used by one hart to interrupt another hart, a mechanism other architectures refer to as an interprocessor interrupt.
timer. Timer interrupts are raised when the real-time counter mtime reaches or exceeds a hart’s time comparator, a memory-mapped register named mtimecmp.
external

10.3 Machine-Mode Exception Handling

Eight control and status registers (CSRs):

mtvec, Machine Trap Vector, holds the address the processor jumps to when an exception occurs.
mepc, Machine Exception PC, points to the instruction where the exception occurred.
mcause, Machine Exception Cause, indicates which exception occurred.
mie, Machine Interrupt Enable, lists which interrupts the processor can take and which it must ignore.
mip, Machine Interrupt Pending, lists the interrupts currently pending.
mtval, Machine Trap Value, holds additional trap information: the faulting address for address exceptions, the instruction itself for illegal instruction exceptions, and zero for other exceptions.
mscratch, Machine Scratch, holds one word of data for temporary storage.
mstatus, Machine Status, holds the global interrupt enable, along with a plethora of other state.

When a hart takes an exception, the hardware atomically undergoes several state transitions:

The PC of the exceptional instruction is preserved in mepc, and the PC is set to mtvec. (For synchronous exceptions, mepc points to the instruction that caused the exception; for interrupts, it points where execution should resume after the interrupt is handled.)
mcause is set to the exception cause, and mtval is set to the faulting address or some other exception-specific word of information.
Interrupts are disabled by setting MIE=0 in the mstatus CSR, and the previous value of MIE is preserved in MPIE.
The pre-exception privilege mode is preserved in mstatus’ MPP field, and the privilege mode is changed to M. (If the processor only implements M-mode, this step is effectively skipped.)
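
The four steps can be written out as a Python sketch. The function `take_trap` and the dictionary of CSRs are illustrative only; the mstatus bit positions (MIE at bit 3, MPIE at bit 7, MPP at bits 12:11) come from the privileged specification, and mtvec is assumed to hold a plain handler address (direct mode).

```python
MIE_BIT, MPIE_BIT, MPP_SHIFT = 3, 7, 11
U_MODE, S_MODE, M_MODE = 0, 1, 3

def take_trap(csr, pc, priv, cause, tval=0):
    """Apply the M-mode trap-entry state changes; returns (new_pc, new_priv)."""
    csr["mepc"] = pc                          # step 1: remember the PC
    csr["mcause"] = cause                     # step 2: record the cause...
    csr["mtval"] = tval                       # ...and any extra information
    status = csr["mstatus"]
    mie = (status >> MIE_BIT) & 1
    status &= ~((1 << MIE_BIT) | (1 << MPIE_BIT) | (0b11 << MPP_SHIFT))
    status |= mie << MPIE_BIT                 # step 3: MPIE = old MIE, MIE = 0
    status |= priv << MPP_SHIFT               # step 4: MPP = previous privilege
    csr["mstatus"] = status
    return csr["mtvec"], M_MODE               # jump to the handler in M-mode

csr = {"mtvec": 0x80000100, "mstatus": 1 << MIE_BIT,
       "mepc": 0, "mcause": 0, "mtval": 0}
# Illegal instruction (cause 2) in U-mode; mtval holds the offending instruction.
new_pc, new_priv = take_trap(csr, 0x80001234, U_MODE, 2, tval=0xDEADBEEF)
```

Afterwards mstatus is 0x80: interrupts are off (MIE=0), the old enable is saved (MPIE=1), and MPP records U-mode.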

Elaboration: wfi works whether or not interrupts are globally enabled.

10.4 User Mode and Process Isolation in Embedded Systems

An additional privilege mode, User mode (U-mode), denies untrusted code access to these M-mode features, generating an illegal instruction exception when it attempts to use an M-mode instruction or CSR.

Untrusted code must also be restricted to access only its own memory. Processors that implement M and U modes have a feature called Physical Memory Protection (PMP), which allows M-mode to specify which memory addresses U-mode can access. When a processor in U-mode attempts to fetch an instruction, or execute a load or store, the address is compared against all of the PMP address registers. If the address is greater than or equal to PMP address i, but less than PMP address i+1, then PMPi+1’s configuration register decides whether that access may proceed; otherwise, it raises an access exception.
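
The range check described above can be sketched in Python. This follows the simplified description in these notes rather than the full PMP specification: the lists hold plain byte addresses and permission strings, both of which are assumptions made for the example.

```python
def pmp_check(addr, access, pmp_addr, pmp_cfg):
    """Return True if a U-mode access ('r', 'w' or 'x') to addr may proceed."""
    for i in range(len(pmp_addr) - 1):
        if pmp_addr[i] <= addr < pmp_addr[i + 1]:
            return access in pmp_cfg[i + 1]   # PMP i+1's configuration decides
    return False                              # no region matched: access fault

# Two regions: [0x0000, 0x1000) is read/execute, [0x1000, 0x3000) is read/write.
pmp_addr = [0x0000, 0x1000, 0x3000]
pmp_cfg = ["", "rx", "rw"]
can_fetch = pmp_check(0x0800, "x", pmp_addr, pmp_cfg)
can_store = pmp_check(0x0800, "w", pmp_addr, pmp_cfg)
```

A false result corresponds to raising an access fault exception.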

10.5 Supervisor Mode for Modern Operating Systems

The PMP scheme has several drawbacks that limit its use in general-purpose computing.

Since PMP supports only a fixed number of memory regions, it doesn’t scale to complex applications.
Since these regions must be contiguous in physical memory, the system can suffer from memory fragmentation.
Finally, PMP doesn’t efficiently support paging to secondary storage.

More sophisticated RISC-V processors handle these problems the same way as nearly all general-purpose architectures: using page-based virtual memory. This feature forms the core of supervisor mode (S-mode), an optional privilege mode designed to support modern operating systems, including Unix-like ones such as Linux and FreeBSD, as well as Windows.

RISC-V provides an exception delegation mechanism, by which interrupts and synchronous exceptions can be delegated to S-mode selectively, bypassing M-mode software altogether.

Why not unconditionally delegate interrupts to S-mode? One reason is virtualization: if M-mode wants to virtualize a device for S-mode, its interrupts should go to M-mode, not S-mode.

If a hart takes an exception and it is delegated to S-mode, the hardware atomically undergoes several similar state transitions, using S-mode CSRs instead of M-mode ones:

The PC of the exceptional instruction is preserved in sepc, and the PC is set to stvec.
scause is set to the exception cause, and stval is set to the faulting address or some other exception-specific word of information.
Interrupts are disabled by setting SIE=0 in the sstatus CSR, and the previous value of SIE is preserved in SPIE.
The pre-exception privilege mode is preserved in sstatus’ SPP field, and the privilege mode is changed to S.

10.6 Page-Based Virtual Memory

page-table entry (PTE)

The U bit indicates whether this page is a user page. If U=0, U-mode cannot access this page, but S-mode can. If U=1, U-mode can access this page, but S-mode cannot.
The G bit indicates this mapping exists in all virtual-address spaces, information the hardware can use to improve address-translation performance. It is typically only used for pages that belong to the operating system.
The A bit indicates whether the page has been accessed since the last time the A bit was cleared.
The D bit indicates whether the page has been dirtied (i.e., written) since the last time the D bit was cleared.
The PPN field holds a physical page number, which is part of a physical address.
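
A small Python sketch that unpacks a 32-bit Sv32 PTE. The bit positions are from the privileged specification (flags in bits 7:0, PPN in bits 31:10); the V, R, W and X bits are not described in the list above but are included for completeness, and the function name is made up.

```python
PTE_FLAGS = {"V": 0, "R": 1, "W": 2, "X": 3, "U": 4, "G": 5, "A": 6, "D": 7}

def decode_pte(pte):
    """Split a 32-bit Sv32 PTE into its flag bits and physical page number."""
    fields = {name: (pte >> bit) & 1 for name, bit in PTE_FLAGS.items()}
    fields["PPN"] = (pte >> 10) & 0x3FFFFF    # 22-bit physical page number
    return fields

# A valid, readable and writable user page that has been accessed and dirtied.
pte = decode_pte((0x55 << 10) | 0xD7)
```

In this example `pte["D"]` is 1, so the page would have to be written back before its frame could be reused.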

The OS relies on the A and D bits to decide which pages to swap to secondary storage. Periodically clearing the A bits helps the OS approximate which pages have been least recently used. The D bit indicates a page is even more expensive to swap out, because it must be written back to secondary storage.

An S-mode CSR, satp (Supervisor Address Translation and Protection), controls the paging system.

The MODE field enables paging and selects the page-table depth.
The ASID (Address Space Identifier) field is optional and can be used to reduce the cost of context switches.
The PPN field holds the physical address of the root page table, divided by the 4 KiB page size. Typically, M-mode software will write zero to satp before entering S-mode for the first time, disabling paging, then S-mode software will write it again after setting up the page tables.

Diagram of the Sv32 address-translation process.
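
In place of the diagram, the walk can be sketched in Python. This is a simplified model: it follows the Sv32 layout from the privileged specification (two 10-bit VPN fields, 12-bit page offset, 4-byte PTEs) but leaves out permission checks and A/D updates. The `read_word` callback and the `PageFault` class are invented for the example.

```python
PAGE_SIZE = 4096
PTE_V, PTE_R, PTE_X = 1 << 0, 1 << 1, 1 << 3

class PageFault(Exception):
    pass

def translate(va, satp, read_word):
    """Translate a 32-bit virtual address using a two-level Sv32 page table."""
    if (satp >> 31) == 0:                     # MODE = 0: paging disabled
        return va
    vpn = [(va >> 12) & 0x3FF, (va >> 22) & 0x3FF]
    table = (satp & 0x3FFFFF) * PAGE_SIZE     # root table = satp.PPN * 4 KiB
    for level in (1, 0):
        pte = read_word(table + vpn[level] * 4)
        if not pte & PTE_V:
            raise PageFault(hex(va))
        ppn = pte >> 10
        if pte & (PTE_R | PTE_X):             # leaf PTE: translation found
            if level == 1:                    # 4 MiB megapage
                if ppn & 0x3FF:
                    raise PageFault(hex(va))  # misaligned megapage
                return (ppn << 12) | (va & 0x3FFFFF)
            return (ppn << 12) | (va & 0xFFF)
        table = ppn * PAGE_SIZE               # pointer to the next level
    raise PageFault(hex(va))

# Root table at PPN 1; its entry 1 points to a second-level table at PPN 2,
# whose entry 3 maps a readable and writable page at PPN 0x55.
memory = {0x1004: (2 << 10) | 0x1, 0x200C: (0x55 << 10) | 0x7}
pa = translate(0x00403ABC, 0x80000001, lambda addr: memory.get(addr, 0))
```

The virtual address 0x00403ABC splits into VPN[1]=1, VPN[0]=3 and offset 0xABC, so the result is 0x55ABC.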

Elaboration: Address-translation cache coherence in multiprocessors

sfence.vma only affects the address-translation hardware for the hart that executed the instruction. When a hart changes a page table that another hart is using, the first hart must use an interprocessor interrupt to inform the second hart that it should execute an sfence.vma instruction. This procedure is often referred to as a TLB shootdown.
