Suspend Resume

Supporting Suspend Resume

The main SRAM for Precursor is battery-backed, and is persistent even when the SoC is fully powered down.

Therefore, not all servers need to be aware of suspend/resume: only servers that store state in volatile hardware registers or RAM require coordination. Most "application-level" servers can remain unaware of suspend/resume, as their state is stored in memory that persists across a suspend.

Hardware servers that need to preserve volatile state (e.g. CSR configurations) must do the following:

  • Most hardware-facing servers have an implementation block which allows for two different views of the server: one for actual hardware, and one for emulation ("hosted mode"). The implementation block that touches the hardware CSRs needs to allocate a RegManager structure, which is the backing storage for the hardware CSRs. The structure takes a const-generic parameter that specifies the number of registers or bitfields to be backed up. As a rough initial guess, use the _NUMREGS constant generated by the UTRA, which is the number of registers in the CSR block (don't forget to wrap it in curly braces so Rust knows it's a const generic and not a type); once the code has stabilized, this can be trimmed down. If too few entries are allocated, the manager will panic when you try to push a register that doesn't fit. (See the sketch below.)
  • All CSRs that need backing should be added to the RegManager structure using a .push() method. The .push() method takes as an argument the Register or Field you wish to store.
  • A pair of methods should be added to the implementation block that correspond to a suspend or a resume operation. In the simplest case, these methods simply call the .suspend() or .resume() trait methods on the RegManager structure.
  • An Opcode should be added to the server's API for a suspend callback. By convention, we name this Opcode SuspendResume.
  • In the initialization routine, the server allocates a Susres object and calls .hook_suspend_callback() on it, handing it a local connection CID for the incoming SuspendResume message.
  • In the main loop, the SuspendResume opcode should be handled. This is the general form of the handler:
            Some(api::Opcode::SuspendResume) => xous::msg_scalar_unpack!(msg, token, _, _, _, {
                implementation.suspend();
                susres.suspend_until_resume(token).expect("couldn't execute suspend/resume");
                implementation.resume();
            }),

That's it. From the server's standpoint, when a SuspendResume message comes in, it carries a token that helps the suspend/resume manager tally who is ready for the suspend operation. The suspend_until_resume() call looks like it does nothing, but in fact it blocks execution across the power-down. On power-up, the system resumes execution within that routine and then returns to the hardware server, which is why .resume() is the next method to be invoked.
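For concreteness, below is a minimal sketch of how the RegManager plumbing fits into a hardware server's implementation block. The server name (ExampleHw), the utra::example block and its registers are made up for illustration, and the exact susres crate signatures may differ from the current code base; consult the susres crate in xous-core for the authoritative API.

    use susres::{RegManager, RegOrField, SuspendResume};
    use utralib::generated::*;

    struct ExampleHw {
        csr: CSR<u32>, // normal MMIO accessor for this block
        // Backing store for the hardware CSRs, sized by the UTRA-generated
        // register count. Note the curly braces: it's a const generic, not a type.
        susres_mgr: RegManager<{ utra::example::EXAMPLE_NUMREGS }>,
    }

    impl ExampleHw {
        fn new(csr_base: *mut u32) -> ExampleHw {
            let mut hw = ExampleHw {
                csr: CSR::new(csr_base),
                susres_mgr: RegManager::new(csr_base),
            };
            // Register every CSR (or field) that must survive a power cycle.
            // Pushing more entries than the const-generic capacity allows panics.
            hw.susres_mgr.push(RegOrField::Reg(utra::example::EV_ENABLE), None);
            hw.susres_mgr.push(RegOrField::Field(utra::example::CONTROL_ENABLE), None);
            hw
        }
        // Called from the SuspendResume opcode handler shown above.
        fn suspend(&mut self) { self.susres_mgr.suspend(); }
        fn resume(&mut self) { self.susres_mgr.resume(); }
    }

The Susres object itself is allocated in the server's init routine and hooked with .hook_suspend_callback(), handing it the SuspendResume opcode and a local connection CID as described in the list above; the match arm shown earlier then ties the two halves together.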

Internal Process Overview

It's assumed that there is a process called the susres server which coordinates suspend/resume.

The process has a thread that runs an execution_gate server. This is a unique SID whose sole purpose is to receive blocking scalars and hold them until a resume happens. The "resume" state is coordinated by an AtomicBool within the susres server (see the sketch below).
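The gating policy is simple enough to sketch. Here, the park/release plumbing stands in for Xous's deferred blocking-scalar reply mechanism, and the names (RESUMED, gate_one_caller, on_resume) are illustrative rather than the actual susres server code:

    use std::sync::atomic::{AtomicBool, Ordering};

    static RESUMED: AtomicBool = AtomicBool::new(false);

    // A blocking SuspendingNow caller is parked until the resume flag flips.
    fn gate_one_caller(
        caller: usize,                   // e.g. an ID for the message sender
        parked: &mut Vec<usize>,
        release: &mut impl FnMut(usize), // answers the caller's blocking scalar
    ) {
        if RESUMED.load(Ordering::SeqCst) {
            release(caller); // resume already happened: let the caller through
        } else {
            parked.push(caller); // hold the caller; its thread stays blocked
        }
    }

    // Called once the resume message arrives from the interrupt handler.
    fn on_resume(parked: &mut Vec<usize>, release: &mut impl FnMut(usize)) {
        RESUMED.store(true, Ordering::SeqCst);
        for caller in parked.drain(..) {
            release(caller); // unblocks each subscriber's suspended thread
        }
    }

The important property is that a subscriber which sends its blocking scalar before the power goes down stays blocked across the suspend, and is released only after the resume has been processed.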

The susres server also owns a software interrupt. The interrupt handler has a structure like this:

    // Pseudocode: if the hardware "resume" bit reads clear, we are on the way
    // down, so cut power and spin; otherwise we just woke up, so notify the
    // susres main loop.
    fn susres_handler() {
        if !resume_register() {
            // suspend path: power is cut here; execution stops until resume
            shutdown_system();
            loop {}
        } else {
            // resume path: non-blocking message to the susres main loop
            try_send_message(sus_main_cid, ResumeMessage);
        }
    }

The ticktimer is augmented to split out some CSRs to a different virtual memory page, so that the susres server can manage the ticktimer state directly. This ensures that system time is kept precisely across suspend and resume operations. A strong assumption is made about the monotonicity of the ticktimer (it's a 64-bit millisecond counter, so in practice it will never roll over: 2^64 ms is roughly 580 million years), and so by letting susres manage the ticktimer, we get fine-grained accuracy on the ticktimer state without having to play funny games with inter-process thread scheduling.

The "clean suspend" marker is a page in RAM should contain at a minimum:

  • A random nonce
  • The BtSeed of the FPGA
  • A hash of the above

The purpose of this marker is to make sure we don't try to resume from a "random" state of RAM. The failure mode we'd like to avoid is a partial power-down in which the RAM state decayed, but not enough to make the hash check fail. Thus, for the "random nonce", perhaps we should fill most of the page with random data, on the theory that the more bits are included in the hash check, the less likely we are to miss an actual power outage event.
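For illustration, here is a hypothetical layout and check for such a marker page; the field positions, page size, and digest are placeholders rather than the actual on-RAM format used by the loader:

    const MARKER_WORDS: usize = 1024; // one 4 KiB page of u32 words

    struct CleanSuspendMarker<'a> {
        page: &'a mut [u32; MARKER_WORDS],
    }

    impl<'a> CleanSuspendMarker<'a> {
        // Fill most of the page with TRNG data, stash the BtSeed, then seal it
        // with a digest over everything except the digest word itself.
        fn seal(&mut self, bt_seed: u64, mut trng: impl FnMut() -> u32) {
            for w in self.page[..MARKER_WORDS - 3].iter_mut() {
                *w = trng();
            }
            self.page[MARKER_WORDS - 3] = bt_seed as u32;
            self.page[MARKER_WORDS - 2] = (bt_seed >> 32) as u32;
            let d = digest(&self.page[..MARKER_WORDS - 1]);
            self.page[MARKER_WORDS - 1] = d;
        }
        // On boot: a digest mismatch, or a BtSeed mismatch after an FPGA update,
        // means the RAM contents can't be trusted and we fall back to a cold boot.
        fn is_clean(&self, bt_seed: u64) -> bool {
            self.page[MARKER_WORDS - 3] == bt_seed as u32
                && self.page[MARKER_WORDS - 2] == (bt_seed >> 32) as u32
                && self.page[MARKER_WORDS - 1] == digest(&self.page[..MARKER_WORDS - 1])
        }
    }

    // Placeholder mixing function; the real implementation would use a proper hash.
    fn digest(words: &[u32]) -> u32 {
        words.iter().fold(0x5a5a_a5a5u32, |acc, &w| acc.rotate_left(5) ^ w)
    }

The actual marker also records the PID of the susres server, as noted in the suspend process below.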

Suspend process

  1. During the first boot, a copy of the kernel's computed arguments is kept in the loader page, and the loader page is mapped as used, preventing it from being overwritten in Xous.
  2. A Suspend request is sent to the susres server.
  3. The susres server calls back all suspend subscribers to prepare for suspend.
  4. Suspend subscribers handle the suspend request per their own implementation, but at a minimum they all guarantee this behavior:
  • store hardware registers
  • send a SuspendReady scalar to the susres server, indicating that it is ready to be suspended
  • send a blocking scalar SuspendingNow to the susres execution_gate server
  • the subscriber thread that owns the CSR page does not modify registers after the SuspendReady message is sent (this may include disabling any interrupt handlers by setting EV_EVENT to 0)
  5. The susres server waits until either all suspend subscribers have indicated SuspendReady or a timeout expires, whichever comes first (a sketch of this tally loop follows the list).
  6. Record whether any hardware failed to suspend within the timeout.
  7. Set the "clean suspend" marker in RAM. Note that this record should also be derived from BtSeed, so we can catch the case where the FPGA image was updated during the suspend (in which case we should do a clean boot).
  8. Note the PID of the susres server in the "clean suspend" marker, so we know which process to resume into.
  9. The susres server ensures the resume bit is cleared and trips the interrupt to execute the susres_handler() noted above. This causes the kernel to save the last thread context and shut the system down from inside the interrupt handler.
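A minimal sketch of the tally-with-timeout in step 5 follows. Here, elapsed_ms and next_ready_token are hypothetical stand-ins for the ticktimer and for pulling SuspendReady tokens off the server's message queue, so this illustrates the logic rather than the actual susres implementation:

    // Returns true if every subscriber reported SuspendReady before the timeout.
    fn all_subscribers_ready(
        expected: usize,
        timeout_ms: u64,
        mut elapsed_ms: impl FnMut() -> u64,
        mut next_ready_token: impl FnMut() -> Option<usize>,
    ) -> bool {
        let deadline = elapsed_ms() + timeout_ms;
        let mut ready = vec![false; expected];
        let mut count = 0;
        while count < expected {
            if elapsed_ms() > deadline {
                return false; // step 6: some hardware failed to suspend in time
            }
            if let Some(token) = next_ready_token() {
                // the token identifies which subscriber just checked in
                if token < expected && !ready[token] {
                    ready[token] = true;
                    count += 1;
                }
            }
        }
        true
    }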

Resume process

  1. Power on, parse kernel args to figure out how big RAM is, etc.
  2. Check for the clean suspend marker; if it is absent or invalid, do a cold boot. Either way, zero out the marker. If it is present, extract the susres PID.
  3. Re-initialize kernel peripherals (e.g. TRNG)
  4. Note whether we had a clean suspend in the susres register; set up the resume interrupt and trigger it (interrupts are still masked -- this is handled later).
  5. Flip the bit on the "resume" hardware susres register; ensure the ticktimer is paused so the user-space code can reload it right away.
  6. Reload the backup kernel arguments; patch the PID of the susres server into SATP.
  7. Boot into the kernel with the resume argument set; the asm.S post-amble for the loader contains the code that sets up the SATP and brings us into virtual memory mode.
  8. A separate asm.S pre-amble for the kernel checks the resume argument; if true, it sets up the system as if it were entering an interrupt context in the susres handler by setting the default stack pointer, enabling interrupts, and setting scause so an interrupt appears to be triggered.
  9. Jump to the interrupt dispatch routine in Xous, e.g. _start_trap_rust
  10. Xous enters the susres resume interrupt handler, but with the resume hardware bit set, causing it to pick the resume path.
  11. The susres server gets the resume message and sets the AtomicBool so that the execution_gate ungates execution.
  12. All the blocking SuspendingNow scalars parked at the execution_gate are unblocked. This allows thread execution to resume, at which point servers restore their hardware registers.
  13. Execution resumes as normal.

Low-Level Notes on Resume Preparation

This is the meat of the code that enables the MMU, given that the tables have been set up already: https://github.com/betrusted-io/xous-core/blob/main/loader/src/asm.S#L26-L54

With RISC-V, there are three modes: Machine, Supervisor, and User. Machine mode always uses physical addresses; Supervisor and User go through the MMU once translation is enabled via the MODE field of satp. So what you have to do is set mstatus.MPP such that when you return from the trap (mret), the hart drops into Supervisor mode, and set the return-from-trap address (mepc) to the address of main. When the mret executes, the MMU takes effect and you enter Supervisor mode.
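For concreteness, here is a minimal sketch of that trick as Rust inline assembly, rather than the actual asm.S linked above; the function name is made up, and the satp value is assumed to be pre-computed (MODE | ASID | root page-table PPN for Sv32):

    // Enter Supervisor mode with the MMU on by "returning" from a trap that was
    // never taken.
    #[cfg(target_arch = "riscv32")]
    fn enter_supervisor(satp: usize, entry: usize) -> ! {
        let mpp_set: usize = 1 << 11; // mstatus.MPP = 0b01 (Supervisor): set bit 11...
        let mpp_clr: usize = 1 << 12; // ...and clear bit 12
        unsafe {
            core::arch::asm!(
                "csrw satp, {satp}",   // install the page tables (no effect yet in M-mode)
                "sfence.vma",          // flush any stale translations
                "csrs mstatus, {set}",
                "csrc mstatus, {clr}", // previous-privilege field now says Supervisor
                "csrw mepc, {entry}",  // where mret will "return" to (e.g. main)
                "mret",                // drop to S-mode; translation is now active
                satp = in(reg) satp,
                entry = in(reg) entry,
                set = in(reg) mpp_set,
                clr = in(reg) mpp_clr,
                options(noreturn)
            );
        }
    }

Note that entry must be a virtual address that is valid in the new address space, since instruction fetch is translated as soon as the mret completes.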

This is the loader-to-kernel jump point: https://github.com/betrusted-io/xous-core/blob/main/loader/src/main.rs#L1095-L1110

VexRiscv CPU registers

... and who is responsible for saving and restoring them.

Architectural

Machine CSR

Let's work back from the source code to figure out what we can affect in a "restore" context. Here is a map of the writable CSRs on the VexRiscv (as read out of the source code; these aren't fully analyzed yet):

Writable CSRs on the vexriscv

These are not used by Xous, because Xous does not use machine mode:

The "sstatus" registers are maintained by the kernel, and do not need an explicit "restore":

  • 0x300: mstatus/sstatus/status
  • 0x100: sstatus/status

These need to be restored by the loader, prior to Xous resume:

  • 0x180: satp. This needs to have its PID (the ASID field) set to the susres PID, as that's the process we are resuming into. The rest does not need to be touched, as the kernel occupies a megapage that sits at the top of every process (so any PID will have valid kernel pages at the right spot).
  • 0x9c0: zz_258 -> masks externalAinterruptArray_regNext for supervisor mode -> SIM. The value is stored in SIM_BACKING in kernel/src/arch/riscv/irq.rs.
  • 0x344, 0x144 (true alias): sip - read prior to handling an interrupt. It does not seem to have the side effect of clearing the bit. No explicit restore is required; we just need to trigger the resume software interrupt and the normal mechanisms should "do the right thing".
  • 0x104: sie - static values loaded in.
  • In addition, the loader should block until the TRNG kernel port shows availability of data; the first entry should then be read and discarded, as it is an invalid pipeline value.
  • 0x105: stvec. This was not found in the code review, but it also needs to be set up properly. It is statically mapped to _start_trap, and is necessary for the interrupt handler to return, since it returns by triggering an "instruction page fault" (i.e. returning to a known "bad" instruction page, and using that mechanism as the dedicated return-from-interrupt handler). A restore sketch follows this list.
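For illustration, here is a sketch of what restoring these CSRs from the loader's machine-mode context could look like, assuming the saved values have already been recovered from wherever they were stashed; the function and argument names are hypothetical, and this is not the actual loader code:

    // Restore the supervisor CSRs listed above before jumping into the kernel.
    #[cfg(target_arch = "riscv32")]
    fn restore_supervisor_csrs(satp: usize, sim: usize, sie: usize, stvec: usize) {
        unsafe {
            core::arch::asm!(
                "csrw 0x180, {satp}",   // satp, with the ASID patched to the susres PID
                "csrw 0x9c0, {sim}",    // VexRiscv SIM (supervisor external-interrupt mask)
                "csrw 0x104, {sie}",    // sie: static interrupt-enable bits
                "csrw 0x105, {stvec}",  // stvec: statically mapped to _start_trap
                satp = in(reg) satp,
                sim = in(reg) sim,
                sie = in(reg) sie,
                stvec = in(reg) stvec,
            );
        }
    }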

Peripherals that could delay suspend

This is an ad-hoc list of things that keep me up at night when I think about suspend/resume.

  • I2C - split transactions in progress
  • Audio - currently playing audio buffer
  • Engine25519 - currently computing, plus microcode/computation state
  • SHA - currently computing, plus digest state
  • AES - CPU AES registers?
  • Memlcd - redraw in progress
  • SPINOR - erase/program in progress
  • JTAG - eFuse operation in progress