ASM on the ESP8266 - mhightower83/Arduino-ESP8266-misc GitHub Wiki

Fumbling through ASM on the ESP8266

Overview

The ESP8266 uses a subset of the Xtensa instruction set. It has 16 32-bit registers, a0 through a15. Registers a0 and a1 each serve a special purpose.

When the call0 or callx0 instruction is used, the hardware sets a0 to the address of the instruction following the call. On entry, the called function will often save a0 on the stack; a really short function that does not call other functions might not. The ret instruction loads a0 into the program counter to return. Within a function, a0 can be used like any other register once it has been saved on the stack.

By convention on the ESP8266, a1 is used as the stack pointer. Some Xtensa processors are configured with a register spill feature that automatically saves the registers on the stack. For those configurations, register a1 has a strong hardware connection with stack handling. A small function will often, though not always, create a stack frame of at least 16 bytes. The stack grows from a high to a low address, so you often see a small function start with ADDI a1, a1, -16. (ADDI.N cannot encode -16; its immediate range is -1, 1..15.)

Basic ASM Example

An example ASM function template to show how to start and end a Basic ASM function.


extern "C" size_t example_fn(void *src);
asm(
    ".section     .text.example_fn,\"ax\",@progbits\n\t"
    ".literal_position\n\t"
    ".literal     .ets_strlen, ets_strlen\n\t"
    ".align       4\n\t"
    ".global      example_fn\n\t"
    ".type        example_fn, @function\n\t"
    "\n"
"example_fn:\n\t"
    // Function entry, save some stuff to restore at exit.
    "addi         a1,     a1,     -16\n\t"        // Create Stack Frame
    // Registers a0, a2 through a11 may be clobbered by a call to another function.
    // Callee must restore a1 at return.
    "s32i         a0,     a1,     0\n\t"
    "s32i         a12,    a1,     8\n\t"
    
    // Add your logic here
    // ...
    // In this example we saved a12 so we could use register a12
    // to save function argument one (a2) across other function calls.
    "mov          a12,    a2\n\t"
    // ...

    // Restore and exit with result in a2
    "l32i         a12,    a1,     8\n\t"
    "l32i         a0,     a1,     0\n\t"
    "addi         a1,     a1,     16\n\t"
    "ret\n\t"
    ".size example_fn, .-example_fn\n\t"
);

Extended ASM Notes

I find Extended ASM to be tricky. You read the docs and think you understand what they are saying. You try it, only to find out you know nothing about what you are doing. :(

I have found with the newer GNU v10 compiler this has only gotten worse. You really need to disassemble functions that use Extended ASM to be sure the compiler did not corrupt the result. Make sure it doesn't trash registers that are needed later.

The compiler only knows what you tell it about your inline assembly. What it knows is conveyed after the : markers.

  • Tell it about the registers you clobber/overwrite.
    • e.g. asm volatile ("movi a2, 0\n\tmovi a3, 0\n\t" ::: "a2", "a3");
    • The example below has no clobber list and lets the compiler pick the scratch register.
      • CAUTION: if you make calls from your ASM, the registers the compiler has chosen will most likely be overwritten by the called function. If so, don't use this technique.
      • Define a temporary variable and assign it as an output.
      • If you also have input variables, depending on when you use the temporary, you may need to add the & qualifier, e.g. "=&r"(tmp). This indicates that input registers are not available for reuse when this output register is used.
  uint32_t tmp;  // Let the compiler select the optimum scratch register
  asm volatile(
    "movi.n           %0,   0\n\t"
    "wsr.dbreakc0     %0\n\t"
    "wsr.ibreakenable %0\n\t"
    "wsr.icount       %0\n\t"
    :"=r"(tmp) ::);
  • When finished with an input register, don't use it as a scratch register. Add a scratch register to the output list instead, as in the above example.

    • I had problems with an HWDT crash when I did and combined it with compiler optimizations O2, O3, or Ofast.
    • And yet (maybe a newer compiler version) I have seen the compiler ignore values previously loaded into input registers and reload the values.
      • If your ASM only reads a register-held value and you do not want it reloaded later for "C" code references, declare it as a read/write output ("+a") and only read it. This may not always have a minimizing effect on size. If the register pool runs empty, it could force the compiler to save the register value on the stack for later. I am finding it is always a good idea to look at what the compiler did to your diligently written Extended ASM and the surrounding code.
    • Do not modify the contents of input registers. If you need to, move the register to the output list as a read/write register, e.g. uint32_t val=42; asm("... \n\t" : "+&ar"(val) : "r"(val_in):);.
  • When calling a function remember to list the modified scratch registers a0,a2-a11. After the clobbered registers, include "memory" to prevent the optimizer from moving the line. This is often needed when used in a loop without any local variable references.

    • e.g. asm volatile ("callx0 %0\n\t" :: "r"(Cache_Read_Disable): "a0", "a2", "a3", "a4", "a5", "a6", "a7", "a8", "a9", "a10", "a11", "memory");
  • Things get confusing when saving to a global variable or other memory block address:

    • When you need to save to a global variable you are saving to a memory location.
    • I was unable to use an output register to save to a global variable.
    • When the output is a memory location, you use a base register to reference it. That base register is an input. TODO: Needs better thought organization/cleanup.
      • I had to pass the global variable address via an input register instead.
      • Then, use that register as a base register to an s32i.n instruction to save/output.
    • The example below saves the address location of data to the memory location of save_here
    • Note, no output registers used. The output is written to memory and the address of that memory is held in an input register.
    • "=p" I have seen this work as an output register on a single line ASM; however, it quickly fails with more instructions. Somebody likes to insert s32i.n a2, a2, 0 (or some other combination of uninitialized registers) which is not very useful and could be very hard to debug! Staying with memory store base addresses passed via the input list for now.
void *save_here = NULL;
uint32_t data = 0xaa55;

__attribute__((noinline)) 
void save_to_pointer(void) {
  // This results in two input registers loaded with 
  // &save_here and &data from the literals table.
  //
  // "I", 12-bit signed int for movi.
  // "I", for l32i or s32i the offset is an unsigned 10-bit value,
  // always a multiple of 4. The instruction encoding stores it
  // in 8 bits.
  constexpr int off = 0; // range 0 - 1020
  asm volatile (
      "s32i %0, %1, %2\n\t"
      : // No output register constraints appear to work
      : "a"(&data), "a"(&save_here), "I"(off)
      :);
  // 4020121c:    fffe21     l32r    a2, 40201214
  // 4020121f:    fffe31     l32r    a3, 40201218
  // 40201222:    0239       s32i.n  a2, a3, 0

}
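
For the function-call bullet above, a wrapper that lists the caller-saved registers as clobbers might look like the minimal sketch below. The names call_no_args, bump, and counter are hypothetical and exist only for this illustration. The non-Xtensa branch is an assumption, added so the sketch can also be compiled and exercised on a host machine.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical callee used only for this illustration.
static uint32_t counter = 0;
extern "C" void bump(void) { counter += 1; }

// Call a function that takes no arguments and returns nothing.
static inline void call_no_args(void (*fn)(void)) {
  assert(fn != nullptr);
#if defined(__XTENSA__)
  // a0 and a2..a11 may be overwritten by the callee, so tell the compiler.
  // "memory" keeps the optimizer from moving loads/stores across the call.
  asm volatile(
    "callx0 %0\n\t"
    :
    : "r"(fn)
    : "a0", "a2", "a3", "a4", "a5", "a6", "a7", "a8", "a9", "a10", "a11",
      "memory");
#else
  fn();  // Host fallback (assumption): plain C++ call, no inline assembly.
#endif
}
```

Because every caller-saved register is in the clobber list, the compiler has to place the input operand in one of a12..a15, so the function address cannot land in a register the callee is free to overwrite.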

A richer example of extended asm:

struct DBREAK_S {
  const void *a = NULL;
  uint32_t c = 0;
};

inline 
struct DBREAK_S setDataBreakpoint(struct DBREAK_S dbreak) {
  uint32_t tmp;
  constexpr size_t intlevel = 15u;
  asm volatile(
    "memw\n\t"    // My thinking - ensure that pipeline data that breaks has been processed.
    "excw\n\t"
    "rsil           %[old_ps],     %[new_intlevel]\n\t"
    "xsr.dbreaka0   %[addr]\n\t"        // 144 == DBREAKA  // dsync
    "xsr.dbreakc0   %[RW_Mask]\n\t"     // 160 == DBREAKC  // dsync
    "wsr.ps         %[old_ps]\n\t"
    "rsync\n\t"
    : // outputs
      [addr]"+ar"(dbreak.a),
      [RW_Mask]"+ar"(dbreak.c),
      [old_ps]"=&ar"(tmp)        // constraint `&` is not needed in this example
    : // inputs
      [new_intlevel]"i"(intlevel)
    :
  );
  return dbreak;
}
  • By default, the compiler may assign an output operand to the same register as an input (input register reuse). To prevent reuse, use the & constraint. This Stack Exchange discussion explains it a bit: Review and understand &. If your understanding is still not clear, just use the & constraint for now and re-read the discussion later or look for other discussions. It is an important constraint to understand. Register reuse can also occur with just a single line of assembly, which may not always be what you want. Avoid applying the & output constraint everywhere by default, since doing so may reduce register usage efficiency.
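
A minimal sketch of a case where & matters, assuming a hypothetical helper named sum_words that adds two words read through pointers. The output is written by the first instruction while the second input is still needed, so without & the compiler would be free to give the output the same register as q. The host fallback branch is an assumption added so the sketch compiles outside the Xtensa toolchain.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: returns *p + *q.
static inline uint32_t sum_words(const uint32_t *p, const uint32_t *q) {
  assert(p != nullptr && q != nullptr);
#if defined(__XTENSA__)
  uint32_t out, tmp;
  asm volatile(
    "l32i %[out], %[p], 0\n\t"         // out is written here ...
    "l32i %[tmp], %[q], 0\n\t"         // ... while input q is still needed
    "add  %[out], %[out], %[tmp]\n\t"
    : [out]"=&r"(out),                 // & keeps out away from the input registers
      [tmp]"=r"(tmp)                   // written only as the last input is consumed
    : [p]"r"(p), [q]"r"(q)
    : "memory");                       // the asm reads memory through p and q
  return out;
#else
  return *p + *q;  // Host fallback (assumption): plain C++.
#endif
}
```

Here tmp does not need &: it is written by the same instruction that consumes the last input, and an l32i reads its base register before it writes the destination.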

There may be nuances to the volatile qualifier that I don't yet recognize. My examples here are most likely using it where it is not needed. It disables some optimizations around the Extended ASM; follow the link for specifics.

  • Observed: the compiler moved code that was not in a loop, and __volatile__ prevented the move. The statement was __asm__ ("rsr.excvaddr %0;" :"=r"(excvaddr)::); which has only an output parameter. I don't know if an input would have changed the outcome.

The end of an assembly instruction can be terminated with a ;, \n, or \n\t; however, when viewing the .s file created, it is easier to read if you use \n\t. The result is each line of the assembly will be on a new line and tabbed over. With ; it all appears on one line.

Simple Constraints

Subset from 6.47.3.1 tools/xtensa-lx106-elf/share/info/gcc.info

  • r A register operand is allowed provided that it is in a general register.
  • i An immediate integer operand (one with constant value) is allowed. This includes symbolic constants whose values will be known only at assembly time or later.
  • n An immediate integer operand with a known numeric value is allowed. Many systems cannot support assembly-time constants for operands less than a word wide. Constraints for these operands should use 'n' rather than 'i'.

Constraints from Xtensa—config/xtensa/constraints.md

Register constraints

  • a General-purpose 32-bit register
    • Range a0, a2..a15
    • a1 is the stack pointer and is excluded.
  • b One-bit boolean register
    • Not available on the ESP8266
    • #define XCHAL_HAVE_BOOLEANS 0 /* boolean registers */ in xtensa/config/core-isa.h
  • A MAC16 40-bit accumulator register
    • Not available on the ESP8266
    • #define XCHAL_HAVE_MAC16 0 /* MAC16 package */ in xtensa/config/core-isa.h

Integer constant constraints

  • I Signed 12-bit integer constant, for use in MOVI instructions
    • Range -2048..2047
  • I Also accepted for the byte offset field of L32I and S32I instructions. The offset is an unsigned multiple of 4, encoded in 8 bits, rather than a signed 12-bit value.
    • Constraints i and n also work for this
    • Range 0..1020
  • J A signed 8-bit integer constant, for use in ADDI instructions.
    • Range -128..127
    • Range -1, 1..15 for ADDI.N
  • K Integer constant valid for BccI instructions
    • i.e. -1, 1..8, 10, 12, 16, 32, 64, 128, 256
  • L Unsigned constant valid for BccUI instructions
    • i.e. 32768, 65536, 2..8, 10, 12, 16, 32, 64, 128, 256
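
The ranges listed above can be written down as compile-time checks. The following is only a sketch: the function names are hypothetical, and the predicates mirror the ranges in this list, not the compiler's own implementation.

```cpp
#include <cstdint>

// I: MOVI immediate, signed 12-bit.
constexpr bool fits_movi(int32_t v) { return v >= -2048 && v <= 2047; }

// L32I/S32I byte offset: 0..1020 in multiples of 4.
constexpr bool fits_l32i_offset(int32_t v) {
  return v >= 0 && v <= 1020 && (v % 4) == 0;
}

// J: ADDI immediate, signed 8-bit.
constexpr bool fits_addi(int32_t v) { return v >= -128 && v <= 127; }

// ADDI.N immediate: -1 or 1..15.
constexpr bool fits_addi_n(int32_t v) { return v == -1 || (v >= 1 && v <= 15); }

// Values common to both K (BccI) and L (BccUI).
constexpr bool in_common_branch_set(int64_t v) {
  return (v >= 2 && v <= 8) || v == 10 || v == 12 || v == 16 || v == 32 ||
         v == 64 || v == 128 || v == 256;
}

// K: BccI immediate.
constexpr bool fits_bcci(int32_t v) {
  return v == -1 || v == 1 || in_common_branch_set(v);
}

// L: BccUI immediate.
constexpr bool fits_bccui(uint32_t v) {
  return v == 32768u || v == 65536u || in_common_branch_set(v);
}

static_assert(fits_movi(-2048) && !fits_movi(2048), "MOVI range");
static_assert(fits_l32i_offset(1020) && !fits_l32i_offset(1024), "L32I offset range");
static_assert(fits_addi(-16) && !fits_addi_n(-16), "ADDI vs ADDI.N");
```

A check like static_assert(fits_l32i_offset(off), "...") placed next to a constexpr offset reports an out-of-range value at the C++ level, with a message you control.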

The rest of these do not appear in tools/xtensa-lx106-elf/share/info/gccint.info; however, they appear in other references on the Internet for Xtensa processors. My observations are listed below. I am now thinking it might be best to use i and n instead, for identifying general integers. This appears to have the effect that when values are too large the assembler reports the issue instead of the compiler.

  • M An integer constant for use with MOVI.N instructions.
    • Range -32..95
  • N An unsigned 8-bit integer constant shifted left by 8 bits for use with ADDMI instructions.
  • O An integer constant that can be used in ADDI.N instructions.
    • Range -1, 1..15
  • P An integer constant that can be used as a mask value in an EXTUI instruction.
  • Y A constant that can be used in relaxed MOVI instructions.
    • The cryptic level is high with this one.
    • I assume they refer to a MOVI that is changed to an L32R and add the constant to the literal area. Doesn't seem to work or I don't know what they are talking about. Most likely the latter.
    • Constraints i or n work fine, for the case of a MOVI with immediate values that are too large. The assembler will change the instruction to an L32R, and add the constant to the literal area.

Memory constraints

No idea how to get these to work and there is this comment that I don't follow:

Do not use define_memory_constraint here. Doing so causes reload to force some constants into the constant pool, but since the Xtensa constant pool can only be accessed with L32R instructions, it is always better to just copy a constant into a register. Instead, use regular constraints but add a check to allow pseudos during reload.

  • R Memory that can be accessed with a 4-bit unsigned offset from a register.
  • T Memory in a literal pool (addressable with an L32R instruction)
  • U Memory that is not in a literal pool.

Register Usage

The ESP8266 uses the Call0 ABI. For a more complete description see "8.1.2 CALL0 Register Usage and Stack Layout" in Xtensa® Instruction Set Architecture (ISA) Reference Manual. The Call0 ABI does not make use of register windows, relying instead on a fixed set of 16 registers without window rotation. Summary of Call0 ABI register usage:

  • a0 - return address
  • a1 - stack pointer (alias sp)
  • a2 - first argument and result of a call (in simple cases)
  • a3-a7 - second through sixth arguments of a call (in simple cases).
  • a8 - scratch register or Static Chain, when more than 6 arguments are passed on the stack. See Section 8.1.8 of Xtensa® Instruction Set Architecture (ISA) Reference Manual.
  • a9-a11 - scratch.
  • a12-a15 - callee-save (a function must preserve these for its caller).

I am having trouble parsing this:

On a FreeRTOS API call, callee-save registers are saved only when a task context switch occurs, and other registers are not saved at all (the caller does not expect them to be preserved). On an interrupt, callee-saved registers might only be saved and restored when a task context-switch occurs, but all other registers are always saved and restored.

This is mostly taken from some FreeRTOS documentation found here.

I find I am not believing what I think they are saying. From what I have seen with the GNU compiler output, I think this is the situation:

  1. ISRs must save/restore all registers they use.
  2. Functions calling other functions or NON-OS APIs have the expectation that the callee-saved registers will be preserved.
  3. In a function call, registers a2-a11 are not preserved. If registers a0 and/or a1 are altered they must be restored before return.
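
The register roles summarized above can be captured in a few predicates. This is only a sketch of my reading of the Call0 ABI as listed in this section. The function names are hypothetical, and register numbers 0..15 stand for a0..a15.

```cpp
// Register numbers 0..15 stand for a0..a15 (values above 15 are not valid).

// a2..a7 carry the first six arguments in simple cases.
constexpr bool is_argument_register(unsigned reg) { return reg >= 2 && reg <= 7; }

// a12..a15 must be preserved by the called function.
constexpr bool is_callee_saved(unsigned reg) { return reg >= 12 && reg <= 15; }

// From the caller's point of view: which registers still hold the same
// value after a call returns? a0 is overwritten by the call instruction
// itself, a2..a11 may be clobbered, a1 (stack pointer) must be restored.
constexpr bool survives_call(unsigned reg) {
  return reg == 1 || is_callee_saved(reg);
}

static_assert(survives_call(12) && !survives_call(2), "a12 survives, a2 does not");
```

This is the same reasoning the Basic ASM example uses when it copies a2 into a12 before calling other functions.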

ESP8266 Exceptions and Interrupts

Summary of some of the ESP8266 features:

  • XEA2, not XEA1.
  • Hardware Interrupts at level 1
  • Debug exception at level 2, in other words, some breakpoint support.
  • NMI support at level 3
  • Follow __XTENSA_CALL0_ABI__ defines. There is no support for register windows.

A DoubleException can occur when already in an exception and a new one is generated, for example, by syscall. Ref. 4.4.1.2 Exception Causes under the Exception Option.

My interpretation of what I read is that EXCM when set, will block all interrupts until cleared. When EXCM is 0, events above INTLEVEL can generate interrupts/exceptions. The description presented at 4.4.5.4 Checking for Interrupts, indicates that EXCM will get set and INTLEVEL will rise to match the current interrupt event, thus blocking any future interrupts from that level while processing the current event.
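
My reading of the masking rule can be sketched as below, assuming the ISA manual's definition of the current interrupt level, CINTLEVEL = max(PS.EXCM * EXCMLEVEL, PS.INTLEVEL), with INTLEVEL in PS bits 3..0 and EXCM in PS bit 4. The function names are hypothetical. EXCMLEVEL is passed as a parameter because I am assuming, not asserting, that it is 1 on the ESP8266; check XCHAL_EXCM_LEVEL in xtensa/config/core-isa.h.

```cpp
#include <cstdint>

// PS register fields: INTLEVEL in bits 3..0, EXCM in bit 4.
constexpr uint32_t ps_intlevel(uint32_t ps) { return ps & 0xFu; }
constexpr bool ps_excm(uint32_t ps) { return (ps & 0x10u) != 0; }

// Current interrupt level: max(EXCM ? excm_level : 0, INTLEVEL).
constexpr uint32_t current_int_level(uint32_t ps, uint32_t excm_level) {
  return (ps_excm(ps) && excm_level > ps_intlevel(ps)) ? excm_level
                                                       : ps_intlevel(ps);
}

// An interrupt is taken only when its level is above the current level.
constexpr bool interrupt_allowed(uint32_t ps, uint32_t excm_level,
                                 uint32_t irq_level) {
  return irq_level > current_int_level(ps, excm_level);
}

// With EXCM set and an assumed EXCMLEVEL of 1, level-1 interrupts are masked.
static_assert(!interrupt_allowed(0x10u, 1u, 1u), "EXCM masks level 1");
```

Under that assumption, setting EXCM blocks the level-1 hardware interrupts listed above, while a higher level such as the NMI at level 3 would still get through.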
