How to debug a HardFault on an ARM Cortex-M MCU

https://interrupt.memfault.com/blog/cortex-m-fault-debug

Faults happen on embedded devices all the time for a variety of reasons, ranging from something as simple as a NULL pointer dereference to something more unexpected like running a faulty code path only when in a zero-g environment on the Tower of Terror in Disneyland [1]. It’s important for any embedded engineer to understand how to debug and resolve this class of issue quickly.

In this article, we explain how to debug faults on ARM Cortex-M based devices. In the process, we learn about fault registers, how to automate fault analysis, and how to recover from some faults without rebooting the MCU. We include practical examples, with a step-by-step walk-through on how to investigate them.

If you’d rather listen to me present this information and see some demos in action, watch this webinar recording.


Table of Contents

Determining What Caused The Fault
Relevant Status Registers
Configurable Fault Status Registers (CFSR) - 0xE000ED28
HardFault Status Register (HFSR) - 0xE000ED2C
Recovering the Call Stack
Automating the Analysis
Halting & Determining Core Register State
Fault Register Analyzers
Postmortem Analysis
Recovering From A Fault
Examples
eXecute Never Fault
Bad Address Read
Coprocessor Fault
Imprecise Fault
Fault Entry Exception
Recovering from a UsageFault without a SYSRESET

Determining What Caused The Fault

All MCUs in the Cortex-M series have several different pieces of state which can be analyzed when a fault takes place to trace down what went wrong.

First we will explore the dedicated fault status registers that are present on all Cortex-M MCUs except the Cortex-M0.

If you are trying to debug a Cortex-M0, you can skip ahead to the next section where we discuss how to recover the core register state and instruction being executed at the time of the exception.

NOTE: If you already know the state to inspect when a fault occurs, you may want to skip ahead to the section about how to automate the analysis.


Faults from Faults!

The astute observer might wonder what happens when a new fault occurs in the code dealing with a fault. If you have enabled configurable fault handlers (i.e., MemManage, BusFault, or UsageFault), a fault generated in these handlers will trigger a HardFault.

Once in the HardFault Handler, the ARM Core is operating at a non-configurable priority level, -1. At this level or above, a fault will put the processor in an unrecoverable state where a reset is expected. This state is known as Lockup.

Typically, the processor will automatically reset upon entering lockup but this is not a requirement per the specification. For example, you may have to enable a hardware watchdog for a reset to take place. It’s worth double checking the reference manual for the MCU being used for clarification.

When a debugger is attached, lockup often has a different behavior. For example, on the NRF52840, “Reset from CPU lockup is disabled if the device is in debug interface mode” [5].

When a lockup happens, the processor will repeatedly fetch the same fixed instruction, 0xFFFFFFFE or the instruction which triggered the lockup, in a loop until a reset occurs.

Fun Fact: Whether or not some classes of MemManage or BusFaults trigger a fault from an exception is actually configurable via the MPU_CTRL.HFNMIENA & CCR.BFHFNMIGN register fields, respectively.

Automating the Analysis

At this point we have gone over all the pieces of information which can be manually examined to determine what caused a fault. While this might be fun the first couple of times, it can become a tiresome and error-prone process if you wind up doing it often. In the following sections we’ll explore how we can automate this analysis!


When you next start gdb, you can source the svd_gdb.py script and use it to start inspecting registers. Here’s some output for the svd plugin we will use in the examples below:

[Image: example output from the svd plugin]

Postmortem Analysis

The previous two approaches are only helpful if we have a debug or physical connection to the device. Once the product has shipped and is out in the field, these strategies will not help to triage what went wrong on devices.

One approach is to simply try to reproduce the issue on site. This is a guessing game (are you actually reproducing the same issue the customer hit?), can be a huge time sink, and in some cases is not even feasible [1].

Another strategy is to log the fault register and stack values to persistent storage and periodically collect or push the error logs. On the server side, the register values can be decoded and addresses can be symbolicated to try to root cause the crash.
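As a rough illustration of the logging strategy, here is a minimal sketch of a fault record that could be written to persistent storage and decoded later on a server. The record layout, the magic value, and the names fault_record_t, fault_record_pack, and fault_record_unpack are all assumptions made for this example and are not part of any existing library. Fields are serialized as little-endian words so the decoder does not depend on the device's struct layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record layout: names and field choice are illustrative. */
typedef struct {
  uint32_t cfsr;           /* Configurable Fault Status Register value */
  uint32_t hfsr;           /* HardFault Status Register value */
  uint32_t bfar;           /* only meaningful when CFSR.BFARVALID is set */
  uint32_t return_address; /* stacked PC from the exception frame */
  uint32_t lr;             /* stacked LR from the exception frame */
} fault_record_t;

#define FAULT_RECORD_MAGIC 0x464C5431u /* arbitrary marker value */
#define FAULT_RECORD_WORDS 6u          /* magic + five fields */
#define FAULT_RECORD_BYTES (FAULT_RECORD_WORDS * 4u)

static void put_le32(uint8_t *out, uint32_t v) {
  out[0] = (uint8_t)v;
  out[1] = (uint8_t)(v >> 8);
  out[2] = (uint8_t)(v >> 16);
  out[3] = (uint8_t)(v >> 24);
}

static uint32_t get_le32(const uint8_t *in) {
  return (uint32_t)in[0] | ((uint32_t)in[1] << 8) |
         ((uint32_t)in[2] << 16) | ((uint32_t)in[3] << 24);
}

/* Serializes rec into buf. Returns bytes written, or 0 if buf is too small. */
size_t fault_record_pack(const fault_record_t *rec, uint8_t *buf, size_t len) {
  assert(rec != NULL && buf != NULL);
  if (len < FAULT_RECORD_BYTES) {
    return 0;
  }
  const uint32_t words[FAULT_RECORD_WORDS] = {
      FAULT_RECORD_MAGIC, rec->cfsr, rec->hfsr,
      rec->bfar, rec->return_address, rec->lr};
  for (size_t i = 0; i < FAULT_RECORD_WORDS; i++) {
    put_le32(&buf[i * 4u], words[i]);
  }
  return FAULT_RECORD_BYTES;
}

/* Decodes buf into rec. Returns 1 on success, 0 on short input or bad magic. */
int fault_record_unpack(const uint8_t *buf, size_t len, fault_record_t *rec) {
  assert(buf != NULL && rec != NULL);
  if (len < FAULT_RECORD_BYTES || get_le32(buf) != FAULT_RECORD_MAGIC) {
    return 0;
  }
  rec->cfsr = get_le32(&buf[4]);
  rec->hfsr = get_le32(&buf[8]);
  rec->bfar = get_le32(&buf[12]);
  rec->return_address = get_le32(&buf[16]);
  rec->lr = get_le32(&buf[20]);
  return 1;
}
```

On a device, the packed bytes would be handed to whatever flash or noinit RAM driver is available. A real record would likely also carry a firmware version and a checksum, which this sketch omits.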

Alternatively, an end-to-end firmware error analysis system, such as Memfault, can be used to automatically collect, transport, deduplicate and surface the faults and crashes happening in the field. Here is some example output from Memfault for the bad memory read example we will walk through below:

[Image: example Memfault output for the bad memory read example]

Follow the instructions above to set up support for reading SVD files from GDB, then build and flash the example app:

    $ make
    [...]
    Linking library
    Generated build/nrf52.elf
    $ arm-none-eabi-gdb-py --eval-command="target remote localhost:2331" --ex="mon reset" --ex="load" --ex="mon reset" --se=build/nrf52.elf
    (gdb) source PyCortexMDebug/cmdebug/svd_gdb.py
    (gdb) svd_load cortex-m4-scb.svd
    Loading SVD file cortex-m4-scb.svd...
    (gdb)


eXecute Never Fault

Code

    int illegal_instruction_execution(void) {
      int (*bad_instruction)(void) = (void *)0xE0000000;
      return bad_instruction();
    }

Analysis

    (gdb) break main
    (gdb) continue
    Breakpoint 1, main () at ./cortex-m-fault-debug/main.c:180
    180       xQueue = xQueueCreate(mainQUEUE_LENGTH, sizeof(unsigned long));
    (gdb) set g_crash_config=0
    (gdb) c
    Continuing.

    Program received signal SIGTRAP, Trace/breakpoint trap.
    0x00000218 in my_fault_handler_c (frame=0x200005e8 <ucHeap+1152>) at ./cortex-m-fault-debug/startup.c:91
    91        HALT_IF_DEBUGGING();
    (gdb) bt
    #0  0x00000218 in my_fault_handler_c (frame=0x200005e8 <ucHeap+1152>) at ./cortex-m-fault-debug/startup.c:91
    #1  <signal handler called>
    #2  0x00001468 in prvPortStartFirstTask () at ./cortex-m-fault-debug/freertos_kernel/portable/GCC/ARM_CM4F/port.c:267
    #3  0x000016e6 in xPortStartScheduler () at ./cortex-m-fault-debug/freertos_kernel/portable/GCC/ARM_CM4F/port.c:379
    #4  0x1058e476 in ?? ()

We can check the CFSR to see if there is any information about the fault which occurred.

    (gdb) p/x *(uint32_t*)0xE000ED28
    $3 = 0x1
    (gdb) svd SCB CFSR_UFSR_BFSR_MMFSR
    Fields in SCB CFSR_UFSR_BFSR_MMFSR:
        IACCVIOL:  1  Instruction access violation flag
        [...]

That’s interesting! We hit a Memory Management instruction access violation fault even though we haven’t enabled any MPU regions. From the CFSR, we know that the stacked frame is valid so we can take a look at that to see what it reveals:

    (gdb) p/a *frame
    $1 = {
      r0 = 0x0 <g_pfnVectors>,
      r1 = 0x200003c4 <ucHeap+604>,
      r2 = 0x10000000,
      r3 = 0xe0000000,
      r12 = 0x200001b8 <ucHeap+80>,
      lr = 0x195 <prvQueuePingTask+52>,
      return_address = 0xe0000000,
      xpsr = 0x80000000
    }

We can clearly see that the core was trying to execute from address 0xe0000000 and that the calling function was prvQueuePingTask.

From the ARMv7-M reference manual15 we find:

The MPU is restricted in how it can change the default memory map attributes associated with System space, that is, for addresses 0xE0000000 and higher. System space is always marked as XN, Execute Never.

So the fault registers didn’t lie to us, and it does make sense that we hit a memory management fault!
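Reading individual CFSR bits by hand gets old quickly, so here is a minimal host-side sketch that turns a raw CFSR value into the names of the flags that are set. The bit positions follow the ARMv7-M layout (MMFSR in bits 0-7, BFSR in bits 8-15, UFSR in bits 16-31). The table and the function name cfsr_decode are assumptions for this example, not an existing API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
  uint32_t mask;
  const char *name;
} cfsr_flag_t;

/* ARMv7-M CFSR flags, ordered from bit 0 upwards. */
static const cfsr_flag_t k_cfsr_flags[] = {
    /* MMFSR, bits 0-7 */
    {1u << 0, "IACCVIOL"},   {1u << 1, "DACCVIOL"},    {1u << 3, "MUNSTKERR"},
    {1u << 4, "MSTKERR"},    {1u << 5, "MLSPERR"},     {1u << 7, "MMARVALID"},
    /* BFSR, bits 8-15 */
    {1u << 8, "IBUSERR"},    {1u << 9, "PRECISERR"},   {1u << 10, "IMPRECISERR"},
    {1u << 11, "UNSTKERR"},  {1u << 12, "STKERR"},     {1u << 13, "LSPERR"},
    {1u << 15, "BFARVALID"},
    /* UFSR, bits 16-31 */
    {1u << 16, "UNDEFINSTR"}, {1u << 17, "INVSTATE"},  {1u << 18, "INVPC"},
    {1u << 19, "NOCP"},       {1u << 24, "UNALIGNED"}, {1u << 25, "DIVBYZERO"},
};

/*
 * Writes the space-separated names of all set flags into out.
 * Returns the number of names written; stops early if out is too small.
 */
size_t cfsr_decode(uint32_t cfsr, char *out, size_t out_len) {
  assert(out != NULL && out_len > 0);
  size_t count = 0;
  size_t used = 0;
  out[0] = '\0';
  for (size_t i = 0; i < sizeof(k_cfsr_flags) / sizeof(k_cfsr_flags[0]); i++) {
    if ((cfsr & k_cfsr_flags[i].mask) == 0u) {
      continue;
    }
    const char *name = k_cfsr_flags[i].name;
    const size_t name_len = strlen(name);
    const size_t sep_len = (count > 0) ? 1u : 0u;
    if (used + sep_len + name_len + 1u > out_len) {
      break; /* not enough room for separator, name, and terminator */
    }
    if (sep_len != 0u) {
      out[used++] = ' ';
    }
    memcpy(out + used, name, name_len + 1u); /* also copies the terminator */
    used += name_len;
    count++;
  }
  return count;
}
```

Feeding it the value from this example, 0x1, yields IACCVIOL, which matches what the svd plugin reported.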

Bad Address Read

Code

    uint32_t read_from_bad_address(void) {
      return *(volatile uint32_t *)0xbadcafe;
    }

Analysis

    (gdb) break main
    (gdb) continue
    Breakpoint 1, main () at ./cortex-m-fault-debug/main.c:189
    189       xQueue = xQueueCreate(mainQUEUE_LENGTH, sizeof(unsigned long));
    (gdb) set g_crash_config=1
    (gdb) c
    Continuing.

    Program received signal SIGTRAP, Trace/breakpoint trap.
    0x00000218 in my_fault_handler_c (frame=0x200005e8 <ucHeap+1152>) at ./cortex-m-fault-debug/startup.c:91
    91        HALT_IF_DEBUGGING();

Again, let’s take a look at the CFSR and see if it tells us anything useful.

    (gdb) p/x *(uint32_t*)0xE000ED28
    $13 = 0x8200
    (gdb) svd SCB CFSR_UFSR_BFSR_MMFSR
    Fields in SCB CFSR_UFSR_BFSR_MMFSR:
        [...]
        PRECISERR:  1  Precise data bus error
        [...]
        BFARVALID:  1  Bus Fault Address Register (BFAR) valid flag

Great, we have a precise bus fault, which means the return address in the stack frame points to the instruction which triggered the fault and that we can read BFAR to determine what memory access triggered the fault!

    (gdb) svd/x SCB BFAR
    Fields in SCB BFAR:
        BFAR:  0x0BADCAFE  Bus fault address

    (gdb) p/a *frame
    $16 = {
      r0 = 0x1 <g_pfnVectors+1>,
      r1 = 0x200003c4 <ucHeap+604>,
      r2 = 0x10000000,
      r3 = 0xbadcafe,
      r12 = 0x200001b8 <ucHeap+80>,
      lr = 0x195 <prvQueuePingTask+52>,
      return_address = 0x13a <trigger_crash+22>,
      xpsr = 0x81000000
    }

    (gdb) info line *0x13a
    Line 123 of "./cortex-m-fault-debug/main.c" starts at address 0x138 <trigger_crash+20> and ends at 0x13e <trigger_crash+26>.

    (gdb) list *0x13a
    0x13a is in trigger_crash (./cortex-m-fault-debug/main.c:123).
    118       switch (crash_id) {
    119         case 0:
    120           illegal_instruction_execution();
    121           break;
    122         case 1:
    ===> FAULT HERE
    123           read_from_bad_address();
    124           break;
    125         case 2:
    126           access_disabled_coprocessor();
    127           break;

Great, so we have pinpointed the exact code which triggered the issue and can now fix it!

Coprocessor Fault

Code

    void access_disabled_coprocessor(void) {
      // FreeRTOS will automatically enable the FPU co-processor.
      // Let's disable it for the purposes of this example
      __asm volatile(
          "ldr r0, =0xE000ED88 \n"
          "mov r1, #0 \n"
          "str r1, [r0] \n"
          "dsb \n"
          "vmov r0, s0 \n"
          );
    }

Analysis

    (gdb) break main
    (gdb) continue
    Breakpoint 4, main () at ./cortex-m-fault-debug/main.c:180
    180       xQueue = xQueueCreate(mainQUEUE_LENGTH, sizeof(unsigned long));
    (gdb) set g_crash_config=2
    (gdb) c
    Continuing.

    Program received signal SIGTRAP, Trace/breakpoint trap.
    0x00000218 in my_fault_handler_c (frame=0x20002d80) at ./cortex-m-fault-debug/startup.c:91
    91        HALT_IF_DEBUGGING();

We can inspect CFSR to get a clue about the crash which took place:

    (gdb) p/x *(uint32_t*)0xE000ED28
    $13 = 0x80000
    (gdb) svd SCB CFSR_UFSR_BFSR_MMFSR
    Fields in SCB CFSR_UFSR_BFSR_MMFSR:
        [...]
        NOCP:  1  No coprocessor usage fault.
        [...]

We see it was a coprocessor UsageFault, which tells us we issued an instruction to a coprocessor that is either non-existent or disabled. We know the frame contents are valid so we can inspect that to figure out where the fault originated:

    (gdb) p/a *frame
    $27 = {
      r0 = 0xe000ed88,
      r1 = 0x0 <g_pfnVectors>,
      r2 = 0x10000000,
      r3 = 0x0 <g_pfnVectors>,
      r12 = 0x200001b8 <ucHeap+80>,
      lr = 0x199 <prvQueuePingTask+52>,
      return_address = 0x114 <access_disabled_coprocessor+12>,
      xpsr = 0x81000000
    }

    (gdb) disassemble 0x114
    Dump of assembler code for function access_disabled_coprocessor:
       0x00000108 <+0>:     ldr     r0, [pc, #16]   ; (0x11c)
       0x0000010a <+2>:     mov.w   r1, #0
       0x0000010e <+6>:     str     r1, [r0, #0]
       0x00000110 <+8>:     dsb     sy
    ===> FAULT HERE on a Floating Point instruction
       0x00000114 <+12>:    vmov    r0, s0
       0x00000118 <+16>:    bx      lr

vmov is a floating point instruction, so we now know which coprocessor caused the NOCP. The FPU is enabled using bits 20-23 of the CPACR register located at 0xE000ED88. A value of 0 indicates the extension is disabled. Let’s check it:

    (gdb) p/x (*(uint32_t*)0xE000ED88 >> 20) & 0xf
    $29 = 0x0

We can clearly see the FP Extension is disabled. We will have to enable the FPU to fix our bug.
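The same check can be written as a few lines of C. This is a minimal sketch that works on a CPACR value rather than the register itself so it can be exercised on a host. The helper names are made up for this example. It assumes the usual ARMv7-M encoding, where CP10 and CP11 each occupy two bits and 0b11 means full access.

```c
#include <assert.h>
#include <stdint.h>

#define CPACR_FPU_SHIFT 20u                     /* CP10 at [21:20], CP11 at [23:22] */
#define CPACR_FPU_MASK (0xFu << CPACR_FPU_SHIFT)

static_assert(CPACR_FPU_MASK == 0x00F00000u, "FPU access bits are 20-23");

/* Extracts the four FPU access bits, like the gdb expression above. */
static inline uint32_t cpacr_fpu_bits(uint32_t cpacr) {
  return (cpacr >> CPACR_FPU_SHIFT) & 0xFu;
}

/* True when both CP10 and CP11 are set to 0b11 (full access). */
static inline int cpacr_fpu_full_access(uint32_t cpacr) {
  return cpacr_fpu_bits(cpacr) == 0xFu;
}

/* Returns the value to write back to CPACR to grant full FPU access. */
static inline uint32_t cpacr_with_fpu_enabled(uint32_t cpacr) {
  return cpacr | CPACR_FPU_MASK;
}
```

On the target, the result of cpacr_with_fpu_enabled would be written back to 0xE000ED88, followed by barrier instructions, before any floating point instruction runs.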

Imprecise Fault

Code

    void bad_addr_double_word_write(void) {
      volatile uint64_t *buf = (volatile uint64_t *)0x30000000;
      *buf = 0x1122334455667788;
    }

Analysis

    (gdb) break main
    (gdb) continue
    Breakpoint 4, main () at ./cortex-m-fault-debug/main.c:182
    182       xQueue = xQueueCreate(mainQUEUE_LENGTH, sizeof(unsigned long));
    (gdb) set g_crash_config=3
    (gdb) c
    Continuing.

    Program received signal SIGTRAP, Trace/breakpoint trap.
    0x0000021c in my_fault_handler_c (frame=0x200005e8 <ucHeap+1152>) at ./cortex-m-fault-debug/startup.c:91
    91        HALT_IF_DEBUGGING();

Let’s inspect CFSR:

    (gdb) p/x *(uint32_t*)0xE000ED28
    $31 = 0x400
    (gdb) svd SCB CFSR_UFSR_BFSR_MMFSR
    Fields in SCB CFSR_UFSR_BFSR_MMFSR:
        [...]
        IMPRECISERR:  1  Imprecise data bus error
        [...]

Yikes, the error is imprecise. This means the stack frame will point to the general area where the fault occurred but not the exact instruction!

    (gdb) p/a *frame
    $32 = {
      r0 = 0x55667788,
      r1 = 0x11223344,
      r2 = 0x10000000,
      r3 = 0x30000000,
      r12 = 0x200001b8 <ucHeap+80>,
      lr = 0x199 <prvQueuePingTask+52>,
      return_address = 0x198 <prvQueuePingTask+52>,
      xpsr = 0x81000000
    }
    (gdb) list *0x198
    0x198 is in prvQueuePingTask (./cortex-m-fault-debug/main.c:162).
    157
    158       while (1) {
    159         vTaskDelayUntil(&xNextWakeTime, mainQUEUE_SEND_FREQUENCY_MS);
    160         xQueueSend(xQueue, &ulValueToSend, 0U);
    161
    ==> Crash somewhere around here
    162         trigger_crash(g_crash_config);
    163       }
    164     }
    165
    166     static void prvQueuePongTask(void *pvParameters) {

Analysis after making the Imprecise Error Precise

If the crash was not readily reproducible we would have to inspect the code around this region and hypothesize what looks suspicious. However, recall that there is a trick we can use for the Cortex-M4 to make all memory stores precise. Let’s enable that and re-examine:

    (gdb) mon reset
    Resetting target
    (gdb) c
    Continuing.

    Breakpoint 4, main () at ./cortex-m-fault-debug/main.c:182
    182       xQueue = xQueueCreate(mainQUEUE_LENGTH, sizeof(unsigned long));
    (gdb) set g_crash_config=3

    ==> Make all memory stores precise at the cost of performance
    ==> by setting DISDEFWBUF in the Cortex M3/M4 ACTLR reg
    (gdb) set *(uint32_t*)0xE000E008=(*(uint32_t*)0xE000E008 | 1<<1)

    (gdb) c
    Continuing.

    Program received signal SIGTRAP, Trace/breakpoint trap.
    0x0000021c in my_fault_handler_c (frame=0x200005e8 <ucHeap+1152>) at ./cortex-m-fault-debug/startup.c:91
    91        HALT_IF_DEBUGGING();
    (gdb) p/a *frame
    $33 = {
      r0 = 0x55667788,
      r1 = 0x11223344,
      r2 = 0x10000000,
      r3 = 0x30000000,
      r12 = 0x200001b8 <ucHeap+80>,
      lr = 0x199 <prvQueuePingTask+52>,
      return_address = 0xfa <bad_addr_double_word_write+10>,
      xpsr = 0x81000000
    }
    (gdb) list *0xfa
    0xfa is in bad_addr_double_word_write (./cortex-m-fault-debug/main.c:92).
    90      void bad_addr_double_word_write(void) {
    91        volatile uint64_t *buf = (volatile uint64_t *)0x30000000;
    ==> FAULT HERE
    92        *buf = 0x1122334455667788;
    93      }
    (gdb)

Awesome, that saved us some time. We were able to determine the exact line that caused the crash!
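If you would rather set DISDEFWBUF from firmware (for example, only in debug builds) than from gdb, the read-modify-write can be sketched as below. The helper name make_stores_precise is an assumption for this example. The register address and bit position are the same ones used in the gdb command above. The function takes a pointer so it can be tried against an ordinary variable on a host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ACTLR_ADDR 0xE000E008u       /* Auxiliary Control Register on Cortex-M3/M4 */
#define ACTLR_DISDEFWBUF (1u << 1)   /* disable write buffering for default memory map */

/* Sets DISDEFWBUF while leaving the other ACTLR bits untouched. */
void make_stores_precise(volatile uint32_t *actlr) {
  assert(actlr != NULL);
  *actlr |= ACTLR_DISDEFWBUF;
}
```

On the target this would be called as make_stores_precise((volatile uint32_t *)ACTLR_ADDR) early in boot. As noted above, this costs performance, so it is best kept out of release builds.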

Fault Entry Exception

Code

    void stkerr_from_psp(void) {
      extern uint32_t _start_of_ram[];
      uint8_t dummy_variable;
      const size_t distance_to_ram_bottom = (uint32_t)&dummy_variable - (uint32_t)_start_of_ram;
      volatile uint8_t big_buf[distance_to_ram_bottom - 8];
      for (size_t i = 0; i < sizeof(big_buf); i++) {
        big_buf[i] = i;
      }

      trigger_irq();
    }

Analysis

    (gdb) break main
    (gdb) continue
    Breakpoint 4, main () at ./cortex-m-fault-debug/main.c:182
    182       xQueue = xQueueCreate(mainQUEUE_LENGTH, sizeof(unsigned long));
    (gdb) set g_crash_config=4
    (gdb) c
    Continuing.

    Program received signal SIGTRAP, Trace/breakpoint trap.
    0x0000021c in my_fault_handler_c (frame=0x1fffffe0) at ./cortex-m-fault-debug/startup.c:91
    91        HALT_IF_DEBUGGING();

Let’s take a look at CFSR again to get a clue about what happened:

    (gdb) p/x *(uint32_t*)0xE000ED28
    $39 = 0x1000
    (gdb) svd SCB CFSR_UFSR_BFSR_MMFSR
    Fields in SCB CFSR_UFSR_BFSR_MMFSR:
        [...]
        STKERR:  1  Bus fault on stacking for exception entry

Debug Tips when dealing with a STKERR

There are two really important things to note when a stacking exception occurs:

- The stack pointer will always reflect the correct adjusted position as if the hardware successfully stacked the registers. This means you can find the stack pointer prior to exception entry by adding the adjustment value.
- Depending on what access triggers the exception, the stacked frame may be partially valid. For example, the very last store of the hardware stacking could trigger the fault and all the other stores could be valid. However, the order the hardware pushes register state on the stack is implementation specific. So when inspecting the frame assume the values being looked at may be invalid!

Taking this knowledge into account, let’s examine the stack frame:

    (gdb) p frame
    $40 = (sContextStateFrame *) 0x1fffffe0

Interestingly, if we look up the memory map of the NRF52 [16], we will find that RAM starts at 0x20000000. Our stack pointer location, 0x1fffffe0, is right below that in an undefined memory region. This must be why we faulted! We see that the stack pointer is 32 bytes below RAM, which matches the size of sContextStateFrame. This unfortunately means none of the values stacked will be valid since all stores were issued to a non-existent address space!
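The arithmetic behind "adding the adjustment value" can be sketched in C. The struct below is a reconstruction of sContextStateFrame based on the fields gdb prints for it in this article. The helper sp_before_exception is a hypothetical name. The sketch assumes a basic frame with no floating point state stacked, and it uses bit 9 of the stacked xPSR, which ARMv7-M sets when the hardware inserted 4 bytes of padding to align the stack.

```c
#include <assert.h>
#include <stdint.h>

/* Reconstructed from the gdb output in this article; eight stacked words. */
typedef struct {
  uint32_t r0;
  uint32_t r1;
  uint32_t r2;
  uint32_t r3;
  uint32_t r12;
  uint32_t lr;
  uint32_t return_address;
  uint32_t xpsr;
} sContextStateFrame;

static_assert(sizeof(sContextStateFrame) == 32, "basic frame is 32 bytes");

#define XPSR_STACK_ALIGN_BIT (1u << 9) /* set if 4 bytes of padding were added */

/*
 * Returns the stack pointer value from just before exception entry.
 * Assumes a basic (non floating point) frame.
 */
uint32_t sp_before_exception(uint32_t frame_addr, uint32_t stacked_xpsr) {
  uint32_t sp = frame_addr + (uint32_t)sizeof(sContextStateFrame);
  if ((stacked_xpsr & XPSR_STACK_ALIGN_BIT) != 0u) {
    sp += 4u;
  }
  return sp;
}
```

For the frame above, sp_before_exception(0x1fffffe0, 0) gives 0x20000000, the very bottom of RAM. Since the stacked xPSR cannot be trusted after a STKERR, the alignment bit is unknown here, so treat the result as accurate to within 4 bytes.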

We can manually walk up the stack to get some clues:

    (gdb) x/a 0x20000000
    0x20000000 <uxCriticalNesting>: 0x3020100
    (gdb)
    0x20000004 <g_crash_config>:    0x7060504
    (gdb)
    0x20000008 :    0xb0a0908
    (gdb)
    0x2000000c <s_buffer>:  0xf0e0d0c
    (gdb)
    0x20000010 <s_buffer+4>:        0x13121110
    (gdb)
    0x20000014 <s_buffer+8>:        0x17161514
    (gdb)
    0x20000018 :    0x1b1a1918
    (gdb)
    0x2000001c :    0x1f1e1d1c
    (gdb)
    0x20000020 :    0x23222120

It looks like the RAM has a pattern of sequentially increasing values and that the RAM addresses map to different variables in our code (e.g., pxCurrentTCB). This suggests we overflowed the stack we were using and started to clobber RAM in the system until we ran off the end of RAM!

TIP: To catch this type of failure sooner consider using an MPU Region
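To make the MPU tip a little more concrete, here is a minimal sketch that computes the MPU_RASR value for a small no-access guard region that could be placed at the bottom of a stack. It assumes the ARMv7-M MPU encoding: the SIZE field sits in bits [5:1] and encodes a region of 2^(SIZE+1) bytes, bit 0 enables the region, and leaving the AP field at 0 denies all access. The function name mpu_guard_rasr is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MPU_RASR_ENABLE (1u << 0)  /* region enable bit */
#define MPU_RASR_SIZE_SHIFT 1u     /* SIZE field occupies bits [5:1] */
#define MPU_MIN_REGION_BYTES 32u   /* smallest region ARMv7-M allows */

/*
 * Returns the RASR value for a no-access region of size_bytes at base.
 * Returns 0 if the size is not a power of two of at least 32 bytes,
 * or if base is not aligned to the region size.
 */
uint32_t mpu_guard_rasr(uint32_t base, uint32_t size_bytes) {
  const int is_pow2 =
      size_bytes != 0u && (size_bytes & (size_bytes - 1u)) == 0u;
  if (!is_pow2 || size_bytes < MPU_MIN_REGION_BYTES) {
    return 0u;
  }
  if ((base & (size_bytes - 1u)) != 0u) {
    return 0u;
  }
  uint32_t log2_size = 0u;
  while ((1u << log2_size) < size_bytes) {
    log2_size++;
  }
  assert(log2_size >= 5u && log2_size <= 31u);
  /* AP bits [26:24] stay 0: no access for privileged or unprivileged code. */
  return ((log2_size - 1u) << MPU_RASR_SIZE_SHIFT) | MPU_RASR_ENABLE;
}
```

On the target, the base address would go into MPU_RBAR and the returned value into MPU_RASR for a free region, with the MPU then enabled through MPU_CTRL. With such a region in place, a stack overflow triggers a MemManage fault at the first bad access instead of silently clobbering neighboring variables.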

Since the crash is reproducible, let’s leverage a watchpoint and see if we can capture the stack corruption in action! Let’s add a watchpoint for any access near the bottom of RAM, 0x2000000c:

    (gdb) mon reset
    (gdb) continue
    Breakpoint 4, main () at ./cortex-m-fault-debug/main.c:182
    182       xQueue = xQueueCreate(mainQUEUE_LENGTH, sizeof(unsigned long));
    (gdb) set g_crash_config=4
    (gdb) watch *(uint32_t*)0x2000000c
    Hardware watchpoint 9: *(uint32_t*)0x2000000c

TIP: Sometimes it will take a couple of tries to choose the right RAM range to watch. It’s possible an area of the stack never gets written to and the watchpoint never fires, or that the memory address being watched gets updated many times before the actual failure. In this example, I intentionally opted not to watch 0x20000000 because that is the address of a FreeRTOS variable, uxCriticalNesting, which is updated a lot.

Let’s continue and see what happens:

    (gdb) continue
    Hardware watchpoint 9: *(uint32_t*)0x2000000c

    Old value = 0
    New value = 12
    0x000000c0 in stkerr_from_psp () at ./cortex-m-fault-debug/main.c:68
    68          big_buf[i] = i;
    (gdb) bt
    #0  0x000000c0 in stkerr_from_psp () at ./cortex-m-fault-debug/main.c:68
    #1  0x00000198 in prvQueuePingTask (pvParameters=<optimized out>) at ./cortex-m-fault-debug/main.c:162
    #2  0x00001488 in ?? () at ./cortex-m-fault-debug/freertos_kernel/portable/GCC/ARM_CM4F/port.c:703
    Backtrace stopped: previous frame identical to this frame (corrupt stack?)
    (gdb) list *0xc0
    0xc0 is in stkerr_from_psp (./cortex-m-fault-debug/main.c:68).
    63        extern uint32_t _start_of_ram[];
    64        uint8_t dummy_variable;
    65        const size_t distance_to_ram_bottom = (uint32_t)&dummy_variable - (uint32_t)_start_of_ram;
    66        volatile uint8_t big_buf[distance_to_ram_bottom - 8];
    67        for (size_t i = 0; i < sizeof(big_buf); i++) {
    68          big_buf[i] = i;
    69        }
    70
    71        trigger_irq();
    72      }

Great, we’ve found a variable located on the stack, big_buf, being updated. It must be this function call path which is leading to a stack overflow. We can now inspect the call chain and remove big stack allocations!

Recovering from a UsageFault without a SYSRESET

In this example we’ll just step through the code we developed above and confirm we don’t reset when a UsageFault occurs.

Code

    void unaligned_double_word_read(void) {
      extern void *g_unaligned_buffer;
      uint64_t *buf = g_unaligned_buffer;
      *buf = 0x1122334455667788;
    }

Analysis

    (gdb) break main
    (gdb) continue
    Breakpoint 4, main () at ./cortex-m-fault-debug/main.c:188
    188       xQueue = xQueueCreate(mainQUEUE_LENGTH, sizeof(unsigned long));
    (gdb) set g_crash_config=5
    (gdb) c
    Continuing.

    Program received signal SIGTRAP, Trace/breakpoint trap.
    0x00000228 in my_fault_handler_c (frame=0x200005e8 <ucHeap+1152>) at ./cortex-m-fault-debug/startup.c:94
    94        HALT_IF_DEBUGGING();

We have hit the breakpoint in the fault handler. We can step over it and confirm we fall through to the recover_from_task_fault function.

    (gdb) break recover_from_task_fault
    Breakpoint 12 at 0x1a8: file ./cortex-m-fault-debug/main.c, line 181.
    (gdb) n
    108       volatile uint32_t *cfsr = (volatile uint32_t *)0xE000ED28;
    (gdb) c
    Continuing.

    Breakpoint 12, recover_from_task_fault () at ./cortex-m-fault-debug/main.c:181
    181     void recover_from_task_fault(void) {

    (gdb) list *recover_from_task_fault
    0x1a8 is in recover_from_task_fault (./cortex-m-fault-debug/main.c:181).
    181     void recover_from_task_fault(void) {
    182       while (1) {
    183         vTaskDelay(1);
    184       }
    185     }

If we continue from here we will see the system happily keeps running because the thread which was calling the problematic trigger_crash function is now parked in a while loop. The while loop could be extended in the future to delete and/or restart the FreeRTOS task if we wanted as well.
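The redirection into recover_from_task_fault works by rewriting the stacked exception frame before returning from the handler. The following is a minimal sketch of that idea, not necessarily the exact handler code from the example app. The struct mirrors the sContextStateFrame fields gdb prints above, and redirect_frame_to is a hypothetical helper name. The target is passed as a plain 32-bit address so the sketch can run on a host. Bit 0 is cleared so the stacked PC stays halfword aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the frame fields shown by gdb in this article. */
typedef struct {
  uint32_t r0;
  uint32_t r1;
  uint32_t r2;
  uint32_t r3;
  uint32_t r12;
  uint32_t lr;
  uint32_t return_address;
  uint32_t xpsr;
} sContextStateFrame;

#define XPSR_THUMB_BIT (1u << 24)     /* must be set, Cortex-M only runs Thumb */
#define LR_POISON_VALUE 0xdeadbeefu   /* faults if the recovery function returns */

/* Rewrites the stacked frame so exception return lands at target_addr. */
void redirect_frame_to(sContextStateFrame *frame, uint32_t target_addr) {
  assert(frame != NULL);
  /* Address the core resumes at once the handler returns. */
  frame->return_address = target_addr & ~1u;
  /* The recovery function should never return, so poison its return address. */
  frame->lr = LR_POISON_VALUE;
  /* Drop stale flags and IT state; keep only the Thumb bit. */
  frame->xpsr = XPSR_THUMB_BIT;
}
```

In a real handler this would only be done when the fault came from thread mode, and the logged bits in CFSR would be cleared first so the next fault is not confused with this one.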

Closing

I hope this post gave you a useful overview of how to debug a HardFault on a Cortex-M MCU and that maybe you even learned something new!

Are there tricks you like to use that I didn’t mention or other topics about faults you’d like to learn more about? Let us know in the discussion area below!


References

[1] The Tower of Terror: A Bug Mystery
[2] See “A4.1.1 ARMv7-M and interworking support”
[3] Segger JTrace & Lauterbach Trace32 are both capable of analyzing the ETM
[4] See “3.3.9 Auxiliary Bus Fault Status Register”
[5] See “5.3.6.8 Reset behavior”
[6] MBed OS fault handler
[7] Zephyr ARM fault handler
[8] CMSIS-SVD
[9] CMSIS Software Packs
[10] PyCortexMDebug
[11] See “B1.5.5 Reset behavior” & “B1.4.2 The special-purpose program status registers, xPSR”
[12] nRF52840 Development Kit
[13] JLinkGDBServer
[14] GNU ARM Embedded toolchain for download
[15] See “B3.5.1 Relation of the MPU to the system memory map”
[16] See “4.2.3 Memory map”

Chris Coleman is a founder and CTO at Memfault. Prior to founding Memfault, Chris worked on the embedded software teams at Sun, Pebble, and Fitbit.
