05. Debugging & Testing - josehu07/hux-kernel GitHub Wiki

Debugging & Testing your kernel code is not as easy as debugging user-level applications. For Hux, we basically have the following debugging methods:

  1. Using printing statements/log files: we have already enabled printf/cprintf/snprintf in the previous chapter
    • Simple and convenient ✭
    • Not feasible when the bug itself is in display/logging module or lower-level code it depends on (or other occasions when printing does not work)
  2. Manual breakpoints with assembly: inserting a hlt instruction into the code
    • If buggy code is after hlt, you see the CPU halting
    • If buggy code is above hlt, you see the bug appearing (or the system crashing)
  3. Execution stack tracing: takes some time to enable, but very useful! ✭
  4. Exception handler (see exceptions ✭)
    • Be sure no bug is in the exception handler! (see double fault and triple fault)
    • This is a higher-level concept - we will do this in later chapters covering interrupts
  5. [External] Using GDB with QEMU: we develop Hux on a hosted environment and Hux runs in an emulator such as QEMU, which means we can possibly launch debuggers on our hosted OS to break and analyze our kernel code
  6. More on the "Troubleshooting" page...

Main References of This Chapter

Scan through them before going forth:

Manual Breakpoints with hlt

Halt before buggy code, CPU halts normally:

// src/kernel.c
void
kernel_main(void)
{
    terminal_init();

    asm volatile ( "hlt" );     /** Position A. */

    printf("*** BUG! ***\n");
}

Result -

Halt after buggy code, the bug re-appears/the system crashes:

// src/kernel.c
void
kernel_main(void)
{
    terminal_init();

    printf("*** BUG! ***\n");

    asm volatile ( "hlt" );     /** Position B. */
}

Result -

Now we know that the bug is somewhere between positions A and B.

Note that inline assembly is notoriously fragile - compilers and optimizers may generate erroneous intrusctions if we abuse inline assembly. We use an inline hlt here only for temporary debugging. Under cases where inline assembly is a must, please double check with examples on the OSDev wiki.

Enabling Execution Stack Tracing

Be sure to refresh your knowledge on x86 stacks by scanning through the "Stack" page and the "Systems V ABI-i386 Calling Convention" section in the previous chapter.

Stack Tracing with %ebp

The main role of enabling stack tracing is the frame pointer register %ebp. We generally follow these steps:

  1. Retrive current EBP value into uint32_t *ebp
    • *ebp is the saved old EBP (i.e., caller's EBP value)
    • *(ebp + 1) is the return address, i.e., the EIP value of the jsr instruction of the caller, which is somewhere within the caller's code section
  2. Somehow resolve the caller's function name by that valid EIP value (we will do this in the next section)
  3. Set ebp = *ebp, repeat

Update our Makefile with -fno-omit-frame-pointer flag:

# Add '-fno-omit-frame-pointer' flag.
C_FLAGS=-c -Wall -Wextra -ffreestanding -O2 -std=gnu99 -Wno-tautological-compare \
        -g -fno-omit-frame-pointer

We need a stop sign at the top most stack frame, otherwise the tracing process will run off stack_hi and go into garbage. Put a NULL EBP value at the top most stack frame to serve as the stop sign, @ src/boot/boot.s:

    /** Setup the kernel stack by setting ESP to our 'stack_hi' symbol. */
    movl $stack_hi, %esp

    ...
    
    /** Set EBP to NULL for stack tracing's use. */
    xor %ebp, %ebp
    
    /**
     * Jump to the 'kernel_main' function. According to i386 calling
     * convention, 'kernel_main' will save EBP (== NULL) to the stack.
     */
    call kernel_main

Add the tracing logic @ src/common/debug.c:

/** Print stack tracing to terminal. */
void
stack_trace()
{
    uint32_t *ebp;

    asm volatile ("movl %%ebp, %0" : "=r" (ebp));
    while (ebp != NULL && (uint32_t) (ebp + 1) < USER_MAX) {
                          // `USER_MAX` is the top of user address space stack,
                          // which will be explained in a later chapter.
        /** Resolve caller's function name here. */
        ebp = (uint32_t *) *ebp;
    }
}

Resolving Function Names

Our next task is to resolve the name of the function which contains a certain jsr instruction, given the address of that jsr instruction. With the help of a symbols map (mapping start address to function name), we search for the biggest address smaller than the jsr instruction's address, and the function name it maps to is what we need.

So, where is the symbols map? We can either read that from the loaded kernel ELF directly, or let the linker generate a separate symbols file and load this file. I take the first approach which is a bit easier. A pointer to the symbols map is stored in the multiboot boot information structure (not the multiboot header). We first locate the multiboot info section, then locate the symbols map, then search the map with given address.

Exmaple kernel code utilizing multiboot1 information can be found on GNU GRUB's official website here. This is a document-intensive part. The first step is to include multiboot-related structures @ src/boot/multiboot.h (explanation in comments):

/**
 * Multiboot1 related structures.
 * See https://www.gnu.org/software/grub/manual/multiboot/multiboot.html
 * See https://www.gnu.org/software/grub/manual/multiboot/html_node/Example-OS-code.html
 */


#define MULTIBOOT_HEADER_MAGIC     0x1BADB002
#define MULTIBOOT_BOOTLOADER_MAGIC 0x2BADB002   /** Should be in %eax. */


/**
 * Multiboot1 header.
 * See https://www.gnu.org/software/grub/manual/multiboot/multiboot.html#Header-layout
 */
struct multiboot_header {
    uint32_t magic;     /* Must be 0x1BADB002. */
    uint32_t flags;     /* Feature flags. */
    uint32_t checksum;  /* The above fields + this one must == 0 mod 2^32. */

    /**
     * More optional fields below. Not used now, so omitted here.
     * ...
     */
};
typedef struct multiboot_header multiboot_header_t;


/**
 * The section header table for ELF format. "These indicate where the section
 * header table from an ELF kernel is, the size of each entry, number of
 * entries, and the string table used as the index of names", stated on GRUB
 * multiboot1 specification.
 * See https://www.gnu.org/software/grub/manual/multiboot/multiboot.html#Boot-information-format
 */
struct multiboot_elf_section_header_table {
    uint32_t num;
    uint32_t size;
    uint32_t addr;
    uint32_t shndx;
};
typedef struct multiboot_elf_section_header_table multiboot_elf_section_header_table_t;


/**
 * Multiboot1 information.
 * See https://www.gnu.org/software/grub/manual/multiboot/multiboot.html#Boot-information-format
 */
struct multiboot_info {
    uint32_t flags;         /* Multiboot info version number. */
    uint32_t mem_lower;     /* Available memory from BIOS. */
    uint32_t mem_upper;
    uint32_t boot_device;   /* The "root" partition. */
    uint32_t cmdline;       /* Kernel command line. */
    uint32_t mods_count;    /* Boot-Module list. */
    uint32_t mods_addr;
    
    /** We use ELF, so not including "a.out" format support. */
    multiboot_elf_section_header_table_t elf_sht;

    /**
     * More fields below. Not used now, so omitted here.
     * ...
     */
};
typedef struct multiboot_info multiboot_info_t;

The next step is ELF-related structures. See this ELF document from CMU. Code @ src/boot/elf.h (again, explanation in comments):

/**
 * ELF 32-bit format related structures.
 * See http://www.cs.cmu.edu/afs/cs/academic/class/15213-f00/docs/elf.pdf
 */


/** ELF section header. */
struct elf_section_header {
    uint32_t name;
    uint32_t type;
    uint32_t flags;
    uint32_t addr;
    uint32_t offset;
    uint32_t size;
    uint32_t link;
    uint32_t info;
    uint32_t addralign;
    uint32_t entsize;
} __attribute__((packed));
typedef struct elf_section_header elf_section_header_t;


/** ELF symbol. */
struct elf_symbol {
    uint32_t name;
    uint32_t value;
    uint32_t size;
    uint8_t  info;
    uint8_t  other;
    uint16_t shndx;
} __attribute__((packed));
typedef struct elf_symbol elf_symbol_t;

/**
 * For getting the type of a symbol table entry. Function type code == 0x2.
 * See https://docs.oracle.com/cd/E23824_01/html/819-0690/chapter6-79797.html#chapter6-tbl-21
 */
#define ELF_SYM_TYPE(info) ((info) & 0xf)
#define ELF_SYM_TYPE_FUNC  0x2

Then, we pass two arguments to the kernel_main function, one is a bootloader magic number (this should reside in %eax at booting, just for double checking that the bootloading is correct), and the other is a pointer to the multiboot information region (this should reside in %ebx at booting). Remember the calling convention: push arguments from last to first.

Update code @ src/boot/boot.s:

    /** Setup the kernel stack by setting ESP to our 'stack_hi' symbol. */
    movl $stack_hi, %esp

    /**
     * Other processor state modifications and runtime supports (such as
     * enabling paging) should go here. Make sure your ESP is still 16 Bytes
     * aligned.
     */
    
    /** Set EBP to NULL for stack tracing's use. */
    xor %ebp, %ebp

    /** Pass two arguments to 'kernel_main', last -> first. */
    pushl %ebx  /** Push pointer to multiboot info. */
    pushl %eax  /** Push bootloader magic number. */
    
    /**
     * Jump to the 'kernel_main' function. According to i386 calling
     * convention, 'kernel_main' will save EBP (== NULL) to the stack.
     * Args are:
     *   4-bytes bootloader_magic
     *   4-bytes multiboot_info_addr
     */
    call kernel_main

Now we have a multiboot info structure pointer at hand after entering kernel_main. Let's build an initialization routine which takes the pointer and pulls the symbols table & strings table out of it, @ src/common/debug.c:

/** Initialized at 'debug_init'. */
static elf_symbol_t *elf_symtab;
static size_t elf_symtab_size;

static const char *elf_strtab;
static size_t elf_strtab_size;


/** Look up an address in symbols map and return its function name. */
static const char *
_lookup_symbol_name(uint32_t addr)
{
    size_t symtab_len = elf_symtab_size / sizeof(elf_symbol_t);

    for (size_t i = 0; i < symtab_len; ++i) {
        if ((ELF_SYM_TYPE(elf_symtab[i].info) == ELF_SYM_TYPE_FUNC)
            && (addr >= elf_symtab[i].value)
            && (addr <= elf_symtab[i].value + elf_symtab[i].size)) {
            return elf_strtab + elf_symtab[i].name;
        }
    }

    return "name_unresolved";
}


/**
 * Pull out the symbols table and strings table from a multiboot information
 * structure pointer.
 */
void
debug_init(multiboot_info_t *mbi)
{
    /** Get the section header table as an array of section headers. */
    elf_section_header_t *sht = (elf_section_header_t *) mbi->elf_sht.addr;
    size_t sht_len = (size_t) mbi->elf_sht.num;

    /**
     * The section header at index 'shndx' in the table is a meta section
     * header. It contains a list of section header names in the table. We
     * should look for section header names ".symtab" & ".strtab".
     */
    const char *sh_names = (const char *) sht[mbi->elf_sht.shndx].addr;

    /** Loop through the table and look for ".symtab" & ".strtab". */
    for (size_t i = 0; i < sht_len; ++i) {
        const char *name = sh_names + sht[i].name;

        if (strcmp(name, ".symtab") == 0) {
            elf_symtab = (elf_symbol_t *) sht[i].addr;
            elf_symtab_size = sht[i].size;
        } else if (strcmp(name, ".strtab") == 0) {
            elf_strtab = (const char *) sht[i].addr;
            elf_strtab_size = sht[i].size;
        }
    }
}

Now, to recursively trace the calling stack, we do:

// src/common/debug.c

/** Print stack tracing to terminal. */
void
stack_trace()
{
    uint32_t *ebp;
    unsigned int id = 0;

    asm volatile ( "movl %%ebp, %0" : "=r" (ebp) );
    while (ebp != NULL) {
        uint32_t addr = *(ebp + 1);
        printf(" %2u) [%p] %s\n", id++, addr, _lookup_symbol_name(addr));
        ebp = (uint32_t *) *ebp;
    }
}

Panicking & Assertions

Panicking is widely used when severe errors occur and we do not know how to recover. We print a panicking message, print the calling stack, and then halt the processor.

Code @ src/common/debug.h:

void debug_init(multiboot_info_t *mbi);

void stack_trace();


/** Panicking macro. */
#define panic(fmt, args...) do {                                               \
                                asm volatile ( "cli" );                        \
                                cprintf(VGA_COLOR_MAGENTA, "PANIC: " fmt "\n", \
                                        ##args);                               \
                                stack_trace();                                 \
                                while (1)                                      \
                                  asm volatile ( "hlt" );                      \
                                __builtin_unreachable();                       \
                            } while (0)

Assertion tests a given condition and panics if the condition is not met. A meaningless panicking message will confuse the developer. In order to print the exact position of the failing assertion statement, we use a macro to implement it. See "Standard Predefined Macros" to learn about what __FUNCTION__, __FILE__, and __LINE__ mean (only for GCC). Also, check this stackoverflow post to learn about why we use do { ... } while (0) in this macro.

We will also use explicit error prompting a lot.

Add these @ src/common/debug.h:

/** Assertion macro. */
#define assert(condition)   do {                                              \
                                if (!(condition)) {                           \
                                    panic("assertion failed @ function '%s'," \
                                          " file '%s': line %d",              \
                                          __FUNCTION__, __FILE__, __LINE__);  \
                                }                                             \
                            } while (0)


/** Error prompting macro. */
#define error(fmt, args...) do {                                           \
                                cprintf(VGA_COLOR_RED, "ERROR: " fmt "\n", \
                                        ##args);                           \
                                panic("error occurred @ function '%s',"    \
                                      " file '%s': line %d",               \
                                      __FUNCTION__, __FILE__, __LINE__);   \
                            } while (0)


/** Warning prompting macro. */
#define warn(fmt, args...)  do {                                               \
                                 cprintf(VGA_COLOR_MAGENTA, "WARN: " fmt "\n", \
                                         ##args);                              \
                            } while (0)


/** Info prompting macro. */
#define info(fmt, args...)  do {                                            \
                                 cprintf(VGA_COLOR_CYAN, "INFO: " fmt "\n", \
                                         ##args);                           \
                            } while (0)

Let's try it out! Clean up and restructure our src/kernel.c as the following:

/** The main function that `boot.s` jumps to. */
void
kernel_main(unsigned long magic, unsigned long addr)
{
    /** Initialize VGA text-mode terminal support. */
    terminal_init();

    /** Double check the multiboot magic number. */
    if (magic != MULTIBOOT_BOOTLOADER_MAGIC) {
        error("invalid bootloader magic: %#x", magic);
        return;
    }

    /** Get pointer to multiboot info. */
    multiboot_info_t *mbi = (multiboot_info_t *) addr;

    /** Initialize debugging utilities. */
    debug_init(mbi);

    assert(0);

    printf("This line should not be displayed!");
}

Result -

Stack Canary Protector

Stack smashing protector is a commonly-used technique to provide some detection and protection on function stack overflowing. To enable this feature through gcc, add the following definitions @ src/common/debug.c:

/**
 * Stack smashing protector (stack canary) support.
 * Build with `-fstack-protector` to enable this. Using a static canary
 * value here to maintain simplicity.
 */
#define STACK_CHK_GUARD 0xCF10A8CB

uintptr_t __stack_chk_guard = STACK_CHK_GUARD;

__attribute__((noreturn))
void
__stack_chk_fail(void)
{
    panic("stack smashing detected");
}

We are using a statically defined random number as the check guard. It is not the safest way but maintains simplicity. For more advanced information, see the "Stack Smashing Protector" page.

Do not forget to add a the -fstack-protector compiler option to our Makefile.

Using GDB with QEMU

Separate Symbols File

We know that we can keep debugging symbols information in the target ELF by compiling with -g flag. However, this will make our executable file bigger. A workaround is to compile with -g, strip out the symbols from hux.bin into a separate symbols file hux.sym using objcopy, and load that symbols file when GDB launches. This keeps the executable small meanwhile enabling debugging symbols.

Update our Makefile with:

TARGET_SYM=hux.sym

OBJCOPY=i686-elf-objcopy
OBJDUMP=i686-elf-objdump

# Add '-g'.
C_FLAGS=-c -Wall -Wextra -ffreestanding -O2 -std=gnu99 -Wno-tautological-compare \
        -g -fno-omit-frame-pointer -fstack-protector


# Remember to link 'libgcc'. Embeds the init process binary.
kernel: $(S_OBJECTS) $(C_OBJECTS)
    @echo
    @echo $(HUX_MSG) "Linking kernel image..."
    $(LD) $(LD_FLAGS) -T scripts/kernel.ld -lgcc -o $(TARGET_BIN) \
        -Wl,--oformat,elf32-i386 $(S_OBJECTS) $(C_OBJECTS)
    $(OBJCOPY) --only-keep-debug $(TARGET_BIN) $(TARGET_SYM)
    $(OBJCOPY) --strip-debug $(TARGET_BIN)


#
# Clean the produced files.
#
.PHONY: clean
clean:
    @echo
    @echo $(HUX_MSG) "Cleaning the build..."
    rm -f $(S_OBJECTS) $(C_OBJECTS) $(TARGET_BIN) $(TARGET_ISO) $(TARGET_SYM)

Now a single make produces hux.bin, hux.iso, as well as hux.sym.

$ file hux.bin
hux.bin: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), statically linked, not stripped
$ file hux.sym
hux.sym: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), statically linked, with debug_info, not stripped
$ file hux.iso
hux.iso: DOS/MBR boot sector; GRand Unified Bootloader, ...

QEMU under GDB's Control

Launch QEMU with -S -s will make it pause before executing any code and listen for a GDB connection on port 1234:

# Add to Makefile
.PHONY: qemu_debug
qemu_debug:
    @echo $(HUX_MSG) "Launching QEMU (debug mode)..."
    qemu-system-i386 -vga std -S -s -cdrom $(TARGET_ISO)

# In your shell
$ make qemu_debug

Now, launch GDB, connect to the port that QEMU is listening to (localhost:1234 if using gdb locally; I'm using GDB remotely (from a Vagrant VM), so mine is 10.0.2.2:1234 here), load the symbols file, and start debugging as normal:

$ gdb
(gdb) target remote localhost:1234
(gdb) symbol-file hux.sym

To avoid typing the two commands every time we start GDB, let's put them together into an init script, @ scripts/gdb_init:

target remote localhost:1234    # Mine is `10.0.2.2:1234` here as explained above.
symbol-file hux.sym

Update our Makefile with:

.PHONY: gdb
gdb:
    @echo $(HUX_MSG) "Launching GDB..."
    gdb -x scripts/gdb_init     # Load initialization script.

Then, the whole debugging process is simplified to:

# In one shell
$ make
$ make qemu_debug

# In another shell
$ make gdb

For instructions on how to use the GDB debugger, check GNU's official manual (and a brief cheatsheet I made before, if you find it useful).

Testing & Unit Testing

We have already decided to develop and test Hux with hardware emulators such as QEMU instead of on physical machines. The No.1 goal of Hux is simplicity, not compatibility across existing hardware platforms. Pros & Cons of using a virtual machine for testing have been clearly stated in the references. Moreover, we still have a long way to go until Hux can be actually deployed on physical machines.

Another thing worth mentioning is unit testing. While unit testing is a really good practice for user-level software developing, it is usually very hard and tedious for kernel development in C. (As far as I know, Philipp has achieved this with Rust.)

Making a comprehensive testbed is a long-term goal. I will try adding some unit-testing modules tests/ folder. Note that the unit tests will have a very small coverage, and some essential functionalities like I/O port communication & interrupt handling are probably not unit-testable.

Progress So Far

In summary, to locate the piece of source code that produces a certain bug, I try the following in order:

  1. See if proper assertion failure message and calling stack is printed.
  2. Use printf and panic to probe the buggy code and fix it.
  3. Launch external GDB debugger with QEMU and debug with breakpoints and other provided information such as register values at breakpoints.
  4. (The problem is low-level and severe,) Manually breaking with hlt instructions.
  5. (On triple faults, when the machine keeps rebooting, ) Add -no-shutdown -no-reboot -d int flags to the QEMU command to let it halt on triple faults and print the log of interrupts.

Current repo structure:

hux-kernel
├── Makefile
├── scripts
│   ├── gdb_init
│   ├── grub.cfg
│   └── kernel.ld
├── src
│   ├── boot
│   │   ├── boot.s
│   │   ├── elf.h
│   │   └── multiboot.h
│   ├── common
│   │   ├── debug.c
│   │   ├── debug.h
│   │   ├── port.c
│   │   ├── port.h
│   │   ├── printf.c
│   │   ├── printf.h
│   │   ├── string.c
│   │   ├── string.h
│   │   ├── types.c
│   │   └── types.h
│   ├── display
│   │   ├── terminal.c
│   │   ├── terminal.h
│   │   └── vga.h
│   └── kernel.c