14. User Mode Execution - josehu07/hux-kernel GitHub Wiki

Processes are meaningless if we do not have a clear separation between user privilege and kernel privilege. In an operating system, a user process normally runs in user mode with its own virtual address space (using virtual addresses). When it requires a privileged operation, e.g., printing to the terminal or accessing an external device, it invokes one of a restricted set of interfaces named system calls (syscalls, software interrupts) to trap into the kernel and let the kernel perform the privileged operation on its behalf. Different user processes are thus isolated from each other (and from the kernel, of course).

Our kernel needs to clearly define the address space layout of a user process. It must properly set up the page table, pre-map necessary pages, and load the ELF binary at the time of process creation. The kernel also needs to set up a mechanism for crossing the protection boundary (trapping from user mode into kernel mode, and returning from kernel mode back to user mode) to kick off a process in user mode and to allow it to make system calls.

Main References of This Chapter

Scan through them before going forth:

Paging of User Processes

The example init process we wrote in the last chapter is not actually a user process yet. We were just switching ESP to its kernel stack and subsequently setting EIP directly to the embedded binary image. The CPU was still running in kernel privilege - not exactly what we want.

The first step towards enabling user mode execution is to set up paging for user processes.

User Virtual Address Space Layout

We define the virtual address space layout for each user process as follows ✭:

  • Each process has an address space of size 1GiB, so the valid virtual addresses a process could issue range from 0x00000000 to 0x40000000 (USER_MAX)
  • The kernel's virtual address space of size 512MiB is mapped to the bottom-half pages 0x00000000 to 0x20000000 (USER_BASE)
  • The ELF binary (.text + .data + .bss sections) starts from 0x20000000 (USER_BASE) and could take up to 1MiB until 0x20100000 (HEAP_BASE)
  • The region above HEAP_BASE is usable by the process heap, which grows upwards
  • The stack begins at the top-most page (the one below USER_MAX) and grows downwards

Define the sizes @ src/process/layout.h:

/** Virtual address space size: 1GiB. */
#define USER_MAX 0x40000000

/**
 * The lower-half maps the kernel for simplicity (in contrast to
 * typical higher-half design). Application uses the higher-half
 * starting from this address.
 */
#define USER_BASE 0x20000000

/**
 * Hux allows user executable to take up at most 1MiB space, starting
 * at USER_BASE and ending no higher than HEAP_BASE.
 */
#define HEAP_BASE (USER_BASE + 0x00100000)

The "lower-half" design of Hux limits the maximum supported physical memory size to 512MiB and has several other drawbacks. It also makes the application binary interface (ABI) more fragile, because user programs must be linked with their text section at USER_BASE instead of the more obvious 0x00000000. The benefit is simplicity compared to a "higher-half" kernel: we do not have to worry about any +/- KERNEL_BASE translations in kernel code.
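To make the layout concrete, here is a minimal user-space sketch (not part of Hux) that classifies a virtual address into the regions listed above. The `region_t` labels and `classify_vaddr()` are hypothetical names introduced only for illustration; the three constants repeat the values from layout.h.

```c
#include <stdint.h>

/** Same values as in layout.h. */
#define USER_MAX  0x40000000u
#define USER_BASE 0x20000000u
#define HEAP_BASE (USER_BASE + 0x00100000u)

/** Hypothetical region labels, for illustration only. */
typedef enum {
    REGION_KERNEL,      /** [0, USER_BASE): kernel-mapped lower half. */
    REGION_ELF,         /** [USER_BASE, HEAP_BASE): program image. */
    REGION_HEAP_STACK,  /** [HEAP_BASE, USER_MAX): heap up, stack down. */
    REGION_INVALID      /** At or beyond USER_MAX. */
} region_t;

/** Tell which region of the process address space `vaddr` falls in. */
static region_t
classify_vaddr(uint32_t vaddr)
{
    if (vaddr < USER_BASE)
        return REGION_KERNEL;
    if (vaddr < HEAP_BASE)
        return REGION_ELF;
    if (vaddr < USER_MAX)
        return REGION_HEAP_STACK;
    return REGION_INVALID;
}
```

For example, `classify_vaddr(0x00600000)` lands in the kernel-mapped region, while `classify_vaddr(USER_MAX - 4)` lands in the heap/stack region.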

More Page Table Helpers

We will add a few page table allocation and page mapping routines. Let's begin by modifying the page table walking function a bit to let it use the salloc_page() SLAB allocation once the heap allocators have been set up.

Changes to code @ src/memory/paging.c:

/**
 * Helper that allocates a level-2 table. Returns NULL if running out of
 * kernel heap.
 */
static pte_t *
_paging_alloc_pgtab(pde_t *pde, bool boot)
{
    pte_t *pgtab = NULL;
    if (boot)
        pgtab = (pte_t *) _kalloc_temp(sizeof(pte_t) * PTES_PER_PAGE, true);
    else
        pgtab = (pte_t *) salloc_page();
    if (pgtab == NULL)
        return NULL;

    memset(pgtab, 0, sizeof(pte_t) * PTES_PER_PAGE);

    pde->present = 1;
    pde->writable = 1;
    pde->user = 1;      /** Just allow user access on all PDEs. */
    pde->frame = ADDR_PAGE_NUMBER((uint32_t) pgtab);

    return pgtab;
}

static pte_t *
paging_alloc_pgtab(pde_t *pde)
{
    return _paging_alloc_pgtab(pde, false);
}

static pte_t *
paging_alloc_pgtab_at_boot(pde_t *pde)
{
    return _paging_alloc_pgtab(pde, true);
}

/**
 * Walk a 2-level page table for a virtual address to locate its PTE.
 * If `alloc` is true, then when a level-2 table is needed but not
 * allocated yet, will perform the allocation.
 */
static pte_t *
_paging_walk_pgdir(pde_t *pgdir, uint32_t vaddr, bool alloc, bool boot)
{
    size_t pde_idx = ADDR_PDE_INDEX(vaddr);
    size_t pte_idx = ADDR_PTE_INDEX(vaddr);

    /** If already has the level-2 table, return the correct PTE. */
    if (pgdir[pde_idx].present != 0) {
        pte_t *pgtab = (pte_t *) ENTRY_FRAME_ADDR(pgdir[pde_idx]);
        return &pgtab[pte_idx];
    }

    /**
     * Else, the level-2 table is not allocated yet. Do the allocation if
     * the alloc argument is set, otherwise return a NULL.
     */
    if (!alloc)
        return NULL;

    pte_t *pgtab = boot ? paging_alloc_pgtab_at_boot(&pgdir[pde_idx])
                        : paging_alloc_pgtab(&pgdir[pde_idx]);
    if (pgtab == NULL) {
        warn("walk_pgdir: cannot alloc pgtab, out of kheap memory?");
        return NULL;
    }

    return &pgtab[pte_idx];
}

pte_t *
paging_walk_pgdir(pde_t *pgdir, uint32_t vaddr, bool alloc)
{
    return _paging_walk_pgdir(pgdir, vaddr, alloc, false);
}

pte_t *
paging_walk_pgdir_at_boot(pde_t *pgdir, uint32_t vaddr, bool alloc)
{
    return _paging_walk_pgdir(pgdir, vaddr, alloc, true);
}
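The walk relies on splitting a 32-bit virtual address into a PDE index, a PTE index, and a page offset. As a quick sanity check, the sketch below redefines the index macros in the way they are conventionally written for x86 2-level paging (10 + 10 + 12 bits). These definitions are an assumption for illustration; the real ones come from Hux's paging.h in an earlier chapter.

```c
#include <stdint.h>

/**
 * Assumed definitions, following the usual x86 split of a 32-bit
 * virtual address: top 10 bits index the page directory, next 10 bits
 * index the level-2 table, low 12 bits are the offset within the page.
 */
#define ADDR_PDE_INDEX(addr)   (((uint32_t) (addr)) >> 22)
#define ADDR_PTE_INDEX(addr)   ((((uint32_t) (addr)) >> 12) & 0x3FFu)
#define ADDR_PAGE_OFFSET(addr) (((uint32_t) (addr)) & 0xFFFu)

/** Rebuild an address from its three parts, to check the split. */
static uint32_t
addr_from_indices(uint32_t pde_idx, uint32_t pte_idx, uint32_t offset)
{
    return (pde_idx << 22) | (pte_idx << 12) | offset;
}
```

With these definitions, USER_BASE (0x20000000) maps to PDE index 128 and PTE index 0, and HEAP_BASE (0x20100000) maps to PDE index 128 and PTE index 256, so the whole 1MiB ELF region sits inside a single level-2 table. The top stack page at 0x3FFFF000 uses PDE index 255 and PTE index 1023.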

Next, we add some helper functions for freeing page tables, mapping a user page to a physical frame, mapping a kernel page into a user page table, and unmapping user pages. Code @ src/memory/paging.c:

/** Dealloc all the kernel heap pages used in a user page directory. */
void
paging_destroy_pgdir(pde_t *pgdir)
{
    for (size_t pde_idx = 0; pde_idx < PDES_PER_PAGE; ++pde_idx) {
        if (pgdir[pde_idx].present == 1) {
            pte_t *pgtab = (pte_t *) ENTRY_FRAME_ADDR(pgdir[pde_idx]);
            sfree_page(pgtab);
        }
    }

    /** Free the level-1 directory as well. */
    sfree_page(pgdir);
}


/**
 * Find a free frame and map a user page (given by a pointer to its PTE)
 * into physical memory. Returns the physical address allocated, or 0 if
 * memory allocation failed.
 */
uint32_t
paging_map_upage(pte_t *pte, bool writable)
{
    if (pte->present == 1) {
        error("map_upage: page re-mapping detected");
        return 0;
    }

    uint32_t frame_num = frame_bitmap_alloc();
    if (frame_num == NUM_FRAMES)
        return 0;

    pte->present = 1;
    pte->writable = writable ? 1 : 0;
    pte->user = 1;
    pte->frame = frame_num;

    return ENTRY_FRAME_ADDR(*pte);
}

/** Map a lower-half kernel page to the user PTE. */
void
paging_map_kpage(pte_t *pte, uint32_t paddr)
{
    if (pte->present == 1) {
        error("map_kpage: page re-mapping detected");
        return;
    }

    uint32_t frame_num = ADDR_PAGE_NUMBER(paddr);

    pte->present = 1;
    pte->writable = 0;
    pte->user = 0;      /** User cannot access kernel-mapped pages. */
    pte->frame = frame_num;
}

/**
 * Unmap all the mapped pages within a virtual address range in a user
 * page directory. Avoids calling `walk_pgdir()` repeatedly.
 */
void
paging_unmap_range(pde_t *pgdir, uint32_t va_start, uint32_t va_end)
{
    size_t pde_idx = ADDR_PDE_INDEX(va_start);
    size_t pte_idx = ADDR_PTE_INDEX(va_start);
    
    size_t pde_end = ADDR_PDE_INDEX(ADDR_PAGE_ROUND_UP(va_end));
    size_t pte_end = ADDR_PTE_INDEX(ADDR_PAGE_ROUND_UP(va_end));

    pte_t *pgtab = (pte_t *) ENTRY_FRAME_ADDR(pgdir[pde_idx]);

    while (pde_idx < pde_end
           || (pde_idx == pde_end && pte_idx < pte_end)) {
        /**
         * If end of current level-2 table, or current level-2 table not
         * allocated, go to the next PDE.
         */
        if (pte_idx >= PTES_PER_PAGE || pgdir[pde_idx].present == 0) {
            pde_idx++;
            pte_idx = 0;
            pgtab = (pte_t *) ENTRY_FRAME_ADDR(pgdir[pde_idx]);
            continue;
        }

        if (pgtab[pte_idx].present == 1) {
            frame_bitmap_clear(pgtab[pte_idx].frame);
            pgtab[pte_idx].present = 0;
            pgtab[pte_idx].writable = 0;
            pgtab[pte_idx].frame = 0;
        }

        pte_idx++;
    }
}

Don't forget the declarations @ src/memory/paging.h:

pte_t *paging_walk_pgdir(pde_t *pgdir, uint32_t vaddr, bool alloc);
pte_t *paging_walk_pgdir_at_boot(pde_t *pgdir, uint32_t vaddr, bool alloc);
void paging_destroy_pgdir(pde_t *pgdir);

uint32_t paging_map_upage(pte_t *pte, bool writable);
void paging_map_kpage(pte_t *pte, uint32_t paddr);
void paging_unmap_range(pde_t *pgdir, uint32_t va_start, uint32_t va_end);
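The index walk in paging_unmap_range() is easy to get wrong at level-2 table boundaries, so it can help to try the iteration in user space first. The sketch below only counts the pages a range would visit and touches no page tables. `count_pages_in_range()` is a hypothetical helper, and its loop condition compares the PDE index first so that a range crossing into the next level-2 table is still fully covered.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE     4096u
#define PTES_PER_PAGE 1024u

/**
 * Count the pages in [va_start, va_end) using the same (pde, pte) index
 * walk as an unmapping loop would. `va_end` is rounded up to a page
 * boundary first. Assumes the range stays below 4GiB - 4KiB.
 */
static size_t
count_pages_in_range(uint32_t va_start, uint32_t va_end)
{
    uint32_t end_up = (va_end + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1);

    size_t pde_idx = va_start >> 22;
    size_t pte_idx = (va_start >> 12) & 0x3FFu;
    size_t pde_end = end_up >> 22;
    size_t pte_end = (end_up >> 12) & 0x3FFu;

    size_t visited = 0;
    while (pde_idx < pde_end
           || (pde_idx == pde_end && pte_idx < pte_end)) {
        /** End of the current level-2 table: move to the next PDE. */
        if (pte_idx >= PTES_PER_PAGE) {
            pde_idx++;
            pte_idx = 0;
            continue;
        }
        visited++;      /** A real unmap would clear the PTE here. */
        pte_idx++;
    }
    return visited;
}
```

A range from 0x203FF000 to 0x20401000 spans the last page of one level-2 table and the first page of the next, and the walk visits both.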

Process Page Table Setup

At process creation, we now need to allocate space for its page tables on the kernel heap and pre-map the necessary pages: the kernel pages, the program ELF binary (which needs to be loaded as well), and the first stack page.

The initproc_init() routine @ src/process/process.c is now:

/**
 * Initialize the `init` process - put it in READY state in the process
 * table so the scheduler can pick it up.
 */
void
initproc_init(void)
{
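    /**
     * Get the embedded binary of `init.s`. The linker exports the image
     * size as the *address* of the `_size` symbol.
     */
    extern char _binary___src_process_init_start[];
    extern char _binary___src_process_init_size[];
    char *elf_curr = (char *) _binary___src_process_init_start;
    char *elf_end  = elf_curr + (uint32_t) _binary___src_process_init_size;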

    /** Get a slot in the ptable. */
    process_t *proc = _alloc_new_process();
    assert(proc != NULL);
    strncpy(proc->name, "init", sizeof(proc->name) - 1);

    /**
     * Set up page tables and pre-map necessary pages:
     *   - kernel identity-mapped in the lower half, up to PHYS_MAX
     *   - program ELF binary follows
     *   - top-most stack page
     */
    proc->pgdir = (pde_t *) salloc_page();
    assert(proc->pgdir != NULL);
    memset(proc->pgdir, 0, sizeof(pde_t) * PDES_PER_PAGE);

    uint32_t vaddr_btm = 0;                     /** Kernel-mapped. */
    while (vaddr_btm < PHYS_MAX) {
        pte_t *pte = paging_walk_pgdir(proc->pgdir, vaddr_btm, true);
        assert(pte != NULL);
        paging_map_kpage(pte, vaddr_btm);

        vaddr_btm += PAGE_SIZE;
    }
    
    uint32_t vaddr_elf = USER_BASE;             /** ELF binary. */
    while (elf_curr < elf_end) {
        pte_t *pte = paging_walk_pgdir(proc->pgdir, vaddr_elf, true);
        assert(pte != NULL);
        uint32_t paddr = paging_map_upage(pte, true);
        assert(paddr != 0);
        
        /** Copy ELF content in. */
        memcpy((char *) paddr, elf_curr,
            elf_curr + PAGE_SIZE > elf_end ? elf_end - elf_curr : PAGE_SIZE);

        vaddr_elf += PAGE_SIZE;
        elf_curr += PAGE_SIZE;
    }

    while (vaddr_elf < HEAP_BASE) {             /** Rest of ELF region. */
        pte_t *pte = paging_walk_pgdir(proc->pgdir, vaddr_elf, true);
        assert(pte != NULL);
        uint32_t paddr = paging_map_upage(pte, true);
        assert(paddr != 0);

        vaddr_elf += PAGE_SIZE;
    }
    
    uint32_t vaddr_top = USER_MAX - PAGE_SIZE;  /** Top stack page. */
    pte_t *pte_top = paging_walk_pgdir(proc->pgdir, vaddr_top, true);
    assert(pte_top != NULL);
    uint32_t paddr_top = paging_map_upage(pte_top, true);
    assert(paddr_top != 0);
    memset((char *) paddr_top, 0, PAGE_SIZE);

    /** Set up the trap state for returning to user mode. */
    // We will fill this in later...

    proc->stack_low = vaddr_top;
    proc->heap_high = HEAP_BASE;

    /** Set process state to READY so the scheduler can pick it up. */
    proc->state = READY;
}
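The ELF copy loop above moves the image one page at a time, and only the last page may be partially filled. The following user-space sketch reproduces that chunking logic with plain arrays standing in for physical frames. `copy_image_by_pages()` is a hypothetical helper, and zeroing each frame before the copy is a choice made for this sketch, not something the Hux loop does.

```c
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096u

/**
 * Copy `len` bytes from `src` into consecutive page-sized buffers (the
 * stand-ins for freshly mapped frames). Returns the number of pages
 * used, which is at most `max_frames`.
 */
static size_t
copy_image_by_pages(unsigned char frames[][PAGE_SIZE], size_t max_frames,
                    const unsigned char *src, size_t len)
{
    const unsigned char *curr = src;
    const unsigned char *end = src + len;
    size_t used = 0;

    while (curr < end && used < max_frames) {
        size_t left = (size_t) (end - curr);
        size_t chunk = left < PAGE_SIZE ? left : PAGE_SIZE;

        memset(frames[used], 0, PAGE_SIZE);     /** Zero the tail. */
        memcpy(frames[used], curr, chunk);

        curr += chunk;
        used++;
    }
    return used;
}
```

A 5000-byte image, for instance, fills the first frame completely and leaves 904 bytes in the second one.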

Enabling System Calls

This section adds more to the mechanism of crossing the protection boundary, which enables switching back & forth between user mode and kernel mode execution, enabling system calls.

System Call Trap Gate

To enable system calls, we need to pick an ISR number and register it as the system call trap gate. Recall that we have set up 48 interrupt gates: ISR # 0 - 31 for CPU-reserved exceptions, and ISR # 32 - 47 for hardware interrupt requests IRQ # 0 - 15. We pick gate # 64 i.e. 0x40 as the syscall trap gate in Hux ✭.

A complete list of ISR numbers known to the system @ src/interrupt/isr.h:

/**
 * List of known interrupt numbers in this system. Other parts of the kernel
 * should refer to these macro names instead of using plain numbers.
 *   - 0 - 31 are ISRs for CPU-generated exceptions, processor-defined,
 *     see https://wiki.osdev.org/Interrupt_Vector_Table
 *   - 32 - 47 are mapped as custom device IRQs, so ISR 32 means IRQ 0, etc.
 *   - 64 i.e. 0x40 is chosen as our syscall trap gate
 */
#define INT_NO_DIV_BY_ZERO      0   /** Divide by zero. */
//                              1   /** Reserved. */
#define INT_NO_NMI              2   /** Non maskable interrupt (NMI). */
#define INT_NO_BREAKPOINT       3   /** Breakpoint. */
#define INT_NO_OVERFLOW         4   /** Overflow. */
#define INT_NO_BOUNDS           5   /** Bounds range exceeded. */
#define INT_NO_ILLEGAL_OP       6   /** Illegal opcode. */
#define INT_NO_DEVICE_NA        7   /** Device not available. */
#define INT_NO_DOUBLE_FAULT     8   /** Double fault. */
//                              9   /** No longer used. */
#define INT_NO_INVALID_TSS      10  /** Invalid task state segment (TSS). */
#define INT_NO_SEGMENT_NP       11  /** Segment not present. */
#define INT_NO_STACK_SEG        12  /** Stack segment fault. */
#define INT_NO_PROTECTION       13  /** General protection fault. */
#define INT_NO_PAGE_FAULT       14  /** Page fault. */
//                              15  /** Reserved. */
#define INT_NO_FPU_ERROR        16  /** Floating-point unit (FPU) error. */
#define INT_NO_ALIGNMENT        17  /** Alignment check. */
#define INT_NO_MACHINE          18  /** Machine check. */
#define INT_NO_SIMD_FP          19  /** SIMD floating-point error. */
//                         20 - 31  /** Reserved. */

#define IRQ_BASE_NO     32
#define INT_NO_TIMER    (IRQ_BASE_NO + 0)
#define INT_NO_KEYBOARD (IRQ_BASE_NO + 1)

/** INT_NO_SYSCALL is 64, defined in `syscall.h`. */

The syscall trap gate has a different flag field 0xEF from the other interrupt gates (check the comments in IDT code for detailed explanations):

  • It expects the caller to only have user privilege, so the DPL field of the flags should be set to the users' ring level: 3
  • The type of this gate is a trap gate instead of an interrupt gate: interrupt gates disable interrupts automatically upon entry and re-enable interrupts once returning with iret, while trap gates do not disable interrupts for us; for some syscalls, it is not necessary to disable interrupts
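To see where 0xEF comes from, the flag byte of an IDT gate can be assembled from its fields: the present bit (bit 7), the DPL (bits 6-5), a zero bit (bit 4), and the 4-bit gate type (bits 3-0). The helper below is a hypothetical sketch for illustration and is not part of the Hux IDT code. The type values 0xE and 0xF are the x86 encodings for 32-bit interrupt gates and 32-bit trap gates respectively.

```c
#include <stdint.h>

/** Gate type encodings for 32-bit protected mode. */
#define GATE_TYPE_INTERRUPT32 0xEu
#define GATE_TYPE_TRAP32      0xFu

/**
 * Hypothetical helper: build the flag byte of an IDT gate as
 * P (bit 7) | DPL (bits 6-5) | 0 (bit 4) | type (bits 3-0).
 */
static uint8_t
idt_gate_flags(unsigned present, unsigned dpl, unsigned type)
{
    return (uint8_t) (((present & 0x1u) << 7)
                      | ((dpl & 0x3u) << 5)
                      | (type & 0xFu));
}
```

A present ring-3 trap gate gives `idt_gate_flags(1, 3, GATE_TYPE_TRAP32) == 0xEF`, while a present ring-0 interrupt gate gives 0x8E.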

Register the syscall trap gate at IDT loading @ src/interrupt/idt.c:

/** Extern the syscall trap gate handler. */
extern void syscall_handler(void);


void
idt_init()
{
    ...

    // Add this to the long list of `idt_set_gate()` calls.
    /**
     * Register user syscall trap gate. The flag here is different in
     * two fields:
     *   - DPL: user process is in privilege ring 3 instead of 0
     *   - Type: syscall gate is normally registered as a "trap gate"
     *           instead of "interrupt gate"; trap gates do not disable
     *           interrupts automatically upon entry
     */
    idt_set_gate(INT_NO_SYSCALL, (uint32_t) syscall_handler,
                 SEGMENT_KCODE << 3, 0xEF);

    ...
}

Also add a handler stub for this gate number @ src/interrupt/isr-stub.s:

/**
 * The wrapper for the syscall trap handler. Calls the centralized ISR
 * handler stub as well.
 */
.global syscall_handler
.type syscall_handler, @function
syscall_handler:
    pushl $0
    pushl $64
    jmp isr_handler_stub

Task State Segment (TSS)

On x86 architectures, the concept of tasks makes more sense if we are utilizing hardware multitasking (hardware-aided context switches). A task state segment (TSS) is a segment to be registered in GDT that holds the information of a task context.

Hux only adopts software multitasking, yet it must still set up one TSS per CPU and refresh it whenever a process is about to enter user mode execution. On a boundary cross from user mode into kernel mode (e.g., a system call), the CPU automatically uses the information stored in this TSS (essentially the SS0 & ESP0 values) to switch to the process's kernel stack.

Define the format of x86 32bit task state @ src/interrupt/syscall.h:

/** Syscall trap gate registered at a vacant ISR number. */
#define INT_NO_SYSCALL 64   /** == 0x40 */


/**
 * Task state segment (TSS) x86 IA32 format,
 * see https://wiki.osdev.org/Task_State_Segment#x86_Structure.
 */
struct task_state_segment {
    uint32_t link;      /** Old TS selector. */
    uint32_t esp0;      /** Stack pointer after privilege level boost. */
    uint16_t ss0;       /** Segment selector after privilege level boost. */
    uint16_t pad1;      /** Selectors are 16-bit, padded to 32-bit slots. */
    uint32_t *esp1;
    uint16_t ss1;
    uint16_t pad2;
    uint32_t *esp2;
    uint16_t ss2;
    uint16_t pad3;
    uint32_t cr3;       /** Page directory base address. */
    uint32_t *eip;      /** Saved EIP from last task switch. Same for below. */
    uint32_t eflags;
    uint32_t eax;
    uint32_t ecx;
    uint32_t edx;
    uint32_t ebx;
    uint32_t *esp;
    uint32_t *ebp;
    uint32_t esi;
    uint32_t edi;
    uint16_t es;
    uint16_t pad4;
    uint16_t cs;
    uint16_t pad5;
    uint16_t ss;
    uint16_t pad6;
    uint16_t ds;
    uint16_t pad7;
    uint16_t fs;
    uint16_t pad8;
    uint16_t gs;
    uint16_t pad9;
    uint16_t ldt;
    uint16_t pad10;
    uint16_t pad11;
    uint16_t iopb;      /** I/O map base address. */
} __attribute__((packed));  /** 104 bytes in total on IA32. */
typedef struct task_state_segment tss_t;

Add a new field task_state to cpu_state that holds the task state of the currently running process.

// src/process/scheduler.h

/** Per-CPU state (we only have a single CPU). */
struct cpu_state {
    /** No ID field because only supporting single CPU. */
    process_context_t *scheduler;   /** CPU scheduler context. */
    process_t *running_proc;        /** The process running or NULL. */
    tss_t task_state;               /** Current process task state. */
};
typedef struct cpu_state cpu_state_t;

The 6th segment of our GDT finally makes sense - it describes a segment that holds the current running process's task state. This means, before any context switch to a user process, we need to reload this entry with the task state segment of the process we are going to switch to.

Make a TSS switch routine @ src/memory/gdt.c:

/**
 * Set up TSS for a process to be switched, so that the CPU will be able
 * to jump to its kernel stack when a system call happens.
 * Check out https://wiki.osdev.org/Task_State_Segment for details.
 */
void
gdt_switch_tss(tss_t *tss, process_t *proc)
{
    assert(proc != NULL);
    assert(proc->pgdir != NULL);
    assert(proc->kstack != 0);

    /**
     * Task state segment (TSS) has:
     *
     * Access Byte -
     *   - Pr    = 1: present
     *   - Privl = 0: kernel privilege
     *   - S     = 0: it is a system segment
     *   - Type  = 0x9 (bits 1001): available 32-bit TSS; for a system
     *             segment the low four bits form a type field rather
     *             than the Ex/DC/RW/Ac flags of code/data segments
     *   Hence, 0x89.
     */
    gdt_set_entry(5, (uint32_t) tss, (uint32_t) (sizeof(tss_t) - 1),
                  0x89, 0x00);

    /** Fill in task state information. */
    tss->ss0 = SEGMENT_KDATA << 3;              /** Kernel data segment. */
    tss->esp0 = proc->kstack + KSTACK_SIZE;     /** Top of kernel stack. */
    tss->iopb = sizeof(tss_t);  /** Forbids e.g. inb/outb from user space. */
    tss->ebp = 0;   /** Ensure EBP is 0 on switch, for stack backtracing. */

    /**
     * Load task segment register. Segment selectors need to be shifted
     * to the left by 3, because the lower 3 bits are TI & RPL flags.
     */
    uint16_t tss_seg_reg = SEGMENT_TSS << 3;
    asm volatile ( "ltr %0" : : "r" (tss_seg_reg) );
}


// src/memory/gdt.h

void gdt_switch_tss(tss_t *tss, process_t *proc);
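The `<< 3` shifts used when building segment selectors follow from the selector format: bits 15-3 hold the GDT index, bit 2 is the table indicator (TI, 0 for GDT), and bits 1-0 are the requested privilege level (RPL). The sketch below encodes and decodes selectors with hypothetical helper names. Index 5 for the TSS comes from the gdt_set_entry() call above, while the other indices in the note after the code are assumptions for illustration.

```c
#include <stdint.h>

/** Build a selector from GDT/LDT index, table indicator, and RPL. */
static uint16_t
make_selector(unsigned index, unsigned ti, unsigned rpl)
{
    return (uint16_t) ((index << 3) | ((ti & 0x1u) << 2) | (rpl & 0x3u));
}

/** Extract the descriptor table index from a selector. */
static unsigned
selector_index(uint16_t selector)
{
    return selector >> 3;
}

/** Extract the requested privilege level from a selector. */
static unsigned
selector_rpl(uint16_t selector)
{
    return selector & 0x3u;
}
```

The TSS selector at GDT index 5 with RPL 0 is `make_selector(5, 0, 0) == 0x28`. If the user code segment were at index 3, its ring-3 selector would be `make_selector(3, 0, 3) == 0x1B`.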

We will talk about the actual implementation of system call handlers in the next chapter.

Jumping Into User Mode

It seems like quite a lot to take in already, and yet we are still one step away from the "user world"! The last missing step is how to make the processor jump into user mode privilege and start executing the code section of a user program. This involves a neat trick: carefully placing something on the new process's kernel stack.

Recall that in the centralized interrupt handler stub isr_handler_stub in isr-stub.s, it saves the state of an interrupt on stack, calls the isr_handler() function, and then restores that state and returns from trap. The trick we will use here is to "fake" an interrupt state on the process kernel stack, set the state's segments information with DPL_USER flag (ring-3 user privilege), and make the process do a return-from-trap ✭.

Expose the return part of the interrupt handler @ src/interrupt/isr-stub.s:

isr_handler_stub:

    ...

    /** == Calls the ISR handler. == **/
    call isr_handler
    /** == ISR handler finishes.  == **/

    addl $4, %esp   /** Cleans up the pointer argument. */

/** Return falls through to the `return_from_trap` snippet below. */
.global return_from_trap
return_from_trap:

    /** Restore previous segment descriptor. */
    popl %eax
    movw %ax, %ds
    movw %ax, %es
    movw %ax, %fs
    movw %ax, %gs

    /** Restores EDI, ESI, EBP, ESP, EBX, EDX, ECX, EAX. */
    popal

    addl $8, %esp   /** Cleans up error code and ISR number. */

    iret            /** This pops EIP, CS, EFLAGS, User's ESP, SS. */

In the last chapter, we directly put a process_context_t struct on the kernel stack with EIP pointing to the embedded init program binary. Let's still assume that we have that binary embedded, since we do not have file system support yet to load an image from persistent storage. However, instead of jumping right into that embedded binary location (which certainly won't work if we set up user mode execution correctly, since that address belongs to the kernel), we now push the context right below the trap state and jump to the return_from_trap snippet. In this way, the snippet pops what's on the stack right now (our faked interrupt state with stored EIP pointing to the ELF text section address 0x20000000) and starts executing that user program with controlled privilege.

Modifications to code @ src/process/process.c:

/**
 * Find an UNUSED slot in the ptable and put it into INITIAL state. If
 * all slots are in use, return NULL.
 */
static process_t *
_alloc_new_process(void)
{
    ...

    /** Allocate kernel stack. */
    proc->kstack = salloc_page();
    if (proc->kstack == 0) {
        warn("new_process: failed to allocate kernel stack page");
        return NULL;
    }
    uint32_t sp = proc->kstack + KSTACK_SIZE;

    /** Make proper setups for the new process. */
    proc->state = INITIAL;
    proc->pid = next_pid++;

    /**
     * Leave room for the trap state. The initial context will be pushed
     * right below this trap state, with return address EIP pointing to
     * `return_from_trap` (the return part of `isr_handler_stub`). In this
     * way, the new process, after context switched to by the scheduler,
     * automatically jumps into user mode execution. 
     */
    sp -= sizeof(interrupt_state_t);
    proc->trap_state = (interrupt_state_t *) sp;
    memset(proc->trap_state, 0, sizeof(interrupt_state_t));

    sp -= sizeof(process_context_t);
    proc->context = (process_context_t *) sp;
    memset(proc->context, 0, sizeof(process_context_t));
    proc->context->eip = (uint32_t) return_from_trap;

    return proc;
}


/**
 * Initialize the `init` process - put it in READY state in the process
 * table so the scheduler can pick it up.
 */
void
initproc_init(void)
{
    ...

    /** Set up the trap state for returning to user mode. */
    proc->trap_state->cs = (SEGMENT_UCODE << 3) | 0x3;  /** DPL_USER. */
    proc->trap_state->ds = (SEGMENT_UDATA << 3) | 0x3;  /** DPL_USER. */
    proc->trap_state->ss = proc->trap_state->ds;
    proc->trap_state->eflags = 0x00000202;      /** Interrupt enable. */
    proc->trap_state->esp = USER_MAX - 4;   /** GCC might push an FP. */
    proc->trap_state->eip = USER_BASE;   /** Beginning of ELF binary. */

    proc->stack_low = vaddr_top;
    proc->heap_high = HEAP_BASE;

    /** Set process state to READY so the scheduler can pick it up. */
    proc->state = READY;
}


// src/process/process.h

struct process {
    ...
    interrupt_state_t *trap_state;  /** Trap state used at creation. */
};
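The arithmetic on `sp` in _alloc_new_process() can be checked in isolation. The sketch below carves a kernel stack the same way and reports where the trap state and the context end up. All types here are stand-ins: the trap state is modeled as 16 words because that is what return_from_trap consumes (1 for the data segment, 8 for popal, 2 for the ISR number and error code, 5 for iret back to ring 3), while the 5-word context and the one-page KSTACK_SIZE are assumptions.

```c
#include <stdint.h>

#define KSTACK_SIZE 4096u   /** Assumption: one page of kernel stack. */

/** Stand-in trap state: 16 words as popped by return_from_trap. */
typedef struct { uint32_t words[16]; } interrupt_state_t;

/** Stand-in saved context: the five fields are an assumption. */
typedef struct { uint32_t edi, esi, ebx, ebp, eip; } process_context_t;

/** Addresses of the two structures carved from a kernel stack. */
typedef struct {
    uint32_t trap_state;
    uint32_t context;
} kstack_layout_t;

/** Place the trap state at the stack top, the context right below. */
static kstack_layout_t
carve_kstack(uint32_t kstack)
{
    kstack_layout_t layout;
    uint32_t sp = kstack + KSTACK_SIZE;

    sp -= (uint32_t) sizeof(interrupt_state_t);
    layout.trap_state = sp;

    sp -= (uint32_t) sizeof(process_context_t);
    layout.context = sp;

    return layout;
}
```

For a stack page at 0x00800000, the trap state starts at 0x00800FC0 and the context at 0x00800FAC. When the scheduler switches to the context and its saved EIP is consumed, the stack pointer sits exactly at the trap state, which is what return_from_trap expects.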

Finally, the scheduler loop now goes:

// src/process/scheduler.c

/** CPU scheduler, never leaves this function. */
void
scheduler(void)
{
    cpu_state.running_proc = NULL;

    while (1) {     /** Loop indefinitely. */
        /** Look for a ready process in ptable. */
        process_t *proc;
        for (proc = ptable; proc < &ptable[MAX_PROCS]; ++proc) {
            if (proc->state != READY)
                continue;

            info("scheduler: going to context switch to '%s'", proc->name);

            /** Set up TSS for this process, and switch page directory. */
            gdt_switch_tss(&(cpu_state.task_state), proc);
            paging_switch_pgdir(proc->pgdir);
            
            cpu_state.running_proc = proc;
            proc->state = RUNNING;

            /** Do the context switch. */
            context_switch(&(cpu_state.scheduler), proc->context);

            /** It switches back, switch to kernel page directory. */
            paging_switch_pgdir(kernel_pgdir);
            cpu_state.running_proc = NULL;
        }
    }
}

A Quick Recap of System State

Let's do a quick recap of how the system state evolves/changes since booting ✭:

  • Which stack (ESP) is in use:
    • Init-phase code and the scheduler (whenever switched back): uses the kernel booting stack
    • A process in normal user mode execution: uses its user stack of virtual address below 0x40000000, mapped in its page table to some frames in physical memory region 8MiB - 128MiB
    • A process in creation, or when it issues a system call to trap into kernel mode, or when it gets interrupted by external hardware such as the timer so the interrupt handler runs: uses its kernel stack, which is a page allocated on kernel heap
  • Which instruction (EIP) the CPU is running:
    • Any kernel code: in the code section of the kernel image, loaded into physical memory region 1MiB - end of .shstrtab
    • Our temporarily embedded user process ELF binary: somewhere in the kernel image as well
    • ELF loaded into user process address space (copied from an embedded binary or loaded from disk): in its user code section of virtual address starting at 0x20000000, mapped in its page table to some frames in physical memory region 8MiB - 128MiB
  • Which page directory is in use:
    • Init-phase code and the scheduler: the kernel page directory
    • A process, no matter in user mode or in trap: the process's page directory

Progress So Far

To try out user mode execution, let's force our init program to do a page fault and see if our current page fault handler catches this page fault and reports that it is from user mode! Code @ src/process/init.s:

.global start
.type start, @function
start:

    /** Trigger a page fault by writing to kernel-mapped memory address. */
    movl $0x00600000, %eax
    movl $123, (%eax)

    ret

After booting up, the terminal should show the page fault handler catching this fault and reporting that it came from user mode.

We will talk about completing the page fault (PF) handler and the syscalls handler in the next chapter. There are two more exceptions other than PF that are worth catching:

  • Double fault (DF): when the CPU generates a fault that isn't captured & resolved by the OS, it generates a double fault. When a DF isn't captured, or if a fault happens in a DF handler, the CPU generates a triple fault and resets itself. A rebooting loop in OS development typically means a triple fault situation.
  • General protection fault (GPF): as the name suggests, see this page.
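For the page fault triggered by init above, the handler can tell that it came from user mode by inspecting the error code the CPU pushes. On x86, bit 0 of that code says whether the page was present (a protection violation) or not, bit 1 whether the access was a write, and bit 2 whether the CPU was in user mode. The decoder below is a hypothetical sketch and is not the handler Hux implements in the next chapter.

```c
#include <stdbool.h>
#include <stdint.h>

/** Low three bits of the x86 page fault error code. */
#define PF_PRESENT 0x1u     /** 1: protection violation, 0: not present. */
#define PF_WRITE   0x2u     /** 1: write access, 0: read access. */
#define PF_USER    0x4u     /** 1: fault raised while in user mode. */

/** Decoded view of a page fault error code. */
typedef struct {
    bool present;
    bool write;
    bool user;
} pf_info_t;

/** Decode the low three bits of a page fault error code. */
static pf_info_t
decode_pf_error(uint32_t err_code)
{
    pf_info_t info;
    info.present = (err_code & PF_PRESENT) != 0;
    info.write   = (err_code & PF_WRITE) != 0;
    info.user    = (err_code & PF_USER) != 0;
    return info;
}
```

Under the mappings set up in this chapter, the write to 0x00600000 hits a present, kernel-only page from ring 3, so the error code would be expected to decode as present, write, and user all set (value 0x7).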

Current repo structure:

hux-kernel
├── Makefile
├── scripts
│   ├── gdb_init
│   ├── grub.cfg
│   └── kernel.ld
├── src
│   ├── boot
│   │   ├── boot.s
│   │   ├── elf.h
│   │   └── multiboot.h
│   ├── common
│   │   ├── debug.c
│   │   ├── debug.h
│   │   ├── port.c
│   │   ├── port.h
│   │   ├── printf.c
│   │   ├── printf.h
│   │   ├── string.c
│   │   ├── string.h
│   │   ├── types.c
│   │   └── types.h
│   ├── device
│   │   ├── keyboard.c
│   │   ├── keyboard.h
│   │   ├── timer.c
│   │   └── timer.h
│   ├── display
│   │   ├── terminal.c
│   │   ├── terminal.h
│   │   └── vga.h
│   ├── interrupt
│   │   ├── idt-load.s
│   │   ├── idt.c
│   │   ├── idt.h
│   │   ├── isr-stub.s
│   │   ├── isr.c
│   │   ├── isr.h
│   │   └── syscall.h
│   ├── memory
│   │   ├── gdt-load.s
│   │   ├── gdt.c
│   │   ├── gdt.h
│   │   ├── kheap.c
│   │   ├── kheap.h
│   │   ├── paging.c
│   │   ├── paging.h
│   │   ├── slabs.c
│   │   └── slabs.h
│   ├── process
│   │   ├── init.s
│   │   ├── layout.h
│   │   ├── process.c
│   │   ├── process.h
│   │   ├── scheduler.c
│   │   ├── scheduler.h
│   │   └── switch.s
│   └── kernel.c