17. Time‐Sharing Scheduler - josehu07/hux-kernel GitHub Wiki

Processes space-share the memory and time-share the CPU. One of the major tasks of an OS kernel is to provide support for orchestrating all the running processes on limited physical resource. On the memory side, the virtual memory system enables such sharing. On the CPU side, it is the job of the scheduler.

Main References of This Chapter

Scan through them before going forth:

Time-Sharing & Timer Interrupt Logic

The idea of time-sharing may look less intuitive than space-sharing (where processes take up the space they need in a space-oriented resource, e.g., main memory space or disk drive space) at first glance. Basically, time-sharing means that processes share a time-oriented execution unit (e.g., single-CPU computation or bus communication) by taking up the unit for a short period of time and then switching to the next one in queue.

The OS kernel thus needs a way to keep track of when to put the current running process aside and schedule the next one. Naturally, the timer can take this job. At every timer interrupt, we check if the running process on the CPU has run out of its current share, and if so, context-switch to the scheduler context to let it decide who runs next.

Our new timer interrupt implementation @ src/device/timer.c:

/**
 * Timer interrupt handler registered for IRQ # 0.
 * Mostly borrowed from xv6's `trap()` code.
 */
static void
timer_interrupt_handler(interrupt_state_t *state)
{
    (void) state;   /** Unused. */
    
    /** Increment global timer tick. */
    timer_tick++;

    /**
     * Check all sleeping processes, set them back to READY if desired
     * ticks have passed.
     */
    for (process_t *proc = ptable; proc < &ptable[MAX_PROCS]; ++proc) {
        if (proc->state == BLOCKED && proc->block_on == ON_SLEEP
            && timer_tick >= proc->target_tick) {
            proc->target_tick = 0;
            process_unblock(proc);
        }
    }

    process_t *proc = running_proc();
    bool user_context = (state->cs & 0x3) == 3      /** DPL field is 3. */
                        && proc != NULL;

    /**
     * If we are now in the process's context and it is set to be killed,
     * exit the process right now.
     */
    if (user_context && proc->killed)
        process_exit();

    /**
     * If we are in a process and the process is in RUNNING state, yield
     * to the scheduler to force a new scheduling decision. Could happen
     * to a provess in kernel context (during a syscall) as well.
     */
    if (proc != NULL && proc->state == RUNNING) {
        proc->state = READY;
        yield_to_scheduler();   // Will come to this later.
    }

    /** Re-check if we get killed since the yield. */
    if (user_context && proc->killed)
        process_exit();
}

Our ISR handler also needs some modification. Previously, for any external device IRQ, we send back the EOI (end-of-interrupt) signal to the PIC controller only after the handler returns. It causes an unapparent but crucial problem here, that a process A could yield to the scheduler in the middle of the timer interrupt handler, leaving the PIT timer screened for the next few processes scheduled (so they run to completion or until they block themselves), until process A gets scheduled again and leaves the timer interrupt handler.

To solve this issue, we move the sending of EOI for timer interrupts up, before calling the handler. It is OK to allow such early EOI because interrupts are automatically disabled when entering an interrupt gate routine.

Modifications @ src/interrupt/isr.c:

/**
 * ISR handler written in C.
 *
 * Receives a pointer to a structure of interrupt state. Handles the
 * interrupt and simply returns. Can modify interrupt state through
 * this pointer if necesary.
 */
void
isr_handler(interrupt_state_t *state)
{
    uint8_t int_no = state->int_no;

    ...

    /** An IRQ-translated interrupt from external device. */
    else if (int_no <= 47) {
        uint8_t irq_no = state->int_no - 32;

        /** Call actual ISR if registered. */
        if (isr_table[int_no] == NULL)
            _missing_handler(state);
        else {
            /**
             * Send back timer interrupt EOI to PIC early because it may
             * yield in the timer interrupt to the scheduler, leaving
             * the PIT timer blocked for the next few processes scheduled.
             */
            if (state->int_no == INT_NO_TIMER)
                _pic_send_eoi(irq_no);
            isr_table[int_no](state);
            if (state->int_no != INT_NO_TIMER)
                _pic_send_eoi(irq_no);
        }
    }

    ...
}

Implementing a Scheduling Algorithm

Mechanism: Yielding

We define yielding as the action of a process giving up the CPU to the scheduler. It is basically just a context-switch to the scheduler context.

Code @ src/process/scheduler.c:

/** Return back to the scheduler context. */
void
yield_to_scheduler(void)
{
    process_t *proc = running_proc();
    assert(proc->state != RUNNING);
    context_switch(&(proc->context), cpu_state.scheduler);
}

There are two cases of a process yielding to the scheduler in Hux ✭:

  1. Forced yielding at a timer interrupt if its current share has run out;
  2. Voluntary blocking to wait for something to happen. Possible cases include:
    1. On sleep, when it calls the sleep() syscall;
    2. On wait, when it calls the wait() syscall to wait for a child process to exit and none has finished yet;
    3. On keyboard input, when the kbdstr() syscall handler waits for the next keyboard event;
    4. (In persistence chapters, ) On regular file I/O;
    5. (In the future, ) On communicating with other device types.
// src/process/process.h

/** Process state. */
enum process_block_on {
    NOTHING,
    ON_SLEEP,
    ON_WAIT,
    ON_KBDIN
    // ... (TODO)
};
typedef enum process_block_on process_block_on_t;

enum process_state {
    UNUSED,     /** Indicates PCB slot unused. */
    INITIAL,
    READY,
    RUNNING,
    BLOCKED,
    TERMINATED
};
typedef enum process_state process_state_t;

Policy: Weighted Round-Robin (RR)

We implement a simple scheduling policy in Hux, which we call Weighted Round-Robin. Since the introduction to user mode execution, Hux has been using a basic Round-Robin (RR) scheduler borrowed from xv6 that simply loops over the process table and pick a ready process to run for one timer tick. We now introduce a new field, timeslice, to each process PCB, representing the number of contiguous timer ticks a process could get scheduled in a row every time it is picked.

// src/process/process.h

struct process {
    ...
    uint8_t timeslice;              /** Timeslice length for scheduling. */
};
typedef struct process process_t;

The valid values for timeslice lengths are integers from 1 to 16. The init process has a default timeslice length of 1. All other processes are created by forking existing process. Hence, we extend the fork() syscall interface to take in an argument which is the timeslice length of the new child process.

If the timeslice argument is 0, child process inherits its timeslice value from the parent process. Otherwise, use the given value.

int8_t fork(uint8_t timeslice);

Implementation @ src/process/sysproc.c:

int32_t
syscall_fork(void)
{
    uint32_t timeslice;

    if (!sysarg_get_uint(0, &timeslice))
        return SYS_FAIL_RC;
    if (timeslice > 16) {
        warn("fork: timeslice value cannot be larger than 16");
        return SYS_FAIL_RC;
    }

    if (timeslice == 0)
        return process_fork(running_proc()->timeslice);
    return process_fork(timeslice);
}

In the scheduler loop, instead of always moving to the next process in ptable order, it should keep choosing the current process until its timeslice has run out. Code @ src/process/scheduler.c:

/** CPU scheduler, never leaves this function. */
void
scheduler(void)
{
    cpu_state.running_proc = NULL;

    while (1) {     /** Loop indefinitely. */
        /** Look for a ready process in ptable. */
        process_t *proc;
        for (proc = ptable; proc < &ptable[MAX_PROCS]; ++proc) {
            if (proc->state != READY)
                continue;

            /** Schedule this one for at most its `timeslice` ticks. */
            uint32_t next_sched_tick = timer_tick + proc->timeslice;
            while (timer_tick < next_sched_tick && proc->state == READY) {
                // info("scheduler: going to context switch to %d - '%s'",
                //      proc->pid, proc->name);

                /** Set up TSS for this process, and switch page directory. */
                gdt_switch_tss(&(cpu_state.task_state), proc);
                paging_switch_pgdir(proc->pgdir);
                
                cpu_state.running_proc = proc;
                proc->state = RUNNING;

                /** Do the context switch. */
                context_switch(&(cpu_state.scheduler), proc->context);

                /** It switches back, switch to kernel page directory. */
                paging_switch_pgdir(kernel_pgdir);
                cpu_state.running_proc = NULL;
            }
        }
        }
    }
}

It is very useful to allow different processes, which are likely to represent different tasks, to have different scheduling properties. Two things are worth noticing about the above weighted RR scheduling implementation. First, it is a slice-based scheduler, as compared to priority-based scheduler used in many modern OSes. Second, it does not allow changing the timeslice value of a process once it is created, making it somewhat limited for processes with evolving computation need over time ✭.

Notes on Locking & Interrupt Enable/Disable

As said in the last chapter, Hux so far does not enforce any form of protection over shared data among processes. Though such protection might seem unnecessary for a single-CPU architecture at first glance, it is indeed critical to the correctness of the system due to the nature of time-sharing.

Imagine that process A is in the middle of some syscall, iterating over the global process table or waiting for keyboard input. If a timer interrupt happens at this time and the scheduler picks process B to run, B could issue some syscall that manipulate the process table or take over the keyboard input destination buffer. When it comes back to A, things could go wrong very badly.

Generally, operating systems provide a wide variety of synchronization primitives, such as various kinds of locks, to protect over shared data/resource. A process "acquires" an associated lock before touching the shared data and "releases" the lock after the critical section is done.

In Hux, due to the fact that there is only one CPU core, let's first introduce a simpler solution than locks that works in most of the cases. The most common case of breaking a critical section is when a timer interrupt happens in the middle and a different process gets scheduled (, unless a process voluntarily block during a critical section). Hence, in some cases, it is enough to just disable interrupts on the CPU (cli) before entering a critical section and enable interrupts (sti) afterwards.

Introduce two new fields to the global CPU state @ src/process/scheduler.h:

/** Per-CPU state (we only have a single CPU). */
struct cpu_state {
    ...
    bool int_enabled;               /** Remembered interrupt e/d state. */
    uint8_t cli_depth;              /** Number of pushed `cli`s. */
};
typedef struct cpu_state cpu_state_t;

Then, mimicking the xv6 coding style, we make two helper functions for pushing & popping a "cli stack". The CPU could be in interrupt-enabled or -disabled state upon a push to an empy cli stack, when it remember this old state and disables interrupts. Upon the point when the stack goes empty, it returns to the remembered state. Interrupts are guaranteed to be disabled whenever the stack is non-empty. pops must be one-one matched to pushes in code.

// src/common/intstate.c

/** Check if interrupts are enabled. */
inline bool
interrupt_enabled(void)
{
    uint32_t eflags;
    asm volatile ( "pushfl; popl %0" : "=r" (eflags) : );
    return eflags & 0x0200;     /** IF flag. */
}


/** Disable interrupts if not yet so. */
void
cli_push(void)
{
    bool was_enabled = interrupt_enabled();
    asm volatile ( "cli" );

    /**
     * If cli stack previously empty, remember the previous interrupt
     * enable/disable state.
     */
    if (cpu_state.cli_depth == 0)
        cpu_state.int_enabled = was_enabled;
    cpu_state.cli_depth++;
}

/**
 * Restore interrupt e/d state to previous state if all `cli`s have been
 * popped. Must be one-one mapped to `cli_push()` in code.
 */
void
cli_pop(void)
{
    assert(!interrupt_enabled());
    assert(cpu_state.cli_depth > 0);

    cpu_state.cli_depth--;
    if (cpu_state.cli_depth == 0 && cpu_state.int_enabled)
        asm volatile ( "sti" );
}


// src/common/intstate.h
bool interrupt_enabled(void);

void cli_push(void);
void cli_pop(void);

Several places in the Hux code need to be enhanced by adding cli_push() & cli_pop() pairs, if we are going for this method only:

  • Terminal printing & manipulation
  • Keyboard input listening
  • Accessing timer ticks value
  • Accessing the process table
  • Getting the running process
  • Changing process state (might not be atomic)
  • Allocating/freeing on kernel heap
  • Setting/clearing the frame bitmap
  • Entering/leaving the scheduler context - cli must have been pushed whenever inside the scheduler context. This implies two modifications: 1. Must do cli_push() before yielding to the scheduler, and 2. The first action of a newly allocated process should be a cli_pop() to clear the one held in the scheduler context.

On multi-core (SMP and further) machines, simply disabling/enabling interrupts is neither enough nor efficient in almost any case. It makes more sense to at least split it into different locks that protect different resources. Proper handling of concurrency & synchronization is a big topic in any modern, practical OS kernel, and is where tons of optimizations go. We will switch to mutex locks in later chapters.

Progress So Far

Let's try out our weighted RR scheduler! Here is an example user program that demonstrates the effect of timeslice values. The parent process forks three children processes with different timeslice lengths, running the same computation workload. We expect the one with longer timeslice length to finish earlier.

static void
_test_child_workload(void)
{
    int32_t res = 0;
    for (int32_t i = 12345; i < 57896; ++i)
        for (int32_t j = i; j > 12345; --j)
            res += (i * j) % 567;
    printf("res %d: %d\n", getpid(), res);
}

void
main(void)
{
    _shell_welcome_logo();

    int8_t i;

    printf("parent: forking...\n");
    for (i = 1; i <= 3; ++i) {
        int8_t pid = fork(i*4);
        if (pid < 0)
            error("parent: forking child i=%d failed", i);
        if (pid == 0) {
            // Child.
            _test_child_workload();
            exit();
        } else
            printf("parent: forked child pid=%d, timeslice=%d\n", pid, i*4);
    }

    printf("parent: waiting...\n");
    for (i = 1; i <= 3; ++i) {
        int8_t pid = wait();
        printf("parent: waited child pid=%d\n", pid);
    }

    _fake_halt();
}

This should produce a terminal window as the following after booting up:

Current repo structure:

hux-kernel
├── Makefile
├── scripts
│   ├── gdb_init
│   ├── grub.cfg
│   └── kernel.ld
├── src
│   ├── boot
│   │   ├── boot.s
│   │   ├── elf.h
│   │   └── multiboot.h
│   ├── common
│   │   ├── debug.c
│   │   ├── debug.h
│   │   ├── intstate.c
│   │   ├── intstate.h
│   │   ├── port.c
│   │   ├── port.h
│   │   ├── printf.c
│   │   ├── printf.h
│   │   ├── string.c
│   │   ├── string.h
│   │   ├── types.c
│   │   └── types.h
│   ├── device
│   │   ├── keyboard.c
│   │   ├── keyboard.h
│   │   ├── sysdev.c
│   │   ├── sysdev.h
│   │   ├── timer.c
│   │   └── timer.h
│   ├── display
│   │   ├── sysdisp.c
│   │   ├── sysdisp.h
│   │   ├── terminal.c
│   │   ├── terminal.h
│   │   └── vga.h
│   ├── interrupt
│   │   ├── idt-load.s
│   │   ├── idt.c
│   │   ├── idt.h
│   │   ├── isr-stub.s
│   │   ├── isr.c
│   │   ├── isr.h
│   │   ├── syscall.c
│   │   └── syscall.h
│   ├── memory
│   │   ├── gdt-load.s
│   │   ├── gdt.c
│   │   ├── gdt.h
│   │   ├── kheap.c
│   │   ├── kheap.h
│   │   ├── paging.c
│   │   ├── paging.h
│   │   ├── slabs.c
│   │   ├── slabs.h
│   │   ├── sysmem.c
│   │   └── sysmem.h
│   ├── process
│   │   ├── layout.h
│   │   ├── process.c
│   │   ├── process.h
│   │   ├── scheduler.c
│   │   ├── scheduler.h
│   │   ├── switch.s
│   │   ├── sysproc.c
│   │   └── sysproc.h
│   └── kernel.c
├── user
│   ├── lib
│   │   ├── debug.h
│   │   ├── malloc.c
│   │   ├── malloc.h
│   │   ├── printf.c
│   │   ├── printf.h
│   │   ├── string.c
│   │   ├── string.h
│   │   ├── syscall.h
│   │   ├── syscall.s
│   │   ├── syslist.s
│   │   ├── types.c
│   │   └── types.h
│   └── init.c