System calls and interrupts - ghaerr/elks GitHub Wiki

System Calls and Interrupts

All system calls and interrupts are funneled through the exact same entry point, and it's here that the magic of precisely controlling the system's response to them happens. Let's take a quick tour through this very interesting entry point, _irqit.

System Calls

System calls in ELKS are implemented by passing each argument in a separate register, then executing an INT 80h. This instruction, "Software Interrupt", saves the CPU flags register (F), the current code segment register (CS), and the current instruction pointer (IP) on the stack, which is pointed to by the stack segment and stack pointer pair (SS:SP).

Hardware Interrupts

Incoming hardware interrupts to the CPU do the same thing: the F, CS and IP registers are saved on the current stack.

Next, the CPU looks up the interrupt number in a table in low memory and loads a new code segment and instruction pointer (CS:IP) from that table. Since this is "two words", we call it a far pointer.
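The lookup can be sketched in C. This is a minimal model rather than ELKS code: the `mem` array stands in for low memory, and `far_ptr`, `read_word` and `ivt_lookup` are names invented for illustration. It assumes the standard 8086 layout, where vector n occupies four bytes at address 4 * n, with the offset word first and the segment word second.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a far pointer: the two words loaded into CS:IP. */
struct far_ptr {
    uint16_t ip;   /* offset, stored first in the vector */
    uint16_t cs;   /* segment, stored second */
};

/* Read a little-endian 16-bit word from a simulated memory image. */
uint16_t read_word(const uint8_t *mem, uint32_t addr)
{
    return (uint16_t)(mem[addr] | (mem[addr + 1] << 8));
}

/* Look up interrupt vector n: the entry lives at address n * 4. */
struct far_ptr ivt_lookup(const uint8_t *mem, uint8_t n)
{
    struct far_ptr p;
    uint32_t addr = (uint32_t)n * 4;

    p.ip = read_word(mem, addr);
    p.cs = read_word(mem, addr + 2);
    return p;
}
```

With 256 possible vectors of four bytes each, the whole table fits in the first 1024 bytes of memory.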

The Kernel Interrupt Table

In ELKS, each interrupt location (called a vector) in the low memory table points to an entry in yet another table in the kernel. Each entry is four bytes: a 3-byte near call to _irqit, followed by a byte holding the interrupt number. Here are the entries for IRQ 0 and interrupt 80h, which are the timer (clock) and system call interrupts, respectively:

_irq0:                  // Timer
        call    _irqit
        .byte   0

_syscall_int:           // Syscall
        call    _irqit
        .byte   128

The Syscall and Interrupt Funnel

Using the above table, every hardware and software interrupt is directed to a call to _irqit followed by a byte holding the interrupt number. Later, you'll see how the return address saved by that call is used as a data pointer to fetch the interrupt number. This works because the call pushes the address of the next "instruction" onto the stack, and that address is the location of the interrupt number byte. The call is a near call, so only IP is pushed; the kernel code segment already in CS supplies the segment half of the pointer. The call to _irqit never returns, as you'll see. The magic begins!
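The "return address as data pointer" trick can be modeled in a few lines of C. This is only a sketch: the `code` array stands in for the kernel code segment, and `make_entry`, `pushed_return_ip` and `irq_from_return_ip` are hypothetical helpers. It assumes the 3-byte call is encoded as opcode 0xE8 plus a 16-bit displacement, whose value does not matter here.

```c
#include <assert.h>
#include <stdint.h>

#define ENTRY_SIZE 4   /* 3-byte call plus 1 data byte */

/* Build one table entry at code[offset]. */
void make_entry(uint8_t *code, uint16_t offset, uint8_t irq)
{
    code[offset + 0] = 0xE8;   /* near call opcode */
    code[offset + 1] = 0x00;   /* displacement low (placeholder) */
    code[offset + 2] = 0x00;   /* displacement high (placeholder) */
    code[offset + 3] = irq;    /* data byte, never executed */
}

/* The call pushes the address of the byte that follows the call instruction. */
uint16_t pushed_return_ip(uint16_t entry_offset)
{
    return (uint16_t)(entry_offset + 3);
}

/* What _irqit does later: treat the return address as a data pointer. */
uint8_t irq_from_return_ip(const uint8_t *code, uint16_t return_ip)
{
    return code[return_ip];
}
```

Because the byte after the call is data rather than an instruction, returning to it would execute garbage, which is one reason the call must never return.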

The _irqit Entry Point

We're going to take a step-by-step analysis of each instruction executed in this amazing piece of code. It is here where the extremely precise handling of processes (making system calls) and interrupts (especially including the hardware timer interrupt) occurs. To study on your own, this file is in elks/arch/i86/kernel/irqtab.S.

/*
!
!       On entry CS:IP is all we can trust
!
!       There are three possible cases to cope with
!
!       Interrupted user mode or syscall (_gint_count == 0)
!               Switch to process's kernel stack
!               Optionally, check (SS == current->t_regs.ss)
!               and panic on failure
!               On return, task switch allowed
!
!       Interrupted kernel mode, interrupted kernel task
!               or second interrupt (_gint_count == 1)
!               Switch to interrupt stack
!               On return, no task switch allowed
!
!       Interrupted interrupt service routine (_gint_count > 1)
!               Already using interrupt stack, keep using it
!               On return, no task switch allowed
!
!       We do all of this to avoid per process interrupt stacks and
!       related nonsense. This way we need only one dedicated int stack
!
*/

_irqit:
//
//      Make room
//
        push    %ds
        push    %si
        push    %di

The explanation at the top might take a while to understand, but let's start by commenting on each instruction. The first three instructions push the registers DS, SI and DI so that they can be used as working registers. Remember, this same code is executed for hardware and software interrupts and, so far, in exactly the same way.

_irqit Entry Stack

What exactly does the stack look like at this point? The system could be in absolutely any state. An application could be making a system call, in which case SS=DS and both are set to the application data segment. A hardware interrupt could have arrived during application execution, in which case SS=DS again, set to the application data segment. A hardware interrupt could have occurred during kernel execution (that is, after having already entered this code, while running kernel code), in which case SS=DS, set to the kernel data segment. Or it could be an interrupt that is interrupting an interrupt... we'll get to these cases later. The point is that when the system executes this code, there is a stack segment and stack pointer (SS:SP) which was just used to push DS, SI, and DI.

So, there are now seven words pushed onto the interrupted stack (growing downwards):

------
|  F | Flags word at time of interrupt
------
| CS | Code segment at time of interrupt
------
| IP | Instruction pointer at time of interrupt
------
| IP | Address of next (return) instruction after "call _irqit"
------
| DS | Saved DS at time of interrupt
------
| SI | Saved SI at time of interrupt
------
| DI | Saved DI at time of interrupt
------

Hopefully this all makes sense! If not, grab an 8086 processor handbook and read up on interrupts, CPU registers and segments; it will help a lot.
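One way to double-check the picture is to write the seven words down as a C struct. The struct name and member names below are made up for illustration. Members are listed from the lowest address (where SS:SP points after the three pushes) upward, which is the reverse of the order in which they were pushed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of the interrupted stack at this point in _irqit. */
struct entry_frame {
    uint16_t di;        /* pushed last, so SS:SP points here */
    uint16_t si;
    uint16_t ds;
    uint16_t ret_ip;    /* pushed by "call _irqit" */
    uint16_t ip;        /* pushed by the interrupt */
    uint16_t cs;
    uint16_t flags;     /* pushed first */
};
```

The word at offset 6 is the one that will later be popped into DI to find the interrupt number.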

Load DS with the kernel data segment

//
//      Recover data segment
//
        mov     %cs:ds_kernel,%ds

CS was loaded with the kernel code segment from the low memory interrupt table, which pointed to the _irqXX entries discussed above. A CS override is therefore used to load the now-available DS register with the kernel data segment value. That value is held in ds_kernel, a global kernel variable placed in the kernel code segment rather than the data segment. It has to be there, or we couldn't address it, since DS isn't valid until this point.

Determine whether interrupt occurred during application execution

//
//      Determine which stack to use
//
        cmpw    $1,_gint_count
        jc      utask           // We were in user mode
        jz      itask           // Using a process's kernel stack
ktask:                          // Already using interrupt stack

Things get a little trickier now: a global variable _gint_count (general interrupt count) is compared to 1. This variable (as will soon be seen) counts the number of times this code has been "reentered". If the interrupt occurred while the system was executing a user application (including an application making a system call, or a clock tick during an application program), this value will be 0. The "jc utask" (jump if carry) is taken if _gint_count is less than 1. Let's follow that path, since that's the normal case.
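The three-way branch can be restated in C. This is a sketch under the assumption that _gint_count is an unsigned 16-bit word; `choose_stack` and the enum names are invented here. A compare computes count - 1 without storing the result: the carry flag is set on an unsigned borrow (count < 1) and the zero flag is set when count equals 1.

```c
#include <assert.h>
#include <stdint.h>

enum entry_path { PATH_UTASK, PATH_ITASK, PATH_KTASK };

/* Model of "cmpw $1,_gint_count" followed by jc / jz / fall through. */
enum entry_path choose_stack(uint16_t gint_count)
{
    int carry = gint_count < 1;    /* CF: borrow out of count - 1 */
    int zero = gint_count == 1;    /* ZF: count - 1 is zero */

    if (carry)
        return PATH_UTASK;   /* jc utask: interrupted user mode or syscall */
    if (zero)
        return PATH_ITASK;   /* jz itask: switch to the interrupt stack */
    return PATH_KTASK;       /* ktask: already on the interrupt stack */
}
```

A single compare instruction thus selects among all three cases listed in the header comment.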

Load SI to point to the current task structure

utask:
        mov     current,%si
//
//      Switch to kernel stack
//

Here, the value of a global kernel variable current is loaded into SI. This global value points to the kernel's per-task data structure (struct task *), which is a table in the kernel data segment, one for each process.

At this point the kernel "knows" that an application was running when the interrupt occurred, whether it was a software system call interrupt or a hardware interrupt, because _gint_count is less than 1 (that is, 0). Read the comments above this entry point again; they should be starting to make a little more sense now.

Switch to the kernel stack for the current process

Now that SI points to the current task structure, we have a place to save more registers so we can do more work. A portion of the task structure is used to save all the CPU registers, which is a requirement, since ultimately the system could desire to perform a task-switch after the system call or hardware interrupt has fully completed.

        add     $TASK_USER_DI,%si

This adds an offset to SI so that it points to exactly where the registers should be saved in the task structure. Here's the declaration of that structure in include/arch/types.h:

/* ordering of saved registers on kernel stack after syscall/interrupt entry*/
struct _registers {
    /* SI offset                 0   2        4   6   8  10  12*/
    __u16       ax, bx, cx, dx, di, si, orig_ax, es, ds, sp, ss;
};

Note in the comment above the 'di' struct member, showing a "0" offset, which is now where SI points.
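The offsets in that comment can be checked mechanically with offsetof. In this sketch, `regs_model` copies the member order shown above with uint16_t standing in for __u16, and `SI_OFFSET` is a hypothetical macro that reports a member's position relative to the di member, which is where SI points.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same member order as struct _registers; uint16_t stands in for __u16. */
struct regs_model {
    uint16_t ax, bx, cx, dx, di, si, orig_ax, es, ds, sp, ss;
};

/* Offset of a member as seen from SI, which points at the di member. */
#define SI_OFFSET(member) \
    ((int)offsetof(struct regs_model, member) - \
     (int)offsetof(struct regs_model, di))
```

Members declared before di come out with negative offsets, which matters once the stack pointer is set to the same address.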

Save all registers into the current task structure

//
//      Save segment, index, BP and SP registers
//
save_regs:
        incw    _gint_count
        pop     (%si)           // DI
        pop     2(%si)          // SI
        pop     8(%si)          // DS
        pop     %di             // Pointer to interrupt number
        push    %bp             // BP
        mov     %sp,10(%si)     // SP
        mov     %ss,12(%si)     // SS
        mov     %es,6(%si)      // ES
        mov     %ax,4(%si)      // orig_ax

The global _gint_count variable is incremented (to 1 in this case). That will be used later, when interrupts are re-enabled, to cause the system to operate differently, since that case would not be interrupting an application, but interrupting an interrupt.

The next three instructions save the just-pushed DI, SI and DS registers to the task structure, using offsets from SI that match the comments (see the _registers declaration above).

After those three registers have been saved, the top of stack is popped into DI. From above, we know that is the saved IP from the "call _irqit", which "points" at the byte following it, which is the hard-coded interrupt number. This is pretty tricky, so verify it.

Then, the current BP register is pushed onto the original stack.

The last four instructions save the interrupted SP, SS, ES and AX registers into the task struct. Note that we're still executing on that original interrupted stack; no stack switch has been done yet.
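To verify the sequence, it can be replayed in a small C model. Everything here is illustrative: `cpu_model` holds only the registers that save_regs touches, the interrupted stack is an array of words, and `r` plays the role of SI plus offsets.

```c
#include <assert.h>
#include <stdint.h>

struct regs_model {
    uint16_t ax, bx, cx, dx, di, si, orig_ax, es, ds, sp, ss;
};

/* Hypothetical CPU state; sp is a byte offset into the stack array. */
struct cpu_model {
    uint16_t ax, bp, di, es, sp, ss;
    uint16_t *stack;
};

uint16_t pop_word(struct cpu_model *cpu)
{
    uint16_t v = cpu->stack[cpu->sp / 2];
    cpu->sp += 2;
    return v;
}

void push_word(struct cpu_model *cpu, uint16_t v)
{
    cpu->sp -= 2;
    cpu->stack[cpu->sp / 2] = v;
}

/* Mirror of the save_regs instructions, one statement per instruction. */
void save_regs_model(struct cpu_model *cpu, struct regs_model *r,
                     uint16_t *gint_count)
{
    (*gint_count)++;             /* incw _gint_count */
    r->di = pop_word(cpu);       /* pop (%si) */
    r->si = pop_word(cpu);       /* pop 2(%si) */
    r->ds = pop_word(cpu);       /* pop 8(%si) */
    cpu->di = pop_word(cpu);     /* pop %di: pointer to interrupt number */
    push_word(cpu, cpu->bp);     /* push %bp */
    r->sp = cpu->sp;             /* mov %sp,10(%si) */
    r->ss = cpu->ss;             /* mov %ss,12(%si) */
    r->es = cpu->es;             /* mov %es,6(%si) */
    r->orig_ax = cpu->ax;        /* mov %ax,4(%si) */
}
```

Replaying it shows that the pushed BP lands in the slot that held the call's return address, so the saved SP points at BP with IP, CS and F directly above it.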

//
//      Load new segment and SP registers
//
        mov     %si,%sp
        mov     %ds,%si
        mov     %si,%ss
        mov     %si,%es

Here's where some magic happens. The value of SI, which is now pointing to the current task structure DI member in the kernel, is used to set the SP register. The next instruction loads the kernel data segment DS into SI, which is then used to load SS and ES.

A stack switch has just been performed. We're now "running on the kernel stack", since SS is now the kernel data segment and SP is pointing at the newly saved DI in the current task struct.

The ELKS Task Structure

Here's the declaration of the portion of the ELKS task structure we've just seen used:

struct task {
...

    __u16                       t_kstackm;      /* To detect stack corruption */
    __u8                        t_kstack[KSTACK_BYTES];
    __registers                 t_regs;
};

We're looking at the very last section of the task structure. The t_regs struct is holding the saved registers (shown above), and notice the kernel stack is just below that. That's the area used for the kernel stack during each application's system call. Notice that the registers are saved above the stack, and the code we've just analyzed has used the SI register to index that area and "lay the registers down" through indexed addressing.
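The relationship between the saved registers and the kernel stack can be checked with a layout sketch. The names `task_model`, `initial_ksp` and `kstack_room` are invented, KSTACK_BYTES_MODEL is an arbitrary stand-in (not the real ELKS value), and the elided leading members of struct task are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define KSTACK_BYTES_MODEL 64   /* hypothetical size, not the ELKS value */

struct regs_model {
    uint16_t ax, bx, cx, dx, di, si, orig_ax, es, ds, sp, ss;
};

/* Only the tail of the task structure is modeled here. */
struct task_model {
    uint16_t t_kstackm;
    uint8_t t_kstack[KSTACK_BYTES_MODEL];
    struct regs_model t_regs;
};

/* Initial kernel SP: the offset of t_regs.di within the structure. */
size_t initial_ksp(void)
{
    return offsetof(struct task_model, t_regs) +
           offsetof(struct regs_model, di);
}

/* Bytes that can be pushed from there before reaching t_kstackm. */
size_t kstack_room(void)
{
    return initial_ksp() - offsetof(struct task_model, t_kstack);
}
```

The first eight bytes pushed fill ax through dx in t_regs; everything pushed after that goes into t_kstack, and t_kstackm sits just past the end to catch an overflow.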

Save Remaining Registers

//
//      Save remaining registers
//
        push    %dx             // DX
        push    %cx             // CX
        push    %bx             // BX
        push    %ax             // AX

Since SS:SP is pointing into the _registers area of the current task struct, these four instructions now save the final four registers into the first four locations of that structure. Here's a recap of _registers to remind you:

struct _registers {
    /* SI offset                 0   2        4   6   8  10  12*/
    __u16       ax, bx, cx, dx, di, si, orig_ax, es, ds, sp, ss;
};

See how the AX-DX registers are saved in a negative offset from SI? That's because SS:SP points to the location of DI, so the push instructions do that.

All the registers have now been saved onto the current kernel stack. It's time to process the interrupt!
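The effect of the four pushes can be reproduced in C by treating the register area as raw words. This is a sketch only: `push_remaining` is a hypothetical helper, and the stack pointer is modeled as a word index that starts at the di member.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct regs_model {
    uint16_t ax, bx, cx, dx, di, si, orig_ax, es, ds, sp, ss;
};

/* Push four words downward from the di member, then view them as a struct. */
struct regs_model push_remaining(uint16_t ax, uint16_t bx,
                                 uint16_t cx, uint16_t dx)
{
    uint16_t words[sizeof(struct regs_model) / sizeof(uint16_t)] = {0};
    size_t sp = offsetof(struct regs_model, di) / sizeof(uint16_t);
    struct regs_model r;

    words[--sp] = dx;   /* push %dx */
    words[--sp] = cx;   /* push %cx */
    words[--sp] = bx;   /* push %bx */
    words[--sp] = ax;   /* push %ax */

    memcpy(&r, words, sizeof r);
    return r;
}
```

Pushing in the order DX, CX, BX, AX is what makes the values land in the ax, bx, cx, dx order of the declaration, with AX ending up on top of the stack.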

Process the interrupt

//
//      cs:[di] has interrupt number
//      
        movb    %cs:(%di),%al
        cmpb    $0x80,%al
        jne     updct
//
//      ----------PROCESS SYSCALL----------
//

Remember how DI was pointing to the interrupt number above? It's now used to load that number, which is compared to 80h, the interrupt number used for ELKS system calls (as opposed to a hardware interrupt). The last instruction jumps to updct if it's a hardware interrupt, but we're following the path of a system call in this overview.

Process the system call


        sti
        call    stack_check     // Check USER stack
        pop     %ax             // Get syscall function code
//
//      syscall(params...)
//
        call    syscall
        push    %ax             // syscall returns a value in ax

Now, just before we're ready to process the system call, interrupts are (finally) re-enabled. This could immediately allow a hardware interrupt to occur, which will come through the same code, but this time with _gint_count = 1. More on that later.

A call to stack_check is made, which checks to see if the application's stack has overwritten its allocated area.

Then, the originally passed AX value is popped from the kernel stack (notice how convenient it is that it is stored on the top of stack) and syscall is called, where AX is used to index an array of function pointers to the correct routine, since the system call number was loaded into AX by the application libC library routine.

When the system call returns, the returned value is pushed onto the kernel stack, which will end up being popped back into AX when the registers are restored.
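Indexing an array of function pointers by the system call number might look like the following C sketch. None of these names come from ELKS: the handler type, the two demo handlers, the table and the error value are all invented for illustration, and real handlers take their arguments from the other saved registers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical handler type; real ELKS handlers have other signatures. */
typedef int (*syscall_fn)(int arg);

int sys_demo_double(int arg) { return arg * 2; }
int sys_demo_negate(int arg) { return -arg; }

const syscall_fn demo_table[] = { sys_demo_double, sys_demo_negate };

#define DEMO_ENOSYS (-38)   /* illustrative error for an unknown call number */

/* Index the table with the system call number that arrived in AX. */
int dispatch_syscall(uint16_t ax, int arg)
{
    size_t count = sizeof demo_table / sizeof demo_table[0];

    if (ax >= count)
        return DEMO_ENOSYS;       /* never index past the table */
    return demo_table[ax](arg);
}
```

The bounds check matters because the number comes straight from the application and cannot be trusted.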

After the system call

//
//      Restore registers
//
        call    do_signal
        cli
        jmp     restore_regs
//
//      Done.
//

After the system call, do_signal is called, which handles calling an application signal handler if present. Interrupts are disabled and we jump to the restore registers routine.

Restoring the registers

//
//      Restore registers and return
//
restore_regs:
        decw    _gint_count
        pop     %ax
        pop     %bx
        pop     %cx
        pop     %dx
        pop     %di
        pop     %si
        pop     %bp             // discard orig_AX
        pop     %es
        pop     %ds
        pop     %bp             // SP
        pop     %ss
        mov     %bp,%sp
        pop     %bp             // user BP
//
//      Iret restores CS:IP and F (thus including the interrupt bit)
//
        iret

We're almost done! With interrupts off, the global general interrupt counter is decremented, and the registers are popped off the kernel stack (not in the order they were pushed, but in the order of the _registers struct we looked at previously).

The tricky part comes with the "pop %bp; pop %ss; mov %bp,%sp" instruction sequence. This reloads SS:SP with the interrupted stack segment and stack pointer values, performing a kernel-to-user stack switch.

Then, the saved BP which was pushed midway through the _irqit routine is restored, and finally an iret (interrupt return) is performed, which restores IP, CS and F, allowing the interrupted code to resume execution after the system call.
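The pop order can be replayed in C to confirm that it walks the register area from the lowest address upward. This is a sketch with invented names (`cpu_model`, `restore_regs_model`); the final "pop %bp" from the interrupted stack and the iret itself are not modeled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct regs_model {
    uint16_t ax, bx, cx, dx, di, si, orig_ax, es, ds, sp, ss;
};

/* Hypothetical CPU state holding the registers that restore_regs reloads. */
struct cpu_model {
    uint16_t ax, bx, cx, dx, di, si, bp, es, ds, sp, ss;
};

void restore_regs_model(struct cpu_model *cpu, const struct regs_model *r,
                        uint16_t *gint_count)
{
    uint16_t words[sizeof(struct regs_model) / sizeof(uint16_t)];
    unsigned i = 0;

    memcpy(words, r, sizeof words);   /* the kernel stack as raw words */
    (*gint_count)--;                  /* decw _gint_count */
    cpu->ax = words[i++];
    cpu->bx = words[i++];
    cpu->cx = words[i++];
    cpu->dx = words[i++];
    cpu->di = words[i++];
    cpu->si = words[i++];
    cpu->bp = words[i++];             /* orig_ax, discarded */
    cpu->es = words[i++];
    cpu->ds = words[i++];
    cpu->bp = words[i++];             /* saved SP, parked in BP */
    cpu->ss = words[i++];
    cpu->sp = cpu->bp;                /* mov %bp,%sp */
}
```

BP serves as scratch space twice here, which is safe only because its real value is waiting on the interrupted stack.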

A magnificent piece of code at the heart of ELKS!