09. Intro to Virtual Memory - josehu07/hux-kernel GitHub Wiki

With the basic skeleton, booting, and display device set up, we can now dive into building the heart of an OS kernel. We will follow the OSTEP book and build OS support for these aspects in the following order:

  1. Virtualization (first the memory, then the CPU)
    1. Virtualizing the memory: paging, heap memory allocator
    2. Virtualizing the CPU: user processes, scheduling
  2. Persistence: hard disk driver, file system
  3. Concurrency: multi-threading, synchronization

Starting from this chapter, we will implement a significant part of our kernel - memory management. One of the major responsibilities of an OS kernel is to provide a memory abstraction for user programs. Such abstraction should guarantee at least the following two properties:

  • Access control: allow us to describe specific permissions on different regions of the memory space
  • Isolation: memory operations issued by one process should not directly interfere with other processes (and the kernel, obviously)

Access control and isolation together ensure that, for example, an out-of-bound array access in a buggy browser won't crash your PDF reader or even the OS itself. The key to achieving such isolation is memory virtualization (the virtual memory system). This is not trivial, so this chapter involves a relatively intense explanation of the basic theory behind it, instead of diving directly into the code.

Main References of This Chapter

Scan through them before going forth:

Virtual Memory Basics

The goal of access control + isolation implies a split between the logical memory address space and the actual physical memory. We provide each user program the illusion that it possesses an idealized, contiguous, large piece of memory belonging only to itself - called its virtual memory address space. The OS kernel then implements a way to map the actually occupied memory regions of all programs onto physical memory (and to translate between the two).

Typically, virtual memory mapping is implemented through one of two approaches:

  • Segmentation
  • Paging

Why Not Direct Segmentation: External Fragmentation

You may have noticed that segmentation is one possible way of doing this. Consider an example where two user programs are running concurrently, each using 150 bytes of memory, on a machine with 550 bytes of physical memory (thanks Philipp for the nice figures!):

The two programs' logical 150 bytes get mapped to two different segments in physical memory. If a third program that wants 150 bytes of memory launches (which does not exceed the total remaining available memory), there may be no place to insert another contiguous 150-byte segment. The free spaces in between segments are called (external) fragments.

Processes may come and go at high frequency, and their memory usage patterns evolve over their lifetimes. With contiguous segmentation, we cannot avoid fragments. The kernel would have to implement compensating techniques such as kernel-level garbage collection (GC) to compact these memory segments and make room for other processes. GC normally requires pausing execution and must run periodically, resulting in lower performance and responsiveness from time to time.

Modern Solution: Paging

Paging is a commonly-used approach to memory virtualization with finer granularity, hence avoiding external fragmentation. We divide memory spaces into pages, i.e., 4 KiB chunks. Both the physical memory and user processes' virtual memory spaces are divided in this way. Conventionally, we call a physical 4 KiB chunk of memory a frame, and a virtual 4 KiB chunk of memory a page ✭.
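To make the granularity concrete, here is a minimal sketch of how a 4 KiB page size partitions an address into a page (or frame) number and an in-page offset. The macro names are hypothetical illustrations, not part of the Hux code:

```c
#include <stdint.h>

#define PAGE_SIZE 4096U     /* 4 KiB pages & frames */

/* Hypothetical helper macros: which 4 KiB chunk an address falls in,
   and the byte offset within that chunk. */
#define ADDR_PAGE_NUMBER(addr)  ((addr) / PAGE_SIZE)    /* page or frame index */
#define ADDR_PAGE_OFFSET(addr)  ((addr) % PAGE_SIZE)    /* offset inside the page */

/* e.g., address 0x00003ABC lies in page 3 at offset 0xABC. */
```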

For each user process, we keep a page table somewhere in kernel memory to store the mapping from the user's virtual addresses to physical memory addresses. (Thanks Philipp again for the nice figures!)

Though paging solves the problem of external fragmentation, it does introduce internal fragmentation - a user process that does not need a full 4 KiB of memory still has to occupy a whole 4 KiB slot.

Nevertheless, having internal fragments is still better than having external fragments, as they do not require a dedicated cleaner and are more predictable (half a page wasted per allocation on average).

On x86, the physical address of the currently active process's page table is stored in the control register CR3. Our OS is responsible for loading and updating this register correctly. The page table format is specified by the hardware architecture, and translation is done entirely by the CPU in hardware.
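For a rough feel of what "loading CR3" looks like, here is a minimal sketch using GCC inline assembly. The helper name is hypothetical (we are not adding this to the kernel in this chapter), and it assumes `pgdir_paddr` is the physical address of a hardware-formatted top-level page table:

```c
#include <stdint.h>

/* Hypothetical helper: point CR3 at a new top-level page table.
   `pgdir_paddr` must be the *physical* address of that table. */
static inline void
paging_switch_pgdir(uint32_t pgdir_paddr)
{
    asm volatile ("movl %0, %%cr3" : : "r" (pgdir_paddr) : "memory");
}
```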

Multi-level Page Tables

Memory operations happen very frequently, so the page table translation process needs to be blazingly fast. It would be rather inappropriate to implement a page table as a linked list or a similar dynamic data structure - querying a dynamic data structure is too slow. We have to implement a page table as a "plain table": the processor can then directly look up the i-th entry in the table to get the mapping info of the i-th page.

A user process can be granted a very large virtual memory address space, which is probably very sparsely mapped. Using a single-level page table wastes a lot of space, since even unmapped pages occupy an entry - on 32-bit x86 with 4 KiB pages, a flat table would need 2^20 entries (4 MiB with 4-byte entries) per process, no matter how little the process actually maps.

To alleviate this problem, we introduce multi-level paging. The basic idea is that we first divide the virtual memory space into larger memory regions consisting of many pages each. The top-level page table maps memory regions to handles of lower-level page tables. The bottom-level page tables eventually map pages to physical frames. Translating a page now takes slightly longer (as we need to do multiple lookups downstream), but it greatly saves space ✭.

By convention, we denote the lowest-level page tables as level 1 (L1). See this figure by Philipp as a two-level paging example:

In this example, memory regions starting from 10 000 up to 1 000 000 are completely free (all pages within them are free). Therefore, no L1 page tables are created for them.

We can extend this design to three-level or four-level page tables. With more levels, we save more space, but translation also takes longer. This is a typical tradeoff in OS design.
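On 32-bit x86 with two levels, a virtual address splits into a 10-bit page directory index, a 10-bit page table index, and a 12-bit in-page offset. The sketch below walks the tables in software purely for illustration - the real translation is done by the MMU in hardware - and assumes the kernel can dereference the physical addresses stored in the entries directly (e.g., through an identity mapping):

```c
#include <stdint.h>

#define PDE_INDEX(va)    (((va) >> 22) & 0x3FF)   /* top 10 bits: directory index */
#define PTE_INDEX(va)    (((va) >> 12) & 0x3FF)   /* next 10 bits: table index */
#define PAGE_OFFSET(va)  ((va) & 0xFFF)           /* low 12 bits: offset in page */

/* Illustrative software walk of a two-level x86 page table.
   Returns the physical address mapped to `vaddr`, or 0 if unmapped. */
static uint32_t
translate(uint32_t *pgdir, uint32_t vaddr)
{
    uint32_t pde = pgdir[PDE_INDEX(vaddr)];
    if (!(pde & 0x1))                       /* present bit clear: no L1 table */
        return 0;
    uint32_t *pgtab = (uint32_t *) (pde & 0xFFFFF000);
    uint32_t pte = pgtab[PTE_INDEX(vaddr)];
    if (!(pte & 0x1))                       /* page not mapped */
        return 0;
    return (pte & 0xFFFFF000) | PAGE_OFFSET(vaddr);
}
```

Each extra level means one more memory lookup per translation, which is exactly the space-vs-time tradeoff mentioned above.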

Translation Lookaside Buffer (TLB)

Every address a user process sees is virtual, so address translation happens for every user memory access. If every page table lookup went to memory, translation would be prohibitively slow, and multi-level paging makes it even slower with one memory access per level of lookup. To boost performance, the hardware memory management unit (MMU) uses a translation lookaside buffer (TLB). Think of it as an on-chip SRAM cache of recently translated pages.

On x86, the TLB is also hardware-managed, and the CPU interacts with it implicitly when paging is used as the memory management scheme. Yet the TLB is not fully transparent to the OS - our kernel must manually update it whenever it updates a page table (see the sketch after this list). This can be done by either:

  • Issuing an invlpg instruction to invalidate the entry for a single page
  • Reloading the CR3 register to simulate a complete address space switch, which flushes the TLB
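As a rough sketch of those two options (hypothetical helper names, not code we are adding in this chapter):

```c
#include <stdint.h>

/* Option 1: invalidate the TLB entry for one virtual page. */
static inline void
tlb_invalidate_page(uint32_t vaddr)
{
    asm volatile ("invlpg (%0)" : : "r" (vaddr) : "memory");
}

/* Option 2: reload CR3 with its current value, flushing all
   non-global TLB entries as if we switched address spaces. */
static inline void
tlb_flush_all(void)
{
    uint32_t cr3;
    asm volatile ("movl %%cr3, %0" : "=r" (cr3));
    asm volatile ("movl %0, %%cr3" : : "r" (cr3) : "memory");
}
```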

Progress So Far

This chapter is purely theoretical, so we are not adding any code to the kernel here. We will soon start concretizing these theories in the following chapters!

Current repo structure:

hux-kernel
├── Makefile
├── scripts
│   ├── gdb_init
│   ├── grub.cfg
│   └── kernel.ld
├── src
│   ├── boot
│   │   ├── boot.s
│   │   ├── elf.h
│   │   └── multiboot.h
│   ├── common
│   │   ├── debug.c
│   │   ├── debug.h
│   │   ├── port.c
│   │   ├── port.h
│   │   ├── printf.c
│   │   ├── printf.h
│   │   ├── string.c
│   │   ├── string.h
│   │   ├── types.c
│   │   └── types.h
│   ├── device
│   │   ├── keyboard.c
│   │   ├── keyboard.h
│   │   ├── timer.c
│   │   └── timer.h
│   ├── display
│   │   ├── terminal.c
│   │   ├── terminal.h
│   │   └── vga.h
│   ├── interrupt
│   │   ├── idt-load.s
│   │   ├── idt.c
│   │   ├── idt.h
│   │   ├── isr-stub.s
│   │   ├── isr.c
│   │   └── isr.h
│   ├── memory
│   │   ├── gdt-load.s
│   │   ├── gdt.c
│   │   └── gdt.h
│   └── kernel.c