Files and Memory Wiki
This wiki page offers a big picture of the file and memory system of Weenix and helps tie together all of the different pieces of the memory subsystem. This is still very much a living document, so if there is anything you feel is missing or could be better clarified, please do not hesitate to post on EdStem.
VFS
Vnodes and Files
What is a vnode?
- They are objects in the kernel that represent an object in the file system, and they have a variety of operations (the `vn_ops` field) that manipulate the data that they represent: reading, writing, linking, etc.
- They can represent files, directories, links, block devices, and character devices. Depending on what they represent, their vnode operations and the associated routines may vary.
What is the difference between a `vnode_t` and a `file_t`?
- A file is opened by a process by calling `open` (which will eventually call your `do_open` implementation in `vfs_syscall.c`). Within `do_open`, a `file_t` is set up in the process's file descriptor table. Each `file_t` has a pointer to its corresponding vnode.
- A `file_t` is specific to the process and is referenced by a file descriptor. Vnodes, on the other hand, are shared among processes.
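To make the relationship concrete, here is a minimal sketch of the idea, using simplified, made-up field names rather than Weenix's actual struct definitions:

```c
/* Simplified sketch -- not the actual Weenix definitions. */
typedef struct vnode_sketch {
    int vn_refcount;            /* how many file_t's (and other holders) reference this vnode */
    /* vn_ops, the underlying memory object, ... */
} vnode_sketch_t;

typedef struct file_sketch {
    int             f_refcount; /* how many file descriptors refer to this open file */
    long            f_pos;      /* per-open-file cursor into the file */
    vnode_sketch_t *f_vnode;    /* the shared, in-memory representation of the file itself */
} file_sketch_t;

/* Two processes that each open() the same file get their own file_sketch_t
 * (their own cursor and reference count), but both point at the same vnode_sketch_t. */
```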
Maintenance of vnodes
Why is reference counting necessary for vnodes?
- Reference counting is important for maintaining a count of all the places where a vnode is currently being used, since they are shared by many processes. Once the reference count of a vnode drops to 0, it can be cleaned up and its memory reused.
When should the reference count of a vnode be incremented?
- One place it should be incremented is when you are copying the vnode into a pointer that might outlive the original pointer. Incrementing the count also indicates that the vnode is in use.
How is a vnode's reference count different from that of a `file_t`?
- A `file_t`'s reference count indicates how many times that `file_t` has been referenced; in other words, the number of file descriptor entries that refer to it. The reference count of its corresponding vnode is incremented when the `file_t` is created and decremented when the `file_t`'s reference count drops to zero.
When should a vnode be locked?
- You should lock a vnode when invoking one of its operations. Note that when you lock a vnode, you aren't actually locking the vnode itself; rather, you are locking the memory object that underlies the vnode. This protects access to the memory object's list of page frames in the event that an operation blocks (e.g. when interacting with the disk).
S5FS
Inodes and Vnodes
How do vnodes relate to inodes?
- Inodes are the file system counterpart to the vnodes. They are specific to the file system and they are stored on disk (whereas vnodes are stored in kernel memory).
What are inodes?
- They represent files that are stored on disk, and inodes themselves are stored on disk. The inode keeps track of the data blocks associated with it using a list of block numbers (some stored in the inode itself (direct blocks) and some stored in a separate "indirect" block) that can point to blocks scattered throughout the disk.
Do the direct blocks and indirect block actually store the data that the inode represents?
- No, these are just block numbers that refer to where the data is stored on disk. You will have to call `s5_get_disk_block` and pass in the disk block number to get a page frame with the data.
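As a rough sketch of what that bookkeeping looks like (the constant and field names here are illustrative, not the actual S5FS definitions):

```c
#include <stdint.h>

#define NDIRECT 28 /* illustrative count of direct block slots */

/* Simplified sketch of an inode's block bookkeeping. */
typedef struct inode_sketch {
    uint32_t length;                 /* length of the file in bytes */
    uint32_t direct_blocks[NDIRECT]; /* disk block numbers of the first NDIRECT file blocks */
    uint32_t indirect_block;         /* disk block number of a block that itself holds more
                                        disk block numbers (covers the remaining file blocks) */
} inode_sketch_t;
```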
What are `s5_node_t`s? How do they contain the vnode and the inode?
- `s5_node_t`s contain both an inode and a vnode (and a flag indicating whether the inode is dirty) that correspond to the same filesystem object. See `fs_vnode_allocator` in `s5fs_mount`: it is initialized to allocate objects the size of an `s5_node_t`, which means that when a vnode is created in `__vget`, it will be the correct size. The inode and vnode of the `s5_node_t` are initialized within `s5_read_vnode`.
When should inodes be marked as dirty?
- Marking an inode as dirty means that any changes you have made will eventually be flushed to disk (on shutdown), but prior to that the disk remains unmodified (which means that if the system crashes, these changes will be lost). When you rerun Weenix (after halting), you should be able to see the changes that were made to the file system.
- The inode should only be marked dirty when the metadata of the file that it represents (e.g. the length of the file, the link count, the direct blocks, the indirect block number, etc.) has been modified, not when the data of the file has been modified.
- The changes made to the inode will be written to the page frame where the inode is stored when the file system's delete vnode routine is called, which happens when the vnode's reference count drops to 0 while the inode's link count is still greater than 0. If the inode's link count is 0, then the inode is freed instead.
Memory Objects
What are sparse blocks?
- Blocks of all zeros (i.e. that do not contain any data) are considered "sparse." In an inode's direct blocks or in the indirect block, they are denoted with a disk block number of 0.
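As a minimal sketch of how such an entry is interpreted (the function here is hypothetical and only illustrates the idea; actual disk access goes through `s5_get_disk_block` and the page cache):

```c
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 4096

/* Hypothetical helper: produce the contents of one file block given the disk
 * block number recorded in the inode's direct/indirect lists. */
void read_file_block(uint32_t disk_blocknum, char buf[BLOCK_SIZE])
{
    if (disk_blocknum == 0) {
        /* sparse: the block has no backing on disk and its contents are all zeros */
        memset(buf, 0, BLOCK_SIZE);
    } else {
        /* non-sparse: the data lives at this disk block number and must be read
         * from disk (in Weenix, via a page frame from the block device's mobj) */
    }
}
```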
What are `mobj_t`s?
- These are operating system abstractions for the memory cache. There are a couple different types of memory objects (see `mobj_type_t` in `mobj.h`):
  - Vnode (`MOBJ_VNODE`) → Used to cache the sparse blocks of files.
  - Anonymous (`MOBJ_ANON`) → Used for memory regions such as the stack or the heap which do not have a file on disk backing them.
  - Shadow (`MOBJ_SHADOW`) → Used for copy-on-write when implementing `fork`. The "backing" mobj is one of the other types of memory objects.
  - Block device (`MOBJ_BLOCKDEV`) → Used for caching blocks from the disk in memory.
- In S5FS, you will be working with the memory objects for vnodes and block devices. In VM, in addition to what is used in S5FS, you will also be using shadow objects and anonymous objects.
- The operations of `mobj`s include:
  - `get_pframe` → used to retrieve a page frame associated with a page number.
  - `fill_pframe` → a routine that initializes the contents of the page frame. This is called within `get_pframe` in the case that the page frame doesn't currently exist in memory.
  - `flush_pframe` → flushes the page frame to disk (or does nothing).
  - `destructor` → cleans up the memory object and flushes each page frame.
- These operations are initialized in `mobj_init`. In the case of vnodes, the `mo_ops` field is initialized with `vnode_mobj_ops`, whose operations are actually just wrappers for the associated `vn_ops` of the vnode. These `vn_ops` are initialized in the file system's `read_vnode` routine.
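As a simplified sketch of what this operations table looks like as a struct of function pointers (the exact signatures here are assumptions; see `mobj.h` for the real definitions):

```c
struct mobj;
struct pframe;

/* Simplified sketch of a memory object's operations table. */
typedef struct mobj_ops_sketch {
    long (*get_pframe)(struct mobj *o, unsigned long pagenum, long forwrite,
                       struct pframe **pfp);                 /* find or create a page frame */
    long (*fill_pframe)(struct mobj *o, struct pframe *pf);  /* initialize its contents */
    long (*flush_pframe)(struct mobj *o, struct pframe *pf); /* write it back, or do nothing */
    void (*destructor)(struct mobj *o);                      /* flush all page frames and clean up */
} mobj_ops_sketch_t;
```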
What are page frames? And how do they relate to memory objects?
- Each page frame is the same size as a block of data, which is 4096 bytes, and page frames are where the data of the memory object resides. A memory object maintains a linked list of page frames. The data of a page frame is referred to by its `pf_addr` field. Page frames are allocated on demand, so when a memory object is first created, its list of page frames is empty.
- The `pagenum` field within a page frame is the unique identifier for each page frame within a memory object: it can be the corresponding disk block number in the event that the page frame is non-sparse and represents data on disk, or it can be the file/data block number (for the sparse blocks of a file).
- pframes are retrieved with a call to `mobj_get_pframe`. In the event that the `get_pframe` operation for that particular type of `mobj` is `NULL`, then `mobj_default_get_pframe` is used and the corresponding `fill_pframe` function is called to fill the data (for instance, filling the data with all 0s, reading data from disk, or iterating down the shadow chain). For vnodes, this operation is `vnode_get_pframe`, which will propagate to a call to `s5fs_get_pframe`.
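Putting the fields mentioned above into a simplified sketch (the real `pframe_t` differs in detail):

```c
#include <stdint.h>

/* Simplified sketch of a page frame. */
typedef struct pframe_sketch {
    uint64_t pf_pagenum; /* identifier within the owning mobj: a disk block number for the
                            block device, a file block number for a vnode's sparse blocks,
                            or a page offset for anonymous/shadow objects */
    void    *pf_addr;    /* kernel virtual address of the 4096 bytes of data */
    int      pf_dirty;   /* set once the data is modified and must be flushed */
    /* plus a link into the mobj's list of page frames, a per-frame mutex, ... */
} pframe_sketch_t;
```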
What synchronization primitives are used with memory objects and page frames?
- Each memory object has a kmutex associated with it. This is to prevent concurrent access to the memory object: for instance, when a thread calls `fill_pframe`, which eventually leads to a disk operation, the thread will sleep and wait for the disk interrupt to occur. Another thread will run in the meantime and could potentially modify the memory object or access the same page frame.
- (There is a page frame mutex to allow for more concurrency, but that is not really supported currently in Weenix.)
When should a page frame be marked as dirty?
- When you are writing to it, i.e. when the `forwrite` flag is specified. Setting the `pf_dirty` field should already be handled in `mobj_default_get_pframe`.
How is a file block number translated into a disk block number? What is the difference between the two?
- A file block number indicates which block of the file you are referring to, counting from the start of the file. For instance, if you were at an offset of 5000 bytes into the file, you would be within file block number 1 (5000 / 4096 is equal to 1, using integer division). Because file blocks are not necessarily stored contiguously (one next to the other) on disk, the file block number needs to be translated into a disk block number before it can be accessed. To convert between the two, you will use the routine `s5_file_block_to_disk_block` (which itself makes use of the direct block numbers stored in the inode as well as its indirect block to look up the corresponding disk block; its behavior also differs depending on whether you are reading from or writing to the block).
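Here is the arithmetic from the example above as a small, self-contained program (the translation from file block to disk block is only indicated in a comment, since it depends on the inode's contents):

```c
#include <stdint.h>
#include <stdio.h>

#define BLOCK_SIZE 4096

int main(void)
{
    uint64_t offset     = 5000;                /* byte offset into the file */
    uint64_t file_block = offset / BLOCK_SIZE; /* 5000 / 4096 == 1 */
    uint64_t within     = offset % BLOCK_SIZE; /* 904 bytes into that file block */

    /* The file block number must still be translated into a disk block number,
     * using the inode's direct blocks / indirect block
     * (s5_file_block_to_disk_block in Weenix). */
    printf("file block %llu, byte %llu within it\n",
           (unsigned long long)file_block, (unsigned long long)within);
    return 0;
}
```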
How do memory objects relate to what is stored on disk?
- The disk on Weenix is represented as a block device, which is a device that accesses data in terms of blocks (in comparison to character devices, which transfer/receive a single character/byte at a time).
- The routines in `sata.c` (whether you have written them or they have been written for you) will eventually be called any time the disk is accessed. `sata_read_block` and `sata_write_block` read data from and write data to disk, and are called via `blockdev_fill_pframe` and `blockdev_flush_pframe`. Anything from the disk that is cached in RAM will be in the block device's memory object; in other words, any disk block that is not sparse will be cached in the block device's memory object. When Weenix is shutting down, all the data blocks that have been modified will be flushed to disk using `blockdev_flush_pframe`.
When are page frames flushed to disk?
- One place would be when page frames are written back to disk as the vnode is being destroyed in `vnode_destructor`. If `pf_dirty` is set, then the memory object's `flush_pframe` operation will be called. However, because sparse blocks (as stored in a page frame) cached in vnodes will not have the `pf_dirty` flag set, they will not be flushed. The block device's memory object is flushed when `s5fs_sync` is called within `s5fs_umount` (which is called in `vfs_shutdown`).
Difference between the file system's memory object and the vnode's memory object?
- The file system memory object will store non-sparse cached data blocks from disk. The vnode's memory object will store sparse page frames.
Here is an example call stack when `s5fs_get_file_block` is called:

*(call stack diagram)*
VM (1690/2670 only)
Vmareas and Memory Objects
What is the difference between a vmarea and a memory object?
- A vmarea, as the name suggests, represents an area of virtual memory. Instead of keeping track of addresses at byte granularity, vmareas do so at page granularity. This is because it's impractical to maintain permissions at the level of individual addresses.
- Each page represents 4096 bytes of memory.
- It keeps track of the starting and ending page, permissions of that particular area, and its corresponding memory object.
- The memory object field of the vmarea is a pointer to its memory object (actual backing for this region of addresses: file, device, the stack etc.)
- The memory objects that you'll be working with include shadow objects (used to facilitate copy-on-write with fork) and anonymous objects (used to back regions of memory such as the stack and heap). Both shadow and anonymous memory objects do not need to be flushed to disk.
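As a simplified sketch of the fields described above (the names mirror the ones used on this page, but the real `vmarea_t` differs in detail):

```c
#include <stddef.h>

struct mobj;

/* Simplified sketch of a virtual memory area. */
typedef struct vmarea_sketch {
    size_t       vma_start; /* first virtual page number of the region */
    size_t       vma_end;   /* one past the last virtual page number of the region */
    size_t       vma_off;   /* page offset into the memory object where the mapping begins */
    int          vma_prot;  /* read/write/execute permissions for the region */
    int          vma_flags; /* e.g. whether the mapping is shared or private */
    struct mobj *vma_obj;   /* the backing memory object: file, anonymous, shadow, ... */
} vmarea_sketch_t;
```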
How do the addresses that the memory objects use relate to userland addresses? Are the addresses that the kernel accesses virtual addresses? For instance, is `pf->pf_addr` a virtual or physical address?
- `pf_addr` is a virtual address. The processor is unaware of whether the addresses it operates on are virtual or physical: Weenix uses the memory management unit (MMU) and has paging enabled, which means that all addresses the processor accesses are translated behind the scenes using the TLB (and page tables). As you might notice, we never get a page fault for addresses in the kernel; this is because all kernel addresses are mapped into the page tables initially (and have corresponding physical addresses in RAM).
- After handling a page fault, you effectively have two different virtual addresses that correspond to the same physical address: `pf_addr` and the user-accessed virtual address. This way the kernel has a convenient means of referring to all of its memory, but is also able to have some of that memory mapped by user processes.
What is the point of using vmareas? How do they relate to pagetables?
- Each address space is composed of vmareas, and each vmarea represents (is associated with) a portion of the address space in which a memory object is mapped in a specific way; for instance, the corresponding memory object could have read and write permissions and be mapped as shared.
- This means that, for each virtual address, it can be determined which vmarea the address falls in, which memory object backs it, and how it is mapped.
- We use vmareas to access the memory object that corresponds to the virtual address being accessed (as seen in `handle_pagefault`). They provide a logical way of viewing the address space when a page fault occurs and the page table needs to be updated.
- This allows memory allocation to happen only when memory is accessed, instead of having to do so in advance. In other words, using vmareas allows the OS to allocate memory only on demand. Thus, a process can be allowed to `malloc` some huge amount of space (perhaps more than the physical amount available), but behind the scenes the actual physical memory will only be allocated once an address within this region is actually used.
What does it mean for a file to be mapped in the address space?
- In order to do this, the user process makes a call to `mmap`, which is a system call. This will eventually propagate to a call to `vmmap_map()`. `mmap` returns a pointer to the start of this region; thus, when a process accesses an address in this region, it is directly accessing (i.e. reading from and writing to) the file that is mapped there. In other words, we do not have to make read/write system calls in order to read or write to the file. Here is an example of this:
char *mapped_region = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 2 * 4096); // map one page of some already-open file fd, starting from file page 2 (zero indexing)
*mapped_region = 'a'; // this would modify the first byte of page 2 of the file
- The visibility of the changes made to the file, and whether they will be flushed or discarded, depends on the flags passed to `mmap`. If a file is mapped privately, then the changes will not be visible to other processes; if two processes map the same file privately, their changes are specific and private to each process and will not be flushed to the underlying file stored on disk.
Why exactly do you need an offset into a memory object? In other words, why is `vma_off` necessary?
- When you map a file into the address space, it isn't necessarily the case that the mapping starts from file offset 0. When accessing this file through a (virtual) address in this region, you need to figure out how far to offset into the file relative to the starting address of the vmarea.
- Here is an example: suppose a file is mapped starting from file page 5 (remember that each page contains 4096 bytes), and let's say that the virtual memory region that the file is being mapped into ranges from page number 2 to 5, exclusive (these are not real numbers). If an address `x` contained within virtual page 3 is accessed, then to calculate the page number of the page frame in the memory object that contains this address `x`, we must first subtract the starting virtual page number of this region, which tells us how far to "offset" into the memory object. Once we have determined this offset, because we mapped the file starting from page 5, the page number of the file being accessed is 5 + offset, which in this case is 5 + (3 - 2) = 6.
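Here is the same calculation as a small, self-contained sketch (the numbers are the made-up ones from the example above):

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12 /* pages are 4096 bytes, so page number == address >> 12 */

int main(void)
{
    uint64_t vma_start = 2; /* first virtual page of the region */
    uint64_t vma_off   = 5; /* the file is mapped starting at file page 5 */

    uint64_t vaddr   = 3ull << PAGE_SHIFT;          /* some address inside virtual page 3 */
    uint64_t vpn     = vaddr >> PAGE_SHIFT;         /* virtual page number: 3 */
    uint64_t pagenum = vma_off + (vpn - vma_start); /* 5 + (3 - 2) == 6 */

    printf("look up page %llu of the memory object\n", (unsigned long long)pagenum);
    return 0;
}
```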
How are virtual page numbers related to page frames?
- Virtual page numbers are virtual addresses that are bit shifted to the right by 12. Each virtual page number corresponds to a region of 4096 addresses. This page number is relative to the beginning of the address space.
- Page frame numbers, depending on what the backing store is, can be thought of as an "offset" into the memory object. This page frame number is relative to the beginning of the mapped object.
When are the vmmap operations being called? When are `vmmap_read` and `vmmap_write` being called?
- These operations are used when data needs to be copied from/to user space. Doing it this way (i.e. accessing the memory through the corresponding memory objects for each vmarea) avoids causing page faults for userland addresses while in the kernel. This is why, in system calls, you must use `copy_from_user` and `copy_to_user`.
Pagetables in Weenix
What is the `physmap` region of memory?
- The physmap region includes all of physical memory and provides a one-to-one mapping between virtual addresses and physical addresses (to get the physical address, an offset is subtracted from the virtual address).
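As a sketch of that arithmetic (the `PHYSMAP_BASE` value below is a stand-in, not the actual base address Weenix uses):

```c
#include <stdint.h>

#define PHYSMAP_BASE 0xffff800000000000ull /* stand-in for the physmap base address */

/* The physmap region maps all of physical memory at a fixed virtual offset,
 * so converting between the two is simple arithmetic. */
static inline uint64_t physmap_virt_to_phys(uint64_t virt) { return virt - PHYSMAP_BASE; }
static inline uint64_t physmap_phys_to_virt(uint64_t phys) { return phys + PHYSMAP_BASE; }
```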
What is initially mapped into pagetables?
- The kernel is mapped in initially as well as the physmap region (this effectively means that the kernel is mapped into memory twice).
- When a page table is created with `pt_create`, the page table is copied from the parent, excluding the user mappings (thus, when a child process is created with `fork`, it has a new page table that doesn't include the user mappings).
Memory allocation in Weenix
You might have noticed that when allocating a struct, for instance a `proc` or a `vnode`, the routine `slab_obj_alloc` is called, passing in an allocator. Each struct or object type in Weenix has a designated `slab_allocator_t` pointer. Each `slab_allocator_t` keeps track of a list of slabs (`sa_slabs`, which is a slab pointer), the object size, the order of magnitude of pages to allocate for each slab (where the slab's objects are stored), and the number of objects per slab.
- The object size includes the two REDZONE areas. The REDZONE areas are set to a known value so that memory overwrites can be detected.
Each slab keeps track of a link to the next slab, the number of objects in use in the slab, and a pointer to the free list of objects within the slab.
- Allocator → manages slabs
- Slab → manages the objects within a slab
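As a simplified sketch of that bookkeeping (the field names mirror the ones mentioned on this page, but the real definitions differ in detail):

```c
#include <stddef.h>

/* Simplified sketch of the slab allocator bookkeeping. */
typedef struct slab_sketch {
    struct slab_sketch *s_next;  /* link to the next slab owned by the allocator */
    int                 s_inuse; /* number of objects currently allocated from this slab */
    void               *s_free;  /* head of the free list of objects within this slab */
    void               *s_addr;  /* the pages holding the objects themselves */
} slab_sketch_t;

typedef struct slab_allocator_sketch {
    slab_sketch_t *sa_slabs;         /* list of slabs managed by this allocator */
    size_t         sa_objsize;       /* object size, including the two REDZONE areas */
    int            sa_order;         /* order of pages allocated for each slab */
    int            sa_objs_per_slab; /* how many objects fit in one slab */
} slab_allocator_sketch_t;
```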
`slab_obj_alloc`
- Iterates through the list of slabs to find one that still has free objects (i.e. number of objects in use < number of objects per slab)
- If there are no slabs available, then a new slab is allocated using `slab_obj_grow`
- If a slab does have free objects, then an object is removed from the free list and the slab's `s_free` pointer is updated to be the (formerly) free object's `sb_next`, which is stored in the object's bufctl
- The object's bufctl's `sb_slab` field is updated to be the slab
- The slab's number of objects in use is increased
- The pointer to the object is returned (which points past the REDZONE)
`slab_obj_free`
- The object's REDZONE is verified to be unmodified
- The object's slab is used to place the newly freed object at the head of the free list, updating the slab's `s_free` field
- The slab's `s_inuse` field is decremented to reflect the increase in the number of free objects
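The free-list handling in `slab_obj_alloc` and `slab_obj_free` boils down to popping from and pushing onto a singly linked list. Here is a simplified sketch that ignores the REDZONEs and stores the next pointer directly in each free object (rather than going through a bufctl):

```c
#include <stddef.h>

typedef struct freelist_slab {
    void *s_free;  /* head of the free list of objects in this slab */
    int   s_inuse; /* number of objects currently allocated from this slab */
} freelist_slab_t;

/* Pop an object off the slab's free list (the core step of slab_obj_alloc). */
void *sketch_alloc(freelist_slab_t *slab)
{
    void *obj = slab->s_free;
    if (obj) {
        slab->s_free = *(void **)obj; /* advance the head to the former head's next pointer */
        slab->s_inuse++;
    }
    return obj; /* NULL means this slab is full and a new slab must be grown */
}

/* Push a freed object back onto the head of the free list (the core step of slab_obj_free). */
void sketch_free(freelist_slab_t *slab, void *obj)
{
    *(void **)obj = slab->s_free;
    slab->s_free  = obj;
    slab->s_inuse--;
}
```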
`slab_obj_grow`
- First, allocate the pages for the slab
- For each object in the slab, initialize the free list pointers to create a singly linked list of the free objects
- For each object, initialize the REDZONEs
- Insert this newly allocated slab at the head of the allocator's slab list
`slab_allocator_create`
- Creates an instance of a `slab_allocator_t`
You might notice that some routines in Weenix use the function `kmalloc`. When `slab_init` is called, the first thing created is an allocator that manages slabs of `slab_allocator_t`s. Then, the kmalloc allocators are created for different sizes, ranging from 64 bytes to 262144 bytes.
`kmalloc`
- Iterates through the different sizes of kmalloc allocators, and once a large enough allocator has been found, `slab_obj_alloc` is called
- At the start of the allocated region, a pointer to the allocator that was used is stored
- The pointer that is returned points just past where the allocator pointer is stored
`kfree`
- The allocator pointer is retrieved, and then `slab_obj_free` is called, passing in the allocator and the address
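Here is a self-contained sketch of the "store the allocator pointer just before the returned region" scheme described above, using `malloc`/`free` and a made-up allocator type as stand-ins for the slab routines:

```c
#include <stdlib.h>

typedef struct fake_allocator { size_t size; } fake_allocator_t; /* stand-in allocator type */

/* kmalloc-style allocation: reserve room for the allocator pointer plus the
 * caller's data, record which allocator was used, and return the address
 * just past the recorded pointer. */
void *sketch_kmalloc(fake_allocator_t *chosen, size_t size)
{
    fake_allocator_t **mem = malloc(sizeof(fake_allocator_t *) + size);
    if (!mem) {
        return NULL;
    }
    *mem = chosen;  /* store the allocator pointer at the start of the region */
    return mem + 1; /* hand the caller the memory just past the stored pointer */
}

/* kfree-style free: step back over the stored pointer to recover the allocator,
 * then hand the whole region back to it. */
void sketch_kfree(void *addr)
{
    fake_allocator_t **mem = (fake_allocator_t **)addr - 1;
    fake_allocator_t *chosen = *mem; /* the allocator recorded by sketch_kmalloc */
    (void)chosen;                    /* the real code would return the object to this allocator */
    free(mem);
}
```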