Files and Memory Wiki
This wiki page offers a big picture of the file and memory system of Weenix and helps tie together all of the different pieces of the memory subsystem. This is still very much a living document, so if there is anything you feel is missing or could be better clarified, please do not hesitate to post on EdStem.
VFS
Vnodes and Files
What is a vnode?
- They are objects in the kernel that represent an object in the file system, and they have a variety of operations (the `vn_ops` field) that manipulate the data that they represent: reading, writing, linking, etc.
- They can represent files, directories, links, block devices, and character devices. Depending on what they represent, their vnode operations and the associated routines may vary.
What is the difference between a `vnode_t` and a `file_t`?
- A file is opened by a process by calling `open` (which will eventually call your `do_open` implementation in `vfs_syscall.c`). Within `do_open`, a `file_t` is set up in the process's file descriptor table. Each `file_t` has a pointer to its corresponding vnode.
- A `file_t` is specific to the process and is referenced by a file descriptor. Vnodes, on the other hand, are shared among processes.
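To make the relationship concrete, here is a minimal sketch of the idea, using simplified, made-up field names rather than Weenix's actual struct definitions:

```c
/* Simplified sketch -- not the actual Weenix definitions. */
typedef struct vnode_sketch {
    int vn_refcount;            /* how many file_t's (and other holders) reference this vnode */
    /* vn_ops, the underlying memory object, ... */
} vnode_sketch_t;

typedef struct file_sketch {
    int             f_refcount; /* how many file descriptors refer to this open file */
    long            f_pos;      /* per-open-file cursor into the file */
    vnode_sketch_t *f_vnode;    /* the shared, in-memory representation of the file itself */
} file_sketch_t;

/* Two processes that each open() the same file get their own file_sketch_t
 * (their own cursor and reference count), but both point at the same vnode_sketch_t. */
```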
Maintenance of vnodes
Why is reference counting necessary for vnodes?
- Reference counting is important for maintaining a count of all the places where a vnode is currently being used, since they are shared by many processes. Once the reference count of a vnode drops to 0, it can be cleaned up and its memory reused.
When should the reference count of a vnode be incremented?
- One place it should be incremented is when you are copying the vnode into a pointer that might outlive the original pointer. Incrementing the count also indicates that the vnode is in use.
How is a vnode's reference count different from that of a `file_t`?
- A `file_t`'s reference count indicates how many times that `file_t` has been referenced; in other words, the number of file descriptor entries that refer to it. The reference count of its corresponding vnode is incremented when the `file_t` is created and decremented when the `file_t`'s reference count drops to zero.
When should a vnode be locked?
- You should lock a vnode when invoking one of its operations. Note that when you lock a vnode, you aren't actually locking the vnode itself; rather, you are locking the memory object that underlies the vnode. This protects access to the memory object's list of page frames in the event that an operation blocks (e.g. when interacting with the disk).
S5FS
Inodes and Vnodes
How do vnodes relate to inodes?
- Inodes are the file system counterpart to the vnodes. They are specific to the file system and they are stored on disk (whereas vnodes are stored in kernel memory).
What are inodes?
- They represent files that are stored on disk, and inodes themselves are stored on disk. The inode keeps track of the data blocks associated with it using a list of block numbers (some stored in the inode itself (direct blocks) and some stored in a separate "indirect" block) that can point to blocks scattered throughout the disk.
Do the direct blocks and indirect block actually store the data that the inode represents?
- No, these are just block numbers that refer to where the data is stored on disk. You will have to call `s5_get_disk_block` and pass in the disk block number to get a page frame with the data.
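As a rough sketch of what that bookkeeping looks like (the constant and field names here are illustrative, not the actual S5FS definitions):

```c
#include <stdint.h>

#define NDIRECT 28 /* illustrative count of direct block slots */

/* Simplified sketch of an inode's block bookkeeping. */
typedef struct inode_sketch {
    uint32_t length;                 /* length of the file in bytes */
    uint32_t direct_blocks[NDIRECT]; /* disk block numbers of the first NDIRECT file blocks */
    uint32_t indirect_block;         /* disk block number of a block that itself holds more
                                        disk block numbers (covers the remaining file blocks) */
} inode_sketch_t;
```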
What are `s5_node_t`s? How do they contain the vnode and the inode?
- `s5_node_t`s contain both an inode and a vnode (and a flag indicating whether the inode is dirty) that correspond to the same filesystem object. See `fs_vnode_allocator` in `s5fs_mount`: it is initialized to allocate objects the size of an `s5_node_t`, which means that when a vnode is created in `__vget`, it will be the correct size. The inode and vnode of the `s5_node_t` are initialized within `s5_read_vnode`.
When should inodes be marked as dirty?
- Marking an inode as dirty means that any changes you have made will eventually be flushed to disk (on shutdown), but prior to that the disk remains unmodified (which means that if the system crashes, these changes will be lost). When you rerun Weenix (after halting), you should be able to see the changes that were made to the file system.
- The inode should only be marked dirty when the metadata of the file that it represents (e.g. the length of the file, the link count, the direct blocks, the indirect block number, etc.) has been modified, not when the data of the file has been modified.
- The changes made to the inode will be written to the page frame where the inode is stored when the file system's delete vnode routine is called, which happens when the vnode's reference count drops to 0 while the inode's link count is still greater than 0. If the inode's link count is 0, then the inode is freed instead.
Memory Objects
What are sparse blocks?
- Blocks of all zeros (i.e. that do not contain any data) are considered "sparse." In an inode's direct blocks or in the indirect block, they are denoted with a disk block number of 0.
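As a minimal sketch of how such an entry is interpreted (the function here is hypothetical and only illustrates the idea; actual disk access goes through `s5_get_disk_block` and the page cache):

```c
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 4096

/* Hypothetical helper: produce the contents of one file block given the disk
 * block number recorded in the inode's direct/indirect lists. */
void read_file_block(uint32_t disk_blocknum, char buf[BLOCK_SIZE])
{
    if (disk_blocknum == 0) {
        /* sparse: the block has no backing on disk and its contents are all zeros */
        memset(buf, 0, BLOCK_SIZE);
    } else {
        /* non-sparse: the data lives at this disk block number and must be read
         * from disk (in Weenix, via a page frame from the block device's mobj) */
    }
}
```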
What are `mobj_t`s?
- These are operating system abstractions for the memory cache. There are a couple different types of memory objects (see `mobj_type_t` in `mobj.h`):
  - Vnode (`MOBJ_VNODE`) → Used to cache the sparse blocks of files.
  - Anonymous (`MOBJ_ANON`) → Used for memory regions such as the stack or the heap which do not have a file on disk backing them.
  - Shadow (`MOBJ_SHADOW`) → Used for copy-on-write when implementing `fork`. The "backing" mobj is one of the other types of memory objects.
  - Block device (`MOBJ_BLOCKDEV`) → Used for caching blocks from the disk in memory.
- In S5FS, you will be working with the memory objects for vnodes and block devices. In VM, in addition to what is used in S5FS, you will also be using shadow objects and anonymous objects.
- The operations of `mobj`s include:
  - `get_pframe` → used to retrieve a page frame associated with a page number.
  - `fill_pframe` → a routine that initializes the contents of the page frame. This is called within `get_pframe` in the case that the page frame doesn't currently exist in memory.
  - `flush_pframe` → flushes the page frame to disk (or does nothing).
  - `destructor` → cleans up the memory object and flushes each page frame.
- These operations are initialized in `mobj_init`. In the case of vnodes, the `mo_ops` field is initialized with `vnode_mobj_ops`, whose operations are actually just wrappers for the associated `vn_ops` of the vnode. These `vn_ops` are initialized in the file system's `read_vnode` routine.
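As a simplified sketch of what this operations table looks like as a struct of function pointers (the exact signatures here are assumptions; see `mobj.h` for the real definitions):

```c
struct mobj;
struct pframe;

/* Simplified sketch of a memory object's operations table. */
typedef struct mobj_ops_sketch {
    long (*get_pframe)(struct mobj *o, unsigned long pagenum, long forwrite,
                       struct pframe **pfp);                 /* find or create a page frame */
    long (*fill_pframe)(struct mobj *o, struct pframe *pf);  /* initialize its contents */
    long (*flush_pframe)(struct mobj *o, struct pframe *pf); /* write it back, or do nothing */
    void (*destructor)(struct mobj *o);                      /* flush all page frames and clean up */
} mobj_ops_sketch_t;
```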
What are page frames? And how do they relate to memory objects?
- Each page frame is the same size as a block of data, which is 4096 bytes, and page frames are where the data of the memory object resides. A memory object maintains a linked list of page frames. The data of a page frame is referred to by its `pf_addr` field. Page frames are allocated on demand, so when a memory object is first created, its list of page frames is empty.
- The `pagenum` field within a page frame is the unique identifier for each page frame within a memory object: it can be the corresponding disk block number in the event that the page frame is non-sparse and represents data on disk, or it can be the file/data block number (for the sparse blocks of a file).
- pframes are retrieved with a call to `mobj_get_pframe`. In the event that the `get_pframe` operation for that particular type of `mobj` is `NULL`, then `mobj_default_get_pframe` is used and the corresponding `fill_pframe` function is called to fill the data (for instance, filling the data with all 0s, reading data from disk, or iterating down the shadow chain). For vnodes, this operation is `vnode_get_pframe`, which will propagate to a call to `s5fs_get_pframe`.
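Putting the fields mentioned above into a simplified sketch (the real `pframe_t` differs in detail):

```c
#include <stdint.h>

/* Simplified sketch of a page frame. */
typedef struct pframe_sketch {
    uint64_t pf_pagenum; /* identifier within the owning mobj: a disk block number for the
                            block device, a file block number for a vnode's sparse blocks,
                            or a page offset for anonymous/shadow objects */
    void    *pf_addr;    /* kernel virtual address of the 4096 bytes of data */
    int      pf_dirty;   /* set once the data is modified and must be flushed */
    /* plus a link into the mobj's list of page frames, a per-frame mutex, ... */
} pframe_sketch_t;
```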
What synchronization primitives are used with memory objects and page frames?
- Each memory object has a kmutex associated with it. This is to prevent concurrent access to the memory object: for instance, when a thread calls `fill_pframe`, which eventually leads to a disk operation, the thread will sleep and wait for the disk interrupt to occur. Another thread will run in the meantime and could potentially modify the memory object or access the same page frame.
- (There is a page frame mutex to allow for more concurrency, but that is not really supported currently in Weenix.)
When should a page frame be marked as dirty?
- When you are writing to it, i.e. when the `forwrite` flag is specified. Setting the `pf_dirty` field should already be handled in `mobj_default_get_pframe`.
How is a file block number translated into a disk block number? What is the difference between the two?
- A file block number indicates which block of the file you are referring to, counting from the start of the file. For instance, if you were at an offset of 5000 bytes into the file, you would be within file block number 1 (5000 / 4096 is equal to 1, using integer division). Because file blocks are not necessarily stored contiguously (one next to the other) on disk, the file block number needs to be translated into a disk block number before it can be accessed. To convert between the two, you will use the routine `s5_file_block_to_disk_block` (which itself makes use of the direct block numbers stored in the inode as well as its indirect block to look up the corresponding disk block; its behavior also differs depending on whether you are reading from or writing to the block).
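Here is the arithmetic from the example above as a small, self-contained program (the translation from file block to disk block is only indicated in a comment, since it depends on the inode's contents):

```c
#include <stdint.h>
#include <stdio.h>

#define BLOCK_SIZE 4096

int main(void)
{
    uint64_t offset     = 5000;                /* byte offset into the file */
    uint64_t file_block = offset / BLOCK_SIZE; /* 5000 / 4096 == 1 */
    uint64_t within     = offset % BLOCK_SIZE; /* 904 bytes into that file block */

    /* The file block number must still be translated into a disk block number,
     * using the inode's direct blocks / indirect block
     * (s5_file_block_to_disk_block in Weenix). */
    printf("file block %llu, byte %llu within it\n",
           (unsigned long long)file_block, (unsigned long long)within);
    return 0;
}
```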
How do memory objects relate to what is stored on disk?
- The disk on Weenix is represented as a block device, which is a device that accesses data in terms of blocks (in comparison to character devices, which transfer/receive a single character/byte at a time).
- The routines in `sata.c` (whether you have written them or they have been written for you) will eventually be called any time the disk is accessed. `sata_read_block` and `sata_write_block` read data from and write data to disk, and are called via `blockdev_fill_pframe` and `blockdev_flush_pframe`. Anything from the disk that is cached in RAM will be in the block device's memory object; in other words, any disk block that is not sparse will be cached in the block device's memory object. When Weenix is shutting down, all the data blocks that have been modified will be flushed to disk using `blockdev_flush_pframe`.
When are page frames flushed to disk?
- One place would be when page frames are written back to disk as the vnode is being destroyed in `vnode_destructor`. If `pf_dirty` is set, then the memory object's `flush_pframe` operation will be called. However, because sparse blocks (as stored in a page frame) cached in vnodes will not have the `pf_dirty` flag set, they will not be flushed. The block device's memory object is flushed when `s5fs_sync` is called within `s5fs_umount` (which is called in `vfs_shutdown`).
Difference between the file system's memory object and the vnode's memory object?
- The file system memory object will store non-sparse cached data blocks from disk. The vnode's memory object will store sparse page frames.
Here is an example call stack when `s5fs_get_file_block` is called:

*(call stack diagram)*
VM (1690/2670 only)
Vmareas and Memory Objects
What is the difference between a vmarea and a memory object?
- A vmarea, as the name suggests, represents an area of virtual memory. Instead of keeping track of addresses at byte granularity, vmareas do so at page granularity. This is because it's impractical to maintain permissions at the level of individual addresses.
- Each page represents 4096 bytes of memory.
- It keeps track of the starting and ending page, permissions of that particular area, and its corresponding memory object.
- The memory object field of the vmarea is a pointer to its memory object (actual backing for this region of addresses: file, device, the stack etc.)
- The memory objects that you'll be working with include shadow objects (used to facilitate copy-on-write with fork) and anonymous objects (used to back regions of memory such as the stack and heap). Both shadow and anonymous memory objects do not need to be flushed to disk.
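As a simplified sketch of the fields described above (the names mirror the ones used on this page, but the real `vmarea_t` differs in detail):

```c
#include <stddef.h>

struct mobj;

/* Simplified sketch of a virtual memory area. */
typedef struct vmarea_sketch {
    size_t       vma_start; /* first virtual page number of the region */
    size_t       vma_end;   /* one past the last virtual page number of the region */
    size_t       vma_off;   /* page offset into the memory object where the mapping begins */
    int          vma_prot;  /* read/write/execute permissions for the region */
    int          vma_flags; /* e.g. whether the mapping is shared or private */
    struct mobj *vma_obj;   /* the backing memory object: file, anonymous, shadow, ... */
} vmarea_sketch_t;
```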
How do the addresses that the memory objects use relate to userland addresses? Are the addresses that the kernel accesses virtual addresses? For instance, is `pf->pf_addr` a virtual or physical address?
- `pf_addr` is a virtual address. The processor is unaware of whether the addresses it operates on are virtual or physical: Weenix uses the memory management unit (MMU) and has paging enabled, which means that all addresses the processor accesses are translated behind the scenes using the TLB (and page tables). As you might notice, we never get a page fault for addresses in the kernel; this is because all kernel addresses are mapped into the page tables initially (and have corresponding physical addresses in RAM).
- After handling a page fault, you effectively have two different virtual addresses that correspond to the same physical address: `pf_addr` and the user-accessed virtual address. This way the kernel has a convenient means of referring to all of its memory, but is also able to have some of that memory mapped by user processes.
What is the point of using vmareas? How do they relate to pagetables?
- Each address space is composed of vmareas, and each vmarea represents (is associated with) a portion of the address space in which a memory object is mapped in a specific way; for instance, the corresponding memory object could have read and write permissions and be mapped as shared.
- This means that, for each virtual address, it can be determined which vmarea the address falls in, which memory object backs it, and how it is mapped.
- We use vmareas to access the memory object that corresponds to the virtual address being accessed (as seen in `handle_pagefault`). They provide a logical way of viewing the address space when a page fault occurs and the page table needs to be updated.
- This allows memory allocation to happen only when memory is accessed, instead of having to do so in advance. In other words, using vmareas allows the OS to allocate memory only on demand. Thus, a process can be allowed to `malloc` some huge amount of space (perhaps more than the physical amount available), but behind the scenes the actual physical memory will only be allocated once an address within this region is actually used.
What does it mean for a file to be mapped in the address space?
- In order to do this, the user process makes a call to `mmap`, which is a system call. This will eventually propagate to a call to `vmmap_map()`. `mmap` returns a pointer to the start of this region; thus, when a process accesses an address in this region, it is directly accessing (i.e. reading from and writing to) the file that is mapped there. In other words, we do not have to make read/write system calls in order to read or write to the file. Here is an example of this:
char *mapped_region = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 2 * 4096); // map one page of some already-open file fd, starting from file page 2 (zero indexing)
*mapped_region = 'a'; // this would modify the first byte of page 2 of the file
- The visibility of the changes made to the file, and whether they will be flushed or discarded, depends on the flags passed to `mmap`. If a file is mapped privately, then the changes will not be visible to other processes; if two processes map the same file privately, their changes are specific and private to each process and will not be flushed to the underlying file stored on disk.
Why exactly do you need an offset into a memory object? In other words, why is `vma_off` necessary?
- When you map a file into the address space, it isn't necessarily the case that the mapping starts from file offset 0. When accessing this file through a (virtual) address in this region, you need to figure out how far to offset into the file relative to the starting address of the vmarea.
- Here is an example: suppose a file is mapped starting from file page 5 (remember that each page contains 4096 bytes), and let's say that the virtual memory region that the file is being mapped into ranges from page number 2 to 5, exclusive (these are not real numbers). If an address `x` contained within virtual page 3 is accessed, then to calculate the page number of the page frame in the memory object that contains this address `x`, we must first subtract the starting virtual page number of this region, which tells us how far to "offset" into the memory object. Once we have determined this offset, because we mapped the file starting from page 5, the page number of the file being accessed is 5 + offset, which in this case is 5 + (3 - 2) = 6.
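Here is the same calculation as a small, self-contained sketch (the numbers are the made-up ones from the example above):

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12 /* pages are 4096 bytes, so page number == address >> 12 */

int main(void)
{
    uint64_t vma_start = 2; /* first virtual page of the region */
    uint64_t vma_off   = 5; /* the file is mapped starting at file page 5 */

    uint64_t vaddr   = 3ull << PAGE_SHIFT;          /* some address inside virtual page 3 */
    uint64_t vpn     = vaddr >> PAGE_SHIFT;         /* virtual page number: 3 */
    uint64_t pagenum = vma_off + (vpn - vma_start); /* 5 + (3 - 2) == 6 */

    printf("look up page %llu of the memory object\n", (unsigned long long)pagenum);
    return 0;
}
```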
How are virtual page numbers related to page frames?
- Virtual page numbers are virtual addresses that are bit shifted to the right by 12. Each virtual page number corresponds to a region of 4096 addresses. This page number is relative to the beginning of the address space.
- Page frame numbers, depending on what the backing store is, can be thought of as an "offset" into the memory object. This page frame number is relative to the beginning of the mapped object.
When are the vmmap operations being called? When are `vmmap_read` and `vmmap_write` being called?
- These operations are used when data needs to be copied from/to user space. Doing it this way (i.e. accessing the memory through the corresponding memory objects for each vmarea) avoids causing page faults for userland addresses while in the kernel. This is why, in system calls, you must use `copy_from_user` and `copy_to_user`.
Pagetables in Weenix
What is the `physmap` region of memory?
- The physmap region includes all of physical memory and provides a one-to-one mapping between virtual addresses and physical addresses (to get the physical address, an offset is subtracted from the virtual address).
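As a sketch of that arithmetic (the `PHYSMAP_BASE` value below is a stand-in, not the actual base address Weenix uses):

```c
#include <stdint.h>

#define PHYSMAP_BASE 0xffff800000000000ull /* stand-in for the physmap base address */

/* The physmap region maps all of physical memory at a fixed virtual offset,
 * so converting between the two is simple arithmetic. */
static inline uint64_t physmap_virt_to_phys(uint64_t virt) { return virt - PHYSMAP_BASE; }
static inline uint64_t physmap_phys_to_virt(uint64_t phys) { return phys + PHYSMAP_BASE; }
```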
What is initially mapped into pagetables?
- The kernel is mapped in initially as well as the physmap region (this effectively means that the kernel is mapped into memory twice).
- When a page table is created with `pt_create`, the page table is copied from the parent, excluding the user mappings (thus, when a child process is created with `fork`, it has a new page table that doesn't include the user mappings).
Memory allocation in Weenix
You might have noticed that when allocating a struct, for instance a `proc` or a `vnode`, the routine `slab_obj_alloc` is called, passing in an allocator. Each struct or object type in Weenix has a designated `slab_allocator_t` pointer. Each `slab_allocator_t` keeps track of a list of slabs (`sa_slabs`, which is a slab pointer), the object size, the order of magnitude of pages to allocate for each slab (where the slab's objects are stored), and the number of objects per slab.
- The object size includes the two REDZONE areas. The REDZONE areas are set to a known value so that memory overwrites can be detected.
Each slab keeps track of a link to the next slab, the number of objects in use in the slab, and a pointer to the free list of objects within the slab.
- Allocator → manages slabs
- Slab → manages the objects within a slab
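As a simplified sketch of that bookkeeping (the field names mirror the ones mentioned on this page, but the real definitions differ in detail):

```c
#include <stddef.h>

/* Simplified sketch of the slab allocator bookkeeping. */
typedef struct slab_sketch {
    struct slab_sketch *s_next;  /* link to the next slab owned by the allocator */
    int                 s_inuse; /* number of objects currently allocated from this slab */
    void               *s_free;  /* head of the free list of objects within this slab */
    void               *s_addr;  /* the pages holding the objects themselves */
} slab_sketch_t;

typedef struct slab_allocator_sketch {
    slab_sketch_t *sa_slabs;         /* list of slabs managed by this allocator */
    size_t         sa_objsize;       /* object size, including the two REDZONE areas */
    int            sa_order;         /* order of pages allocated for each slab */
    int            sa_objs_per_slab; /* how many objects fit in one slab */
} slab_allocator_sketch_t;
```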
`slab_obj_alloc`
- Iterates through the list of slabs to find one that still has free objects (i.e. number of objects in use < number of objects per slab)
- If there are no slabs available, then a new slab is allocated using `slab_obj_grow`
- If a slab does have free objects, then an object is removed from the free list and the slab's `s_free` pointer is updated to be the (formerly) free object's `sb_next`, which is stored in the object's bufctl
- The object's bufctl's `sb_slab` field is updated to be the slab
- The slab's number of objects in use is increased
- The pointer to the object is returned (which points past the REDZONE)
`slab_obj_free`
- The object's REDZONE is verified to be unmodified
- The object's slab is used to place the newly freed object at the head of the free list, updating the slab's `s_free` field
- The slab's `s_inuse` field is decremented to reflect the increase in the number of free objects
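The free-list handling in `slab_obj_alloc` and `slab_obj_free` boils down to popping from and pushing onto a singly linked list. Here is a simplified sketch that ignores the REDZONEs and stores the next pointer directly in each free object (rather than going through a bufctl):

```c
#include <stddef.h>

typedef struct freelist_slab {
    void *s_free;  /* head of the free list of objects in this slab */
    int   s_inuse; /* number of objects currently allocated from this slab */
} freelist_slab_t;

/* Pop an object off the slab's free list (the core step of slab_obj_alloc). */
void *sketch_alloc(freelist_slab_t *slab)
{
    void *obj = slab->s_free;
    if (obj) {
        slab->s_free = *(void **)obj; /* advance the head to the former head's next pointer */
        slab->s_inuse++;
    }
    return obj; /* NULL means this slab is full and a new slab must be grown */
}

/* Push a freed object back onto the head of the free list (the core step of slab_obj_free). */
void sketch_free(freelist_slab_t *slab, void *obj)
{
    *(void **)obj = slab->s_free;
    slab->s_free  = obj;
    slab->s_inuse--;
}
```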
`slab_obj_grow`
- First, allocate the pages for the slab
- For each object in the slab, initialize the free list pointers to create a singly linked list of the free objects
- For each object, initialize the REDZONEs
- Insert this newly allocated slab at the head of the allocator's slab list
`slab_allocator_create`
- Creates an instance of a `slab_allocator_t`
You might notice that some routines in Weenix use the function `kmalloc`. When `slab_init` is called, the first thing created is an allocator that manages slabs of `slab_allocator_t`s. Then, the kmalloc allocators are created for different sizes, ranging from 64 bytes to 262144 bytes.
`kmalloc`
- Iterates through the different sizes of kmalloc allocators, and once a large enough allocator has been found, `slab_obj_alloc` is called
- At the start of the allocated region, a pointer to the allocator that was used is stored
- The pointer that is returned points just past where the allocator pointer is stored
`kfree`
- The allocator pointer is retrieved, and then `slab_obj_free` is called, passing in the allocator and the address
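Here is a self-contained sketch of the "store the allocator pointer just before the returned region" scheme described above, using `malloc`/`free` and a made-up allocator type as stand-ins for the slab routines:

```c
#include <stdlib.h>

typedef struct fake_allocator { size_t size; } fake_allocator_t; /* stand-in allocator type */

/* kmalloc-style allocation: reserve room for the allocator pointer plus the
 * caller's data, record which allocator was used, and return the address
 * just past the recorded pointer. */
void *sketch_kmalloc(fake_allocator_t *chosen, size_t size)
{
    fake_allocator_t **mem = malloc(sizeof(fake_allocator_t *) + size);
    if (!mem) {
        return NULL;
    }
    *mem = chosen;  /* store the allocator pointer at the start of the region */
    return mem + 1; /* hand the caller the memory just past the stored pointer */
}

/* kfree-style free: step back over the stored pointer to recover the allocator,
 * then hand the whole region back to it. */
void sketch_kfree(void *addr)
{
    fake_allocator_t **mem = (fake_allocator_t **)addr - 1;
    fake_allocator_t *chosen = *mem; /* the allocator recorded by sketch_kmalloc */
    (void)chosen;                    /* the real code would return the object to this allocator */
    free(mem);
}
```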