xv6Disk - ccc-sp/riscv2os GitHub Wiki
xv6: the Disk Driver and the virtio Protocol
References
- A detailed introduction to virtio and the new features in virtio 1.1
- virtio: Towards a De-Facto Standard For Virtual I/O Devices (PDF), Rusty Russell, IBM OzLabs
- What is QEMU? What is KVM? What is QEMU-KVM?
- KVM is the virtualization framework provided by the Linux kernel: the kernel itself acts as the hypervisor. KVM does not emulate any devices on its own; it only exposes a /dev/kvm interface, which the host uses mainly to create vCPUs, allocate guest address space, read and write vCPU registers, and run the vCPUs. In QEMU-KVM, KVM runs in kernel space while QEMU runs in user space, where it actually creates, emulates, and manages the various virtual devices; KVM plus QEMU together provide full server virtualization. QEMU-KVM therefore plays two roles: 1. virtualizing the CPU and memory (handled by KVM) and the I/O devices (handled by QEMU); 2. creating and managing those virtual devices (handled by QEMU).
Overview
An operating system such as xv6 runs inside a hypervisor such as qemu. The xv6 driver (virtio_disk.c) uses the virtio protocol to talk to the disk device that qemu emulates.
The driver and the device form a producer/consumer pair: the driver produces data and writes it into memory, and the device consumes it by reading that memory.
- Doorbell: the xv6 driver writes a specific memory-mapped register to ask the qemu device to go process the data blocks.
- Interrupt: when the qemu device is done, it raises an interrupt so that xv6 can pick up (consume) the result.
A producer/consumer arrangement needs a circular queue; in virtio this is called the virtqueue. Its data structures are listed below.
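Before the real structures, here is a minimal sketch (not part of xv6; the names ring, prod_idx, and cons_idx are made up for illustration) of the free-running-index style of circular queue that virtio uses: the producer keeps incrementing an index without wrapping it, and the consumer catches up by comparing its own index against the producer's, taking each slot at index % NUM.
#include <stdio.h>

#define NUM 8                        // ring size, a power of two (like xv6's NUM)

static int ring[NUM];                // the slots; the avail/used rings work this way
static unsigned short prod_idx = 0;  // free-running producer index (never wrapped)
static unsigned short cons_idx = 0;  // free-running consumer index

// producer: publish one item, then bump the index (compare avail->idx in xv6)
static void produce(int item) {
  ring[prod_idx % NUM] = item;  // write the slot first ...
  // (a real driver issues a memory barrier here, like __sync_synchronize())
  prod_idx += 1;                // ... then make it visible; note: not % NUM
}

// consumer: drain everything published so far (compare virtio_disk_intr in xv6)
static void consume(void) {
  while (cons_idx != prod_idx) {
    int item = ring[cons_idx % NUM];
    printf("consumed %d\n", item);
    cons_idx += 1;
  }
}

int main(void) {
  for (int i = 0; i < 5; i++)
    produce(i * 10);
  consume();                    // prints 0 10 20 30 40
  return 0;
}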
kernel/virtio.h
// a single descriptor, from the spec.
struct virtq_desc {
  uint64 addr;
  uint32 len;
  uint16 flags;
  uint16 next;
};
#define VRING_DESC_F_NEXT 1 // chained with another descriptor
#define VRING_DESC_F_WRITE 2 // device writes (vs read)

// the (entire) avail ring, from the spec.
struct virtq_avail {
  uint16 flags; // always zero
  uint16 idx; // driver will write ring[idx] next
  uint16 ring[NUM]; // descriptor numbers of chain heads
  uint16 unused;
};

// one entry in the "used" ring, with which the
// device tells the driver about completed requests.
struct virtq_used_elem {
  uint32 id; // index of start of completed descriptor chain
  uint32 len;
};

struct virtq_used {
  uint16 flags; // always zero
  uint16 idx; // device increments when it adds a ring[] entry
  struct virtq_used_elem ring[NUM];
};
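The disk struct below also embeds an array of struct virtio_blk_req request headers, which is not listed on this page. Per Section 5.2 of the virtio spec (and mirrored in xv6's kernel/virtio.h), the header of a block request looks roughly like this:
// the format of the first descriptor in a disk request.
// it is followed by two more descriptors: one for the data
// block and one for a one-byte status written by the device.
struct virtio_blk_req {
  uint32 type;     // VIRTIO_BLK_T_IN (read) or VIRTIO_BLK_T_OUT (write)
  uint32 reserved;
  uint64 sector;   // sector number on the disk, in 512-byte units
};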
kernel/virtio_disk.c
static struct disk {
  // the virtio driver and device mostly communicate through a set of
  // structures in RAM. pages[] allocates that memory. pages[] is a
  // global (instead of calls to kalloc()) because it must consist of
  // two contiguous pages of page-aligned physical memory.
  char pages[2*PGSIZE];

  // pages[] is divided into three regions (descriptors, avail, and
  // used), as explained in Section 2.6 of the virtio specification
  // for the legacy interface.
  // https://docs.oasis-open.org/virtio/virtio/v1.1/virtio-v1.1.pdf

  // the first region of pages[] is a set (not a ring) of DMA
  // descriptors, with which the driver tells the device where to read
  // and write individual disk operations. there are NUM descriptors.
  // most commands consist of a "chain" (a linked list) of a couple of
  // these descriptors.
  // points into pages[].
  struct virtq_desc *desc;

  // next is a ring in which the driver writes descriptor numbers
  // that the driver would like the device to process. it only
  // includes the head descriptor of each chain. the ring has
  // NUM elements.
  // points into pages[].
  struct virtq_avail *avail;

  // finally a ring in which the device writes descriptor numbers that
  // the device has finished processing (just the head of each chain).
  // there are NUM used ring entries.
  // points into pages[].
  struct virtq_used *used;

  // our own book-keeping.
  char free[NUM];  // is a descriptor free?
  uint16 used_idx; // we've looked this far in used[2..NUM].

  // track info about in-flight operations,
  // for use when completion interrupt arrives.
  // indexed by first descriptor index of chain.
  struct {
    struct buf *b;
    char status;
  } info[NUM];

  // disk command headers.
  // one-for-one with descriptors, for convenience.
  struct virtio_blk_req ops[NUM];

  struct spinlock vdisk_lock;
} __attribute__ ((aligned (PGSIZE))) disk;
How it works
- Initialization: xv6 allocates the virtqueue and tells qemu where it is in memory.
- Transport: virtio defines three transports: PCI (typical hosts), MMIO (embedded systems), and Channel I/O (rarely seen). The xv6 virtio code below uses the MMIO transport.
For the initialization steps, see Virtual I/O Device (VIRTIO) Version 1.1, Section 3.1 "Device Initialization", 3.1.1 "Driver Requirements: Device Initialization":
The driver MUST follow this sequence to initialize a device:
- Reset the device.
- Set the ACKNOWLEDGE status bit: the guest OS has noticed the device.
- Set the DRIVER status bit: the guest OS knows how to drive the device.
- Read device feature bits, and write the subset of feature bits understood by the OS and driver to the device. During this step the driver MAY read (but MUST NOT write) the device-specific configuration fields to check that it can support the device before accepting it.
- Set the FEATURES_OK status bit. The driver MUST NOT accept new feature bits after this step.
- Re-read device status to ensure the FEATURES_OK bit is still set: otherwise, the device does not support our subset of features and the device is unusable.
- Perform device-specific setup, including discovery of virtqueues for the device, optional per-bus setup, reading and possibly writing the device’s virtio configuration space, and population of virtqueues.
- Set the DRIVER_OK status bit. At this point the device is “live”.
If any of these steps go irrecoverably wrong, the driver SHOULD set the FAILED status bit to indicate that it has given up on the device (it can reset the device later to restart if desired). The driver MUST NOT continue initialization in that case.
The driver MUST NOT send any buffer available notifications to the device before setting DRIVER_OK.
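The status bits mentioned above are plain bit flags OR'ed into the device's status register. The values come from Section 2.1 of the spec; xv6's kernel/virtio.h defines matching VIRTIO_CONFIG_S_* constants, which appear in the init code below:
// device status bits (virtio spec, Section 2.1), OR'ed into the status register
#define VIRTIO_CONFIG_S_ACKNOWLEDGE  1   // guest OS has noticed the device
#define VIRTIO_CONFIG_S_DRIVER       2   // guest OS knows how to drive the device
#define VIRTIO_CONFIG_S_DRIVER_OK    4   // driver is ready; the device is "live"
#define VIRTIO_CONFIG_S_FEATURES_OK  8   // feature negotiation is complete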
In xv6, the virtio MMIO region is mapped starting at address 0x10001000:
kernel/memlayout.h
// 00001000 -- boot ROM, provided by qemu // the ROM the machine boots from
// 02000000 -- CLINT // Core-Local Interruptor (local/timer interrupts)
// 0C000000 -- PLIC // Platform-Level Interrupt Controller (external interrupts)
// 10000000 -- uart0 // UART: shows up on the host's console, i.e. the terminal
// 10001000 -- virtio disk // the virtio disk MMIO region
// 80000000 -- boot ROM jumps here in machine mode // execution continues here after boot
// -kernel loads the kernel here
// unused RAM after 80000000.
kernel/virtio_disk.c
// the address of virtio mmio register r.
#define R(r) ((volatile uint32 *)(VIRTIO0 + (r)))
// see [Virtual I/O Device (VIRTIO) Version 1.1 -- 4.2.2 MMIO Device Register Layout](https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virtio-v1.1-csprd01.html#x1-1460002)
void
virtio_disk_init(void)  // initialize this module
{
  uint32 status = 0;

  initlock(&disk.vdisk_lock, "virtio_disk");

  // is there really an MMIO virtio disk device here?
  if(*R(VIRTIO_MMIO_MAGIC_VALUE) != 0x74726976 ||
     *R(VIRTIO_MMIO_VERSION) != 1 ||
     *R(VIRTIO_MMIO_DEVICE_ID) != 2 ||
     *R(VIRTIO_MMIO_VENDOR_ID) != 0x554d4551){
    panic("could not find virtio disk");
  }

  // set the VIRTIO_MMIO status bits step by step
  status |= VIRTIO_CONFIG_S_ACKNOWLEDGE;
  *R(VIRTIO_MMIO_STATUS) = status;

  status |= VIRTIO_CONFIG_S_DRIVER;
  *R(VIRTIO_MMIO_STATUS) = status;

  // negotiate features: clear the feature bits the driver does not want
  uint64 features = *R(VIRTIO_MMIO_DEVICE_FEATURES);
  features &= ~(1 << VIRTIO_BLK_F_RO);
  features &= ~(1 << VIRTIO_BLK_F_SCSI);
  features &= ~(1 << VIRTIO_BLK_F_CONFIG_WCE);
  features &= ~(1 << VIRTIO_BLK_F_MQ);
  features &= ~(1 << VIRTIO_F_ANY_LAYOUT);
  features &= ~(1 << VIRTIO_RING_F_EVENT_IDX);
  features &= ~(1 << VIRTIO_RING_F_INDIRECT_DESC);
  *R(VIRTIO_MMIO_DRIVER_FEATURES) = features;

  // tell device that feature negotiation is complete.
  status |= VIRTIO_CONFIG_S_FEATURES_OK;
  *R(VIRTIO_MMIO_STATUS) = status;

  // tell device we're completely ready.
  status |= VIRTIO_CONFIG_S_DRIVER_OK;
  *R(VIRTIO_MMIO_STATUS) = status;

  *R(VIRTIO_MMIO_GUEST_PAGE_SIZE) = PGSIZE;

  // initialize queue 0.
  *R(VIRTIO_MMIO_QUEUE_SEL) = 0;
  uint32 max = *R(VIRTIO_MMIO_QUEUE_NUM_MAX);
  if(max == 0)
    panic("virtio disk has no queue 0");
  if(max < NUM)
    panic("virtio disk max queue too short");
  *R(VIRTIO_MMIO_QUEUE_NUM) = NUM;
  memset(disk.pages, 0, sizeof(disk.pages));
  *R(VIRTIO_MMIO_QUEUE_PFN) = ((uint64)disk.pages) >> PGSHIFT; // PFN: guest physical page number of the virtqueue

  // desc = pages -- num * virtq_desc
  // avail = pages + 0x40 -- 2 * uint16, then num * uint16
  // used = pages + 4096 -- 2 * uint16, then num * vRingUsedElem
  disk.desc = (struct virtq_desc *) disk.pages;
  disk.avail = (struct virtq_avail *)(disk.pages + NUM*sizeof(struct virtq_desc));
  disk.used = (struct virtq_used *) (disk.pages + PGSIZE);

  // all NUM descriptors start out unused (mark each one free).
  for(int i = 0; i < NUM; i++)
    disk.free[i] = 1;

  // plic.c and trap.c arrange for interrupts from VIRTIO0_IRQ.
}
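Note that the code above sets FEATURES_OK but never re-reads the status register, even though step 6 of the spec sequence requires it. A sketch of what such a check could look like, reusing the R() macro and the status variable from the function above (more recent xv6 sources contain a comparable re-read):
// hypothetical addition, not in the listing above: after setting
// FEATURES_OK, re-read the status register and verify the device
// accepted our feature subset (virtio spec, Section 3.1.1, step 6).
status |= VIRTIO_CONFIG_S_FEATURES_OK;
*R(VIRTIO_MMIO_STATUS) = status;
if(!(*R(VIRTIO_MMIO_STATUS) & VIRTIO_CONFIG_S_FEATURES_OK))
  panic("virtio disk FEATURES_OK unset");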
The required placement of disk.desc, avail, and used follows the legacy virtqueue layout (the original page illustrates it with a figure, omitted here): desc sits at the start of pages[], avail immediately follows the NUM descriptors in the first page, and used must begin on the next page boundary, which is why it is placed at pages + PGSIZE.
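Since the figure is not reproduced here, a small user-space sketch (assuming NUM is 8 and PGSIZE is 4096, which are xv6's values; the struct definitions are copied from virtio.h above) that prints the offsets of the three regions, matching the desc/avail/used assignments in virtio_disk_init():
#include <stdio.h>
#include <stdint.h>

#define NUM 8          // assumption: xv6 defines NUM as 8 in virtio.h
#define PGSIZE 4096    // assumption: the RISC-V page size xv6 uses

typedef uint16_t uint16;
typedef uint32_t uint32;
typedef uint64_t uint64;

struct virtq_desc { uint64 addr; uint32 len; uint16 flags; uint16 next; };
struct virtq_avail { uint16 flags; uint16 idx; uint16 ring[NUM]; uint16 unused; };
struct virtq_used_elem { uint32 id; uint32 len; };
struct virtq_used { uint16 flags; uint16 idx; struct virtq_used_elem ring[NUM]; };

int main(void) {
  // the same arithmetic virtio_disk_init() applies to disk.pages[]
  size_t desc_off  = 0;
  size_t avail_off = NUM * sizeof(struct virtq_desc);
  size_t used_off  = PGSIZE;   // the used ring starts on the second page

  printf("desc  at offset %zu, size %zu\n", desc_off,  NUM * sizeof(struct virtq_desc));
  printf("avail at offset %zu, size %zu\n", avail_off, sizeof(struct virtq_avail));
  printf("used  at offset %zu, size %zu\n", used_off,  sizeof(struct virtq_used));
  // for NUM == 8: the descriptor table is 128 bytes, avail fits right
  // after it in the first page, and used starts at the page boundary.
  return 0;
}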
Before each read or write, the driver allocates three free descriptors (chained through their next fields; they need not be contiguous) and then calls virtio_disk_rw():
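virtio_disk_rw() below relies on a few descriptor bookkeeping helpers (alloc_desc, free_desc, free_chain, alloc3_desc) that this page does not list. A sketch of what they look like in xv6's virtio_disk.c (the exact code differs slightly between xv6 versions):
// find a free descriptor, mark it non-free, return its index.
static int
alloc_desc()
{
  for(int i = 0; i < NUM; i++){
    if(disk.free[i]){
      disk.free[i] = 0;
      return i;
    }
  }
  return -1;
}

// mark a descriptor as free and wake up anyone waiting in virtio_disk_rw().
static void
free_desc(int i)
{
  if(i >= NUM)
    panic("free_desc 1");
  if(disk.free[i])
    panic("free_desc 2");
  disk.desc[i].addr = 0;
  disk.free[i] = 1;
  wakeup(&disk.free[0]);
}

// free a whole chain of descriptors, following the next links.
static void
free_chain(int i)
{
  while(1){
    int flag = disk.desc[i].flags;
    int nxt = disk.desc[i].next;
    free_desc(i);
    if(flag & VRING_DESC_F_NEXT)
      i = nxt;
    else
      break;
  }
}

// allocate three descriptors (they need not be contiguous).
// disk transfers always use three descriptors.
static int
alloc3_desc(int *idx)
{
  for(int i = 0; i < 3; i++){
    idx[i] = alloc_desc();
    if(idx[i] < 0){
      for(int j = 0; j < i; j++)
        free_desc(idx[j]);
      return -1;
    }
  }
  return 0;
}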
void
virtio_disk_rw(struct buf *b, int write)  // issue a virtio disk read or write
{
  uint64 sector = b->blockno * (BSIZE / 512);

  acquire(&disk.vdisk_lock);

  // the spec's Section 5.2 says that legacy block operations use
  // three descriptors: one for type/reserved/sector, one for the
  // data, one for a 1-byte status result.

  // allocate the three descriptors; sleep and retry until it succeeds.
  int idx[3];
  while(1){
    if(alloc3_desc(idx) == 0) {
      break;
    }
    sleep(&disk.free[0], &disk.vdisk_lock);
  }

  // format the three descriptors.
  // qemu's virtio-blk.c reads them.
  // see https://github.com/qemu/qemu/blob/master/hw/block/virtio-blk.c
  struct virtio_blk_req *buf0 = &disk.ops[idx[0]]; // the request header for this operation

  if(write)  // choose the request type from the write flag
    buf0->type = VIRTIO_BLK_T_OUT; // write the disk
  else
    buf0->type = VIRTIO_BLK_T_IN; // read the disk
  buf0->reserved = 0;
  buf0->sector = sector; // the sector to read or write

  // descriptor 0: the request header
  disk.desc[idx[0]].addr = (uint64) buf0;
  disk.desc[idx[0]].len = sizeof(struct virtio_blk_req);
  disk.desc[idx[0]].flags = VRING_DESC_F_NEXT;
  disk.desc[idx[0]].next = idx[1];

  // descriptor 1: the data buffer
  disk.desc[idx[1]].addr = (uint64) b->data;
  disk.desc[idx[1]].len = BSIZE;
  if(write)
    disk.desc[idx[1]].flags = 0; // device reads b->data
  else
    disk.desc[idx[1]].flags = VRING_DESC_F_WRITE; // device writes b->data
  disk.desc[idx[1]].flags |= VRING_DESC_F_NEXT;
  disk.desc[idx[1]].next = idx[2];

  // descriptor 2: the one-byte status
  disk.info[idx[0]].status = 0xff; // device writes 0 on success
  disk.desc[idx[2]].addr = (uint64) &disk.info[idx[0]].status;
  disk.desc[idx[2]].len = 1;
  disk.desc[idx[2]].flags = VRING_DESC_F_WRITE; // device writes the status
  disk.desc[idx[2]].next = 0;

  // record struct buf for virtio_disk_intr().
  b->disk = 1;
  disk.info[idx[0]].b = b;

  // tell the device the first index in our chain of descriptors.
  disk.avail->ring[disk.avail->idx % NUM] = idx[0];
  __sync_synchronize();

  // tell the device another avail ring entry is available.
  disk.avail->idx += 1; // not % NUM ...
  __sync_synchronize();

  *R(VIRTIO_MMIO_QUEUE_NOTIFY) = 0; // value is queue number

  // Wait for virtio_disk_intr() to say request has finished.
  while(b->disk == 1) { // b->disk == 1 means the disk still owns this buf
    sleep(b, &disk.vdisk_lock);
  }

  // the request is done; release the descriptor chain.
  disk.info[idx[0]].b = 0;
  free_chain(idx[0]);

  release(&disk.vdisk_lock);
}
At this point qemu performs the actual read or write. When it finishes, qemu raises an interrupt to xv6, and xv6 runs the virtio_disk_intr() function below.
void
virtio_disk_intr()  // virtio_disk_rw() issued a request; qemu interrupts xv6 when it completes
{
  acquire(&disk.vdisk_lock);

  // the device won't raise another interrupt until we tell it
  // we've seen this interrupt, which the following line does.
  // this may race with the device writing new entries to
  // the "used" ring, in which case we may process the new
  // completion entries in this interrupt, and have nothing to do
  // in the next interrupt, which is harmless.
  *R(VIRTIO_MMIO_INTERRUPT_ACK) = *R(VIRTIO_MMIO_INTERRUPT_STATUS) & 0x3;

  __sync_synchronize();

  // the device increments disk.used->idx when it
  // adds an entry to the used ring.
  while(disk.used_idx != disk.used->idx){
    __sync_synchronize();
    int id = disk.used->ring[disk.used_idx % NUM].id;

    if(disk.info[id].status != 0)
      panic("virtio_disk_intr status");

    struct buf *b = disk.info[id].b;
    b->disk = 0;   // disk is done with buf
    wakeup(b);     // the operation is complete: wake the process waiting on this buf

    disk.used_idx += 1;
  }

  release(&disk.vdisk_lock);
}
In this way xv6 and QEMU cooperate through the virtio protocol to carry out disk reads and writes.
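For a sense of where this driver sits in the stack, its only caller is the buffer cache. A simplified sketch, based on xv6's kernel/bio.c, of how bread() and bwrite() hand a struct buf to virtio_disk_rw() (bget() and the sleep-lock details are omitted here):
// Return a buf with the contents of the indicated block.
struct buf*
bread(uint dev, uint blockno)
{
  struct buf *b;

  b = bget(dev, blockno);   // find or allocate a cache slot (not shown)
  if(!b->valid) {
    virtio_disk_rw(b, 0);   // 0 = read: the device fills b->data
    b->valid = 1;
  }
  return b;
}

// Write b's contents to disk. The buf must be locked.
void
bwrite(struct buf *b)
{
  if(!holdingsleep(&b->lock))
    panic("bwrite");
  virtio_disk_rw(b, 1);     // 1 = write: the device reads b->data
}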
Of course, this is only xv6's low-level disk driver; the full xv6 file system consists of seven layers, described in the following articles: