QEMU execution flow - adava/DECAF-Selective GitHub Wiki

On this page, we walk through the code behind the high-level ideas explained on the QEMU page. We start from the QEMU “main” function. The main execution loop of QEMU lives in a function named cpu_exec, which is reached indirectly. For the x86 architecture, the call trace to cpu_exec starts from the pc_init_pci function, which is the init function of an abstraction object called QEMUMachine. After initialization, QEMU calls pc_init_pci from “main” through a machine->init call (line 3397, vl.c). The assignment of pc_init_pci as the init function of QEMUMachine happens even before “main” runs: QEMU defines a constructor function that is called automatically before “main” is entered. This constructor calls register_module_init to register the QEMUMachine objects together with their pc_init_pci function. Figure 3 shows the execution trace in the .init constructor of the main module. find_type finds the relevant abstraction struct, QEMUMachine in this case, and sets the function pointer.

QEMU works this way in order to stay as platform neutral as possible. Much of the initialization described above depends on configuration choices made at compile time, before QEMU is ever executed.

Figure 3. .init execution
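
The registration-before-main idea can be sketched in a few lines of C. This is only an illustration under stated assumptions: the Machine struct, register_machine, find_machine and pc_init_sketch are hypothetical stand-ins for QEMUMachine, the registration call, find_type and pc_init_pci, and the constructor attribute is a GCC/Clang extension.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for QEMUMachine: a name plus an init callback. */
typedef struct Machine {
    const char *name;
    int (*init)(void);
} Machine;

#define MAX_MACHINES 8
static Machine *machine_table[MAX_MACHINES];
static size_t machine_count;

/* Hypothetical registry call, playing the role of the registration step. */
static void register_machine(Machine *m)
{
    assert(machine_count < MAX_MACHINES);
    machine_table[machine_count++] = m;
}

/* Look a machine up by name; returns NULL when nothing was registered. */
static Machine *find_machine(const char *name)
{
    for (size_t i = 0; i < machine_count; i++) {
        if (strcmp(machine_table[i]->name, name) == 0) {
            return machine_table[i];
        }
    }
    return NULL;
}

/* Stand-in for pc_init_pci: the board init that would start the CPU loop. */
static int pc_init_sketch(void)
{
    return 42;
}

static Machine pc_machine = { "pc", pc_init_sketch };

/* GCC/Clang extension: this function runs before main() is entered. */
static void __attribute__((constructor)) pc_machine_register(void)
{
    register_machine(&pc_machine);
}
```

By the time main starts, find_machine("pc") already returns the registered object, so main only has to call its init pointer. This mirrors the machine->init call described above without main knowing which board was compiled in.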

Main function

The “main” function initializes several parameters and prepares the QEMU environment for execution. The “main” function of the emulator is in the vl.c file. The list of initialization actions is long, so we discuss only four important ones in this section. Figure 4 shows the state diagram for these four initializations. First, through the MODULE_INIT_MACHINE macro, QEMU calls the pc_machine_init function, which registers the functions that eventually lead to the execution of the QEMU main loop. Second, the configure_accelerator function accomplishes several tasks, among them allocating the tcg translation blocks, allocating the code cache in the form of an executable page, and planting the tcg_prologue function in memory. Third, QEMU calls cpu_exec_init_all, which, among other things, initializes the memory-mapped I/O regions. Finally, QEMU calls the init function of the QEMUMachine object, which is pc_init_pci. The calls made from this function lead to the main loop, which runs in a separate thread for the whole life cycle of QEMU.

Figure 4. Main initialization flow

QEMU main execution loop

The main loop starts executing after the init function of the QEMUMachine, pc_init_pci, is called. This call leads to the creation of a thread that runs the main loop. The thread creation trace is shown in Figure 5. The new thread executes the qemu_tcg_cpu_thread_fn function.

Figure 5. thread creation

qemu_tcg_cpu_thread_fn eventually calls cpu_exec, which contains the main execution loop. The execution trace leading to cpu_exec is depicted in Figure 6. In each iteration, cpu_exec does the following. First, it serves any pending exception or interrupt. If none is pending, the main loop within cpu_exec fetches the next basic block and executes it. Fetching and executing basic blocks (a basic block is a run of code without a branch or jump) involves several mechanisms. In the next subsections, we explain three of the most important ones: block translation, code generation and block chaining.

Figure 6. trace to cpu_exec

Block translation

The execution of guest instructions is based on translating the basic blocks of the guest executable. The first time a basic block is reached, QEMU translates it. Afterwards, if the same basic block is to run again (for instance on a second call to a function), QEMU returns a reference to the already translated block. The former is done by tb_find_slow and the latter by tb_find_fast. More precisely, cpu_exec only calls tb_find_fast, and tb_find_fast calls tb_find_slow if it cannot serve the request from the already translated blocks. The lookup goes through the Address Lookup Table (ALT), a hash table referenced by the tb_jmp_cache field of the CPUState object. tb_find_fast hashes the next PC value (the guest instruction pointer) and uses the hash as an index into tb_jmp_cache.
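
The fast-path and slow-path split can be illustrated with a small, self-contained sketch. It is not QEMU's code: the Cpu and Block types, the hash function and the linear search in find_slow are simplifying assumptions, and "translation" here only records the guest PC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define JMP_CACHE_BITS 4
#define JMP_CACHE_SIZE (1u << JMP_CACHE_BITS)
#define MAX_BLOCKS 32

/* Hypothetical, cut-down translation block: only the guest PC it covers. */
typedef struct Block {
    uint32_t pc;
} Block;

typedef struct Cpu {
    Block *jmp_cache[JMP_CACHE_SIZE]; /* fast path, indexed by hashed PC */
    Block blocks[MAX_BLOCKS];         /* every block translated so far */
    size_t nb_blocks;
    unsigned translations;            /* how many times we had to translate */
} Cpu;

static unsigned jmp_cache_hash(uint32_t pc)
{
    return pc & (JMP_CACHE_SIZE - 1); /* simplistic hash, an assumption */
}

/* Slow path: search all known blocks, translate only if none matches. */
static Block *find_slow(Cpu *cpu, uint32_t pc)
{
    Block *tb = NULL;
    for (size_t i = 0; i < cpu->nb_blocks; i++) {
        if (cpu->blocks[i].pc == pc) {
            tb = &cpu->blocks[i];
            break;
        }
    }
    if (tb == NULL) {
        assert(cpu->nb_blocks < MAX_BLOCKS);
        tb = &cpu->blocks[cpu->nb_blocks++];
        tb->pc = pc;            /* "translation" is just recording the PC */
        cpu->translations++;
    }
    cpu->jmp_cache[jmp_cache_hash(pc)] = tb; /* refill the fast cache */
    return tb;
}

/* Fast path: one hashed probe; fall back when the slot is empty or stale. */
static Block *find_fast(Cpu *cpu, uint32_t pc)
{
    Block *tb = cpu->jmp_cache[jmp_cache_hash(pc)];
    if (tb == NULL || tb->pc != pc) {
        tb = find_slow(cpu, pc);
    }
    return tb;
}
```

Two PCs that hash to the same slot evict each other from the fast cache, but the evicted block is still found by the slow path without being translated again.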

There are in fact two levels of caching for acceleration. The first level is in tb_find_fast, as explained above; it returns a reference to the cached machine code if the block is in the cache. The second level is in tb_find_slow. tb_find_slow caches the tcg code and serves the request from that cache when it can. This happens in lines 100 to 121 of cpu-exec.c. If the guest code has not been translated before, tb_find_slow starts the translation process by calling tb_gen_code. tb_gen_code first obtains a reference to an empty translation block. Then it calls get_page_addr_code, which returns a reference to the guest binary code. Next, tb_gen_code calls cpu_gen_code, which performs the tcg translation and writes the result into the translation block. Finally, tb_gen_code creates an entry in its cache table by calling tb_link_page with the PC address, the translation block pointer and a pointer to the physical page of the guest code.

The actual disassembly of instructions happens in the disas_insn function, which is reached through a series of calls originating from cpu_gen_code. cpu_gen_code calls gen_intermediate_code, which in turn calls disas_insn for disassembly. gen_intermediate_code also takes special cases such as exceptions, traps and interrupts into account and acts accordingly. The layout of a translation block is defined by struct TranslationBlock.

Code Generation

After disassembly and translation to tcg code, the host executable code is generated from the tcg IR. This happens in tcg_gen_code. tcg_gen_code calls tcg_gen_code_common, which reads the tcg instructions in a for loop and produces the machine code. More specifically, this function writes to the code cache pointed to by the code_ptr field of the TCGContext data structure. Only a few instructions, such as MOV, MOVI and CALL, are handled directly in tcg_gen_code_common. For the rest, tcg_gen_code_common calls tcg_reg_alloc_op. After some preprocessing, tcg_reg_alloc_op calls tcg_out_op, which inspects the operation and its operands and writes the binary code.
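
The shape of such an emission loop can be shown with a toy example. Everything in it is an assumption made for illustration: the two-instruction IR, the Context struct and the byte encodings (0xB0, 0xA0) are invented and are not real tcg opcodes or x86 machine code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { OP_MOVI, OP_ADD, OP_END } Opcode;   /* hypothetical IR */

typedef struct Op {
    Opcode opc;
    uint8_t dst;
    uint8_t arg;   /* immediate for OP_MOVI, source register for OP_ADD */
} Op;

typedef struct Context {
    uint8_t *code_ptr;   /* next free byte in the code cache */
    uint8_t *code_end;   /* one past the last usable byte */
} Context;

/* Append one byte to the code cache and advance code_ptr. */
static void out8(Context *s, uint8_t v)
{
    assert(s->code_ptr < s->code_end);
    *s->code_ptr++ = v;
}

/* Generic path, standing in for the tcg_reg_alloc_op / tcg_out_op pair. */
static void out_op(Context *s, const Op *op)
{
    out8(s, 0xA0);       /* invented encoding, not real x86 */
    out8(s, op->dst);
    out8(s, op->arg);
}

/* Walk the IR and emit bytes; returns the number of bytes written. */
static size_t gen_code(Context *s, const Op *ops)
{
    uint8_t *start = s->code_ptr;
    for (const Op *op = ops; op->opc != OP_END; op++) {
        switch (op->opc) {
        case OP_MOVI:            /* handled directly in the loop */
            out8(s, 0xB0);
            out8(s, op->dst);
            out8(s, op->arg);
            break;
        default:                 /* everything else goes to the helper */
            out_op(s, op);
            break;
        }
    }
    return (size_t)(s->code_ptr - start);
}
```

The point of the sketch is the control structure: a few opcodes are encoded inline in the loop, the rest are delegated, and every path writes through the same advancing code pointer.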

Block chaining and patching

QEMU executes one block of guest code at a time. After a block has executed on the host, control must be transferred to the next block; this is done through block chaining. With block chaining, we either jump directly to the next translated block or jump back to the emulation manager (the QEMU main loop). Initially, control usually returns to the QEMU main loop, but once each block has been seen, the blocks are chained together and we no longer have to return to the emulation manager.

Block execution

How a block is executed depends on the host architecture. If QEMU supports the host architecture, binary code is generated for the block and control must be transferred to that binary block. This is done through the tcg_qemu_tb_exec call in the cpu_exec function (see Figure 7). For an architecture-specific execution, this call leads to code_gen_prologue, which is essentially a call to the beginning of a binary code block. A binary code block, called a code buffer in the QEMU source, is composed of three parts: the prologue, the translated block and the epilogue (see Figure 7). Execution of a translation block starts with the prologue. The prologue does the following:

  • Saving the registers (including architecture specific registers such as xmm and mmx registers)
  • Adjusting the stack size
  • Setting the frame pointer
  • Transferring the control to the translation block

Figure 7. Translation block

When the block finishes executing, control is transferred to the epilogue. A jump to the epilogue is inserted at the end of each block. The end of a block is marked by a jump-like instruction such as a jump or a call. The tcg instruction that denotes the jump to the epilogue is INDEX_op_exit_tb. On seeing this instruction, QEMU inserts a jump to the epilogue as shown below:

	tcg_out_op (…){
		---
		switch(opc) {
		case INDEX_op_exit_tb:
		    tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_EAX, args[0]); //sina: the return value in EAX
		    tcg_out_jmp(s, (tcg_target_long) tb_ret_addr); //sina: note that inserting jump to Epilogue happens here
		    break;
		---
	}

tb_ret_addr in the above code is a pointer to the epilogue address. The epilogue does the following:

  • Adjusting back the stack size
  • Restoring the registers
  • Returning

After the epilogue executes, control is transferred back to the cpu_exec loop through the “ret” instruction. Note that the call to tcg_qemu_tb_exec, which is in effect a call to the prologue, stores the return address, so the “ret” instruction transfers control to the instruction following tcg_qemu_tb_exec. The prologue and the epilogue are emitted at the beginning of the code buffer during the initialization phase by the tcg_prologue_init function. This function calls tcg_target_qemu_prologue, which inserts both the prologue and the epilogue. The code is shown below. Note that a prologue and an epilogue are normally part of an architecture's calling convention and are generated automatically by the compiler for any compiled code. However, since QEMU performs inline translation and dynamic binary code generation, it must generate the prologue and the epilogue itself.

      static void tcg_target_qemu_prologue(TCGContext *s)
      {
          int i, frame_size, push_size, stack_addend;

          /* TB prologue */
          /* Reserve some stack space, also for TCG temps.  */
          push_size = 1 + ARRAY_SIZE(tcg_target_callee_save_regs);
          push_size *= TCG_TARGET_REG_BITS / 8;
          frame_size = push_size + TCG_STATIC_CALL_ARGS_SIZE +
              CPU_TEMP_BUF_NLONGS * sizeof(long);
          frame_size = (frame_size + TCG_TARGET_STACK_ALIGN - 1) & //aligning
              ~(TCG_TARGET_STACK_ALIGN - 1);
          stack_addend = frame_size - push_size;
          tcg_set_frame(s, TCG_REG_CALL_STACK, TCG_STATIC_CALL_ARGS_SIZE,
                        CPU_TEMP_BUF_NLONGS * sizeof(long));

          /* Save all callee saved registers.  */
          for (i = 0; i < ARRAY_SIZE(tcg_target_callee_save_regs); i++) {
              tcg_out_push(s, tcg_target_callee_save_regs[i]);
          }
          tcg_out_addi(s, TCG_REG_ESP, -stack_addend);
          tcg_out_mov(s, TCG_TYPE_PTR, TCG_AREG0, tcg_target_call_iarg_regs[0]);

          /* jmp *tb.  */
          tcg_out_modrm(s, OPC_GRP5, EXT5_JMPN_Ev, tcg_target_call_iarg_regs[1]);

          /* TB epilogue */
          tb_ret_addr = s->code_ptr;
          tcg_out_addi(s, TCG_REG_CALL_STACK, stack_addend);
          for (i = ARRAY_SIZE(tcg_target_callee_save_regs) - 1; i >= 0; i--) { //sina: Popping back the register contents.
              tcg_out_pop(s, tcg_target_callee_save_regs[i]);
          }
          tcg_out_opc(s, OPC_RET, 0, 0, 0);
      }
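
The frame size arithmetic in the prologue can be checked in isolation. The sketch below reproduces only the rounding and subtraction steps; the helper names align_up and stack_addend_for are ours, the input sizes are made-up example values rather than QEMU's actual constants, and reading the extra "1 +" slot as the return address pushed by the call is our interpretation.

```c
#include <assert.h>
#include <stddef.h>

/* Round size up to the next multiple of align (align must be a power of two). */
static size_t align_up(size_t size, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (size + align - 1) & ~(align - 1);
}

/* Frame layout in the style of the prologue; all inputs are example values. */
static size_t stack_addend_for(size_t saved_regs, size_t reg_bytes,
                               size_t call_args, size_t temps, size_t align)
{
    /* One extra slot, assumed here to be the return address. */
    size_t push_size = (1 + saved_regs) * reg_bytes;
    size_t frame_size = align_up(push_size + call_args + temps, align);
    /* The pushes already moved the stack pointer by push_size. */
    return frame_size - push_size;
}
```

For example, with 4 saved registers of 4 bytes each, 128 bytes of call arguments, 64 bytes of temporaries and 16-byte alignment, push_size is 20, the unaligned frame is 212 bytes, the aligned frame is 224 bytes, and the stack pointer is adjusted by the remaining 204 bytes.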
    

On the other hand, if QEMU has no code generator for the host architecture, the tcg instructions of the block are read one by one and executed by an interpreter. Unlike the former case, control is not transferred to a binary block; execution happens inside the tcg_qemu_tb_exec function in the tci.c file. Below is an example of how one instruction is executed:

unsigned long tcg_qemu_tb_exec(CPUState *cpustate, uint8_t *tb_ptr)
{	…
	for (;;) {
		…
		/* fetch the opcode of the next tcg instruction */
		TCGOpcode opc = tb_ptr[0];
		…
		switch (opc) {
			…
			case INDEX_op_mov_i32:
			    t0 = *tb_ptr++;
			    t1 = tci_read_r32(&tb_ptr);
			    tci_write_reg32(t0, t1);
			    break;
			…
		}
	}
}

Return to the emulation manager

As explained above, execution sometimes needs to return to QEMU. To return to QEMU after executing a block, a jump is added at the end of the block. This is done, among other places, in gen_eob, which is called at the end of code translation in gen_intermediate_code (see Block chaining and patching). Specifically, gen_eob is emitted after every instruction that marks the end of a block, for instance a jmp or a call instruction. The trace from cpu_exec to the important functions discussed in the previous few sections is shown in Figure 8.

Figure 8. main trace for block translation and interpretation

Block Chaining

After a block has been translated, the previous block is patched to jump directly to it. This is done in the tb_add_jump function, which is called after block translation in cpu_exec. tb_add_jump simply stores a pointer to the next block in the translation block object of the previous block.
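
The effect of chaining on the main loop can be seen in a toy model. This is a sketch under simplifying assumptions, not QEMU's implementation: the Block struct, add_jump, run_chain and run are hypothetical names, each block has a single successor, and following a pointer stands in for a patched machine-code jump.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Block Block;
struct Block {
    int next_pc;      /* guest PC this block branches to; -1 ends the run */
    Block *jmp_next;  /* chained successor, NULL until patched */
};

/* Stand-in for tb_add_jump: record the successor on the previous block. */
static void add_jump(Block *prev, Block *next)
{
    if (prev->jmp_next == NULL) {
        prev->jmp_next = next;
    }
}

/* Follow chained blocks until one has no patched successor. */
static Block *run_chain(Block *tb)
{
    while (tb->jmp_next != NULL) {
        tb = tb->jmp_next;
    }
    return tb;
}

/* Runs from blocks[pc]; returns how often control came back to this loop. */
static int run(Block *blocks, int pc)
{
    int returns = 0;
    Block *prev = NULL;
    assert(blocks != NULL);
    while (pc >= 0) {
        Block *tb = &blocks[pc];
        if (prev != NULL) {
            add_jump(prev, tb);   /* patch the block we just left */
        }
        prev = run_chain(tb);     /* may execute several chained blocks */
        pc = prev->next_pc;
        returns++;
    }
    return returns;
}
```

On a straight line of three blocks, the first run comes back to the loop after every block, while a second run over the same blocks comes back only once because the chain is already in place. A cyclic chain would never return in this sketch; real emulators need a way to break chains, which is left out here.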

Interrupt and exception handling

Faults, I/O requests and exceptions are all handled in the cpu_exec function. Note that this function is the entry point after every jump back to QEMU, for instance when a block reaches its end and its next block is not yet chained or translated. Pending I/O requests, faults and exceptions are handled at the beginning of this function. The for loop of this function starts with a setjmp, which ensures that every fault is redirected to the same loop through a longjmp. Then, before any translation, interpretation or guest execution, QEMU serves the pending exceptions and interrupts. Below is a highlight of the code:

int cpu_exec(CPUState *env)
{
    …
    for(;;) {
        if (setjmp(env->jmp_env) == 0) {
            /* if an exception is pending, we execute it here */
            if (env->exception_index >= 0) {
                if (env->exception_index >= EXCP_INTERRUPT) {
                    /* exit request from the cpu execution loop */
                    …
                } else {
                    …
                    do_interrupt(env);
                    env->exception_index = -1;
                }
            }
            …
            for(;;) {
                interrupt_request = env->interrupt_request;
                // logic to handle other interrupt types
                // block translation, interpretation, execution and chaining
            }
        }
    }
}

Note that all user exceptions are served via do_interrupt. The same does not hold for other interrupt types: depending on the interrupt type, cpu_loop_exit, do_smm_enter, do_interrupt_x86_hardirq or cpu_get_pic_interrupt may be called.
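
The setjmp and longjmp pattern described above can be exercised in a standalone sketch. It only mirrors the control flow: the Cpu struct, loop_exit, exec_block and run are hypothetical names, the "guest blocks" are just a counter, and the third block is made to fault with vector 14 (the x86 page-fault vector) purely as an example.

```c
#include <assert.h>
#include <setjmp.h>

typedef struct Cpu {
    jmp_buf jmp_env;
    int exception_index;   /* -1 means no exception is pending */
    int handled;           /* exceptions served so far */
    int steps;             /* guest "blocks" executed so far */
} Cpu;

/* Stand-in for cpu_loop_exit: unwind straight back to the setjmp point. */
static void loop_exit(Cpu *cpu)
{
    assert(cpu->exception_index >= 0);
    longjmp(cpu->jmp_env, 1);
}

/* Pretend to run one block; the third one faults. */
static void exec_block(Cpu *cpu)
{
    cpu->steps++;
    if (cpu->steps == 3) {
        cpu->exception_index = 14;   /* example fault vector */
        loop_exit(cpu);
    }
}

static void run(Cpu *cpu, int max_steps)
{
    for (;;) {
        if (setjmp(cpu->jmp_env) == 0) {
            /* Serve a pending exception before running more guest code. */
            if (cpu->exception_index >= 0) {
                cpu->handled++;
                cpu->exception_index = -1;
            }
            for (;;) {
                if (cpu->steps >= max_steps) {
                    return;
                }
                exec_block(cpu);
            }
        }
        /* Reached after a longjmp: loop around and call setjmp again. */
    }
}
```

All state that must survive the longjmp lives in the Cpu struct rather than in local variables of run, which keeps the sketch within what the C standard guarantees for setjmp. The fault raised deep inside exec_block is served at the top of the outer loop, just as the pending exception check in cpu_exec precedes any further execution.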

The reader should note that, as mentioned in the Hardware emulation section, QEMU instruments the ordinary interrupt handling mechanism of the operating system under analysis. Hence, most of the raised exceptions and faults are caused by the instrumented code. The functions used to generate exceptions and interrupts are raise_exception_err and raise_interrupt.

QEMU memory management and Page fault handling

In this subsection, we explain in detail one instance of QEMU hardware exception handling, namely page fault handling. Page fault management in QEMU is important for several reasons. First, since QEMU is a software emulator (or virtualizer), its page fault management differs from that of an operating system. Second, the tainting functionality of DECAF relies on the page fault management mechanism of QEMU.

QEMU internally implements data structures that mimic a TLB cache for the guest. This means there is another level of address virtualization on top of the host's own address virtualization. Because of this extra level, the memory addresses used by each instruction must be translated to host-resolvable addresses before the instruction executes. QEMU performs this address translation while executing an INDEX_op_qemu_[st|ld] IR operation. INDEX_op_qemu_[st|ld] are MIPS-like operations that load a value from a memory address into a CPU register or store a register value to memory. An INDEX_op_qemu_[st|ld] operation is inserted at translation time for every guest instruction that references a memory address. At code generation time, when QEMU sees an INDEX_op_qemu_[st|ld] instruction, it emits the code that converts the guest address to a host-resolvable address.
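
A software TLB of this kind can be sketched as a small direct-mapped table. The code below is an illustration only: the TlbEntry layout, the two-page "guest memory", and the names tlb_reset, tlb_refill and guest_to_host are assumptions for this sketch and do not reflect QEMU's actual structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GUEST_PAGE_BITS 12
#define GUEST_PAGE_SIZE (1u << GUEST_PAGE_BITS)
#define GUEST_PAGE_MASK (~(GUEST_PAGE_SIZE - 1u))
#define TLB_SIZE 8
#define TLB_INVALID 0xFFFFFFFFu   /* never equals a page-aligned address */

typedef struct TlbEntry {
    uint32_t guest_page;   /* page-aligned guest virtual address */
    uint8_t *host_page;    /* host memory backing that page */
} TlbEntry;

static TlbEntry tlb[TLB_SIZE];
static uint8_t backing[2][GUEST_PAGE_SIZE];  /* two fake guest pages */
static unsigned tlb_misses;

/* Must be called once before the first lookup. */
static void tlb_reset(void)
{
    for (size_t i = 0; i < TLB_SIZE; i++) {
        tlb[i].guest_page = TLB_INVALID;
        tlb[i].host_page = NULL;
    }
    tlb_misses = 0;
}

/* Slow path: consult the "page table" and refill the TLB slot.
   Returns 0 on success, -1 when the page is unmapped (a page fault). */
static int tlb_refill(uint32_t guest_page)
{
    size_t page_no = guest_page >> GUEST_PAGE_BITS;
    assert((guest_page & ~GUEST_PAGE_MASK) == 0);
    tlb_misses++;
    if (page_no >= 2) {
        return -1;
    }
    TlbEntry *e = &tlb[page_no & (TLB_SIZE - 1)];
    e->guest_page = guest_page;
    e->host_page = backing[page_no];
    return 0;
}

/* Translate a guest address to a host pointer; NULL signals a page fault. */
static uint8_t *guest_to_host(uint32_t addr)
{
    uint32_t page = addr & GUEST_PAGE_MASK;
    TlbEntry *e = &tlb[(addr >> GUEST_PAGE_BITS) & (TLB_SIZE - 1)];
    if (e->guest_page != page) {              /* TLB miss */
        if (tlb_refill(page) != 0) {
            return NULL;
        }
    }
    return e->host_page + (addr & ~GUEST_PAGE_MASK);
}
```

A hit costs one comparison and one addition, which is why the generated code can do the check inline and only call out to a helper on a miss.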

Example: to better understand the process, let us review an example. Addition, subtraction and comparison operations in x86 are encoded with opcodes within [0x00…0x3d]. QEMU uses the same logic to translate all of these opcodes (see the code snippet below from the disas_insn function).

case 0x00 ... 0x05:
case 0x08 ... 0x0d:
case 0x10 ... 0x15:
case 0x18 ... 0x1d:
case 0x20 ... 0x25:
case 0x28 ... 0x2d:
case 0x30 ... 0x35:
case 0x38 ... 0x3d:
    {
	...
        switch(f) {
		 case 0:
			gen_op(s, op, ot, opreg);
		 ...
		 case 1:
			gen_op(s, op, ot, opreg);
		 ...
	...
	}      

The trace from gen_op, which is responsible for translating the above opcodes, leads to the tcg_gen_qemu_ldst_op function, which inserts the INDEX_op_qemu_st16 operation into the translation block. Below is the trace to tcg_gen_qemu_ldst_op.

     disas_insn
     	gen_op
     	     gen_op_st_T0_A0
     		gen_op_st_v
     			tcg_gen_qemu_st16
     				tcg_gen_qemu_ldst_op     

Note that for a single atomic Add or Sub instruction that operates in memory mode, one INDEX_op_qemu_* instruction is inserted. Later, in the binary generation phase, whenever one of these operations is observed in the tcg block, a call to the corresponding tcg_out_taint_qemu_[ld|st] is issued, which, among other things, performs the address translation. tcg_out_taint_qemu_[ld|st] inserts binary code that either returns the host address corresponding to a guest address or raises a page fault. The call to the page fault handling function is inserted in the code via: tcg_out_calli(s, (tcg_target_long) qemu_ld_helpers[s_bits]);

The above referenced array is defined as:

     static void *qemu_ld_helpers[4] = {
         __ldb_mmu,
         __ldw_mmu,
         __ldl_mmu,
         __ldq_mmu,
     };     

__ldb_mmu cannot be found by searching the code because it is defined through a glue (token pasting) mechanism. The interested reader can see softmmu_template.h, line 94:

  /* handle all cases except unaligned access which span two pages */      
  DATA_TYPE REGPARM glue(glue(__ld, SUFFIX), MMUSUFFIX)(target_ulong addr,       
                                                  int mmu_idx)       
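
How token pasting produces one function per access size can be shown with a compact sketch. The two-level glue macro follows the usual C idiom for expanding arguments before pasting; the guest_ram array, the _sketch suffix and the little-endian load body are assumptions for illustration. In the sketch the template body is written out twice, whereas a header-based template would be included once per size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Two-level paste so that macro arguments are expanded before ## is applied. */
#define xglue(x, y) x ## y
#define glue(x, y) xglue(x, y)

static const uint8_t guest_ram[8] = { 0x11, 0x22, 0x33, 0x44 };

/* First "inclusion" of the template: expands to ldb_sketch(). */
#define SUFFIX b
#define DATA_TYPE uint8_t
static DATA_TYPE glue(glue(ld, SUFFIX), _sketch)(size_t addr)
{
    DATA_TYPE value = 0;
    assert(addr + sizeof(DATA_TYPE) <= sizeof(guest_ram));
    for (size_t i = 0; i < sizeof(DATA_TYPE); i++) {
        value |= (DATA_TYPE)((DATA_TYPE)guest_ram[addr + i] << (8 * i));
    }
    return value;
}
#undef SUFFIX
#undef DATA_TYPE

/* Second "inclusion" of the same body: expands to ldw_sketch(). */
#define SUFFIX w
#define DATA_TYPE uint16_t
static DATA_TYPE glue(glue(ld, SUFFIX), _sketch)(size_t addr)
{
    DATA_TYPE value = 0;
    assert(addr + sizeof(DATA_TYPE) <= sizeof(guest_ram));
    for (size_t i = 0; i < sizeof(DATA_TYPE); i++) {
        value |= (DATA_TYPE)((DATA_TYPE)guest_ram[addr + i] << (8 * i));
    }
    return value;
}
#undef SUFFIX
#undef DATA_TYPE
```

Neither ldb_sketch nor ldw_sketch appears literally in the source text, which is exactly why a plain text search for a pasted name such as __ldb_mmu comes up empty.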

In any case, the execution leads to the cpu_x86_handle_mmu_fault function at runtime. The full trace is:

__ldb_mmu()      
  tlb_fill()        
       cpu_x86_handle_mmu_fault()         

The last function sets env->exception_index inline. However, the actual exception is raised in tlb_fill by a call to raise_exception_err. The trace from this function leads to cpu_loop_exit, which in turn performs a longjmp back to cpu_exec. As mentioned at the beginning of this section, once cpu_exec regains control, it processes this exception through a call to do_interrupt. The trace to cpu_loop_exit is as follows:

tlb_fill
   raise_exception_err
        raise_interrupt
             cpu_loop_exit