A Brief Bit of Background

Threading is a term borrowed from the FORTH programming language. Without introducing too much terminology here, in traditional FORTH a thread is simply a list of addresses pointing either to machine code or to another list of addresses. There are four common threading models: direct, indirect, subroutine and token threading. If you're interested in the deeper inner workings of FORTH, I would strongly recommend Brad Rodriguez's Moving Forth series.
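To make the "list of addresses" idea concrete, here is a minimal sketch of direct threading written in C rather than FORTH: the thread is just an array of addresses (function pointers here) and a tiny inner interpreter walks it. The word names and the use of plain C functions are illustrative assumptions; a real FORTH keeps the instruction pointer in a register and chains words with a NEXT primitive instead of a loop.

```c
#include <stdio.h>

/* A "word" is a pointer to machine code (here, a C function).  A
 * thread is a NULL-terminated list of such pointers; the inner
 * interpreter simply calls each one in turn. */
typedef void (*word_t)(void);

static void emit_hello(void) { printf("hello "); }
static void emit_world(void) { printf("world\n"); }

/* The thread: a list of addresses pointing to machine code. */
static word_t thread[] = { emit_hello, emit_world, NULL };

static void inner_interpreter(word_t *ip) {
    while (*ip) {
        (*ip)();    /* execute the word this address points to */
        ip++;       /* advance the "instruction pointer"        */
    }
}

int main(void) {
    inner_interpreter(thread);
    return 0;
}
```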

Interpreters are a form of token threading where each opcode is read from instruction memory and the interpreter then branches to the subroutine for that opcode. This is the simplest form of emulation but seldom the fastest; the overhead of fetching, decoding and branching to each opcode's handler can be quite high and often takes more processing time than the instruction itself.
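A minimal token-threaded dispatch loop might look like the following sketch for a hypothetical 8-bit guest; the `cpu_t` structure, opcode values and handler names are invented for illustration. Note how every single instruction pays for the fetch, the table lookup and the indirect branch before any real work happens.

```c
#include <stdint.h>

/* Hypothetical 8-bit guest CPU state; names are illustrative only. */
typedef struct {
    uint16_t pc;
    int      halted;
    uint8_t  memory[65536];
} cpu_t;

typedef void (*handler_t)(cpu_t *cpu);

static void op_nop(cpu_t *cpu)  { (void)cpu; }
static void op_halt(cpu_t *cpu) { cpu->halted = 1; }

/* Token threading: the opcode byte is the token, used directly as an
 * index into a table of handler subroutines. */
static handler_t dispatch[256] = {
    [0x00] = op_nop,
    [0x76] = op_halt,
    /* ...one entry per implemented opcode... */
};

void interpret(cpu_t *cpu) {
    while (!cpu->halted) {
        uint8_t opcode = cpu->memory[cpu->pc++];  /* fetch                  */
        handler_t h = dispatch[opcode];           /* decode (table look-up) */
        if (!h) { cpu->halted = 1; continue; }    /* unimplemented opcode   */
        h(cpu);                                   /* branch to the handler  */
    }
}
```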

Recompilers avoid the threading overhead by translating blocks of code into native machine instructions, operating on the (mostly correct) assumption that code is executed more than once. Each block is recompiled from its entry point up to the first branch, where the new code is expected to return the next entry point to continue the process. Previously recompiled sections are then retrieved from a cache to avoid the overhead of retranslation.
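The caching side of that process can be sketched roughly as below. The `translate_block` function stands in for the actual code generator, which is where the real complexity of a recompiler lives; everything else here (names, hash-table size) is an assumption made purely for illustration.

```c
#include <stdint.h>
#include <stdlib.h>

/* A translated basic block: native code for the guest instructions
 * from entry_pc up to the first branch.  Executing it returns the
 * guest address to continue from. */
typedef uint32_t (*block_fn_t)(void);

typedef struct block {
    uint32_t      entry_pc;   /* guest entry point              */
    block_fn_t    native;     /* pointer to emitted native code */
    struct block *next;       /* hash-chain link                */
} block_t;

#define CACHE_BUCKETS 4096
static block_t *cache[CACHE_BUCKETS];

/* Placeholder for the code generator: emits native code for one
 * basic block starting at entry_pc.  Provided elsewhere. */
extern block_fn_t translate_block(uint32_t entry_pc);

static block_fn_t lookup_or_translate(uint32_t pc) {
    block_t *b = cache[pc % CACHE_BUCKETS];
    for (; b; b = b->next)
        if (b->entry_pc == pc)
            return b->native;          /* cache hit: reuse the translation */

    /* Cache miss: recompile the block and remember it. */
    b = malloc(sizeof *b);
    b->entry_pc = pc;
    b->native   = translate_block(pc);
    b->next     = cache[pc % CACHE_BUCKETS];
    cache[pc % CACHE_BUCKETS] = b;
    return b->native;
}

/* Main loop: run a block, get the next entry point, repeat. */
void run(uint32_t pc) {
    for (;;)
        pc = lookup_or_translate(pc)();
}
```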

However, they are not perfect.

First of all, recompiling code is expensive, taking far longer than simply interpreting it once. This overhead leads to micro-stutters (jitter, pun not intended) in the execution of code and worsens performance for rarely executed code. Recompilers can perform so poorly in these cases that most also implement a standard interpreter and only use recompilation when obvious loops or frequently used subroutines are identified.

Secondly, the code produced may be an order of magnitude larger than the original, with a proportional demand on memory. While memory is relatively cheap, this hurts cache locality and can degrade performance. The problem is exacerbated when the emulator recompiles the same region of memory several times over because of different entry points.

And finally, recompilers are quite complex, having to bridge the incongruities between processor architectures while attempting to emit the most compatible and, ideally, fastest code possible. Applying optimizations and attempting to mitigate the first two issues only complicates the process further.

Threaded JIT

I started with a brief introduction to the classic FORTH threading models. Token threading is the natural model for an interpreter, as the opcodes of the guest processor are equivalent to tokens. In most cases, these tokens are only partially resolved by the token threader and some limited opcode decoding is still performed within the subroutine. Direct, indirect and subroutine threading have generally been avoided.
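As a 68000-flavoured illustration of that partial resolution (the structure names and the single MOVE.L handler below are invented, not taken from any particular emulator), many distinct opcode words can share one handler, and the handler still extracts the register fields itself:

```c
#include <stdint.h>

/* Hypothetical 68000-style guest state, for illustration only. */
typedef struct {
    uint32_t d[8];        /* data registers                        */
    uint32_t pc;          /* program counter (word index)          */
    uint16_t code[32768]; /* guest instruction memory              */
} cpu_t;

typedef void (*handler_t)(cpu_t *cpu, uint16_t opcode);

/* One handler covers every MOVE.L Dn,Dn encoding; the register
 * fields are still decoded inside the subroutine. */
static void op_move_l_dn_dn(cpu_t *cpu, uint16_t opcode) {
    unsigned dst = (opcode >> 9) & 7;   /* bits 11..9: destination Dn */
    unsigned src =  opcode       & 7;   /* bits  2..0: source Dn      */
    cpu->d[dst] = cpu->d[src];
}

static handler_t dispatch[65536];

/* Many distinct opcode words map to the same partially-resolved
 * handler: MOVE.L Dsrc,Ddst is encoded as 0x2000 | dst << 9 | src. */
void register_move_l(void) {
    for (unsigned dst = 0; dst < 8; dst++)
        for (unsigned src = 0; src < 8; src++)
            dispatch[0x2000 | (dst << 9) | src] = op_move_l_dn_dn;
}

void step(cpu_t *cpu) {
    uint16_t opcode = cpu->code[cpu->pc++];
    dispatch[opcode](cpu, opcode);   /* remaining decode happens inside */
}
```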

Peter McGavin is the first I'm aware of to implement a threaded JIT, in his Z80 emulator. It works by creating a 64K-entry jump table with one entry for each address of the guest's instruction memory. Upon initialization, every entry points to the opcode lookup routine, which substitutes itself in place with the handler for the opcode found at that address before jumping to it directly. When execution reaches the same point a second time, the opcode handler runs immediately without reexamining the instruction memory.
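The scheme can be sketched in C roughly as follows. McGavin's original is not written like this, and the `opcode_handlers` table and structure layout below are assumptions for illustration; the key idea is only that the lookup routine rewrites its own jump-table entry the first time it runs at a given address.

```c
#include <stdint.h>

typedef struct cpu cpu_t;
typedef void (*handler_t)(cpu_t *cpu, uint16_t addr);

struct cpu {
    uint8_t   memory[65536];   /* guest instruction memory         */
    handler_t thread[65536];   /* one jump-table entry per address */
    uint16_t  pc;
};

/* Per-opcode handlers, filled in elsewhere; each handler is expected
 * to advance cpu->pc past the instruction it emulates. */
static handler_t opcode_handlers[256];

/* Every jump-table entry initially points here.  On first execution
 * it looks up the real handler for the opcode at this address,
 * substitutes it in place, then runs it. */
static void lookup(cpu_t *cpu, uint16_t addr) {
    handler_t h = opcode_handlers[cpu->memory[addr]];
    cpu->thread[addr] = h;   /* in-place substitution             */
    h(cpu, addr);            /* no re-lookup needed from now on   */
}

void cpu_init(cpu_t *cpu) {
    for (uint32_t i = 0; i < 65536; i++)
        cpu->thread[i] = lookup;
}

/* The main loop: after the first pass over a region of code, every
 * iteration jumps straight to the resolved opcode handler. */
void run(cpu_t *cpu) {
    for (;;)
        cpu->thread[cpu->pc](cpu, cpu->pc);
}
```

One consequence worth noting: if the guest modifies its own code or loads a new program, the affected entries have to be reset to the lookup routine so they are resolved again.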

Richard Carlsson, who adapted this approach in his own Z80 emulator, noted that the resulting speedup over a traditional emulator is about two-fold.

As with FORTH, in this model there is no longer a traditional interpreter; at best, the opcode lookup routine serves in its place. And while that routine has some overhead, instructions are still resolved and executed one at a time, eliminating the problem of jitter -- program timing becomes far more deterministic. Even if a threaded JIT is not as fast as a normal JIT, it is still quicker than interpretation.

Secondly, because each opcode is replaced with only a single subroutine call or address, there is a fixed "expansion ratio". Not only is this much smaller than the code generated by a recompiler, but a fixed ratio also means that branches and jumps are trivial to compute, and the emulator can freely jump to any address in memory whether it has been precompiled or not.
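For example, with one table entry per 16-bit guest instruction word, mapping a guest address to its slot is a single shift, so computing a branch target is trivial. The layout below is an assumption, and the handler signature is simplified compared with the earlier sketches:

```c
#include <stdint.h>

typedef void (*handler_t)(void);

/* Assuming one jump-table entry per 16-bit guest instruction word,
 * the "expansion ratio" is fixed, so mapping a guest address to its
 * table slot is simple arithmetic. */
static handler_t thread[65536 / 2];

static inline handler_t *slot_for(uint32_t guest_pc) {
    return &thread[guest_pc >> 1];   /* divide by the fixed ratio */
}

/* A guest branch or jump just redirects execution to the target's
 * slot.  If the target has never been executed, its entry still
 * holds the lookup routine and resolves itself on first use. */
void branch_to(uint32_t guest_pc) {
    (*slot_for(guest_pc))();
}
```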

Finally, compared to a traditional recompiler, a threaded JIT is trivial to implement. More on that later.

Despite outperforming interpreters and, in some cases, even recompilers, threaded JIT did not become popular and was largely forgotten. The overhead of a subroutine call and return on each instruction was too great, and the performance gain was not as impressive as that of a normal JIT -- especially when up against benchmarks that unfairly exaggerate the benefit of JIT. This simple model also scales poorly when a 32-bit system tries to emulate another 32-bit system, as there simply isn't enough memory for the jump table.