Internals: backtracking - troyp/jq GitHub Wiki

Introduction

The jq language is a generator/backtracking-type functional language. Expressions produce zero, one, or more results. Producing zero results is the same backtracking, and as a result the empty builtin triggers backtracking.

When backtracking, the VM unwinds the stack, freeing garbage as it goes, up to a saved point (a ”fork point“), restarting at that point. The resumed instruction knows that a backtrack has occurred and can do a variety of things as a result, such as: produce the next result in some low-level operation (e.g., EACH, for .[], INDEX for .[expr]). Many built-in generators work by catching backtracking and producing the next value until there are no more values to produce, in which case they backtrack.

Fundamental generators in jq:

  • the comma operator (FORK opcode)
  • .[] (iterate values of .) (EACH opcode)
  • range(init;upto) (RANGE opcode)
  • the path() builtin (PATH_BEGIN and PATH_END opcodes)
  • the last expression, the one that outputs values to the calling C program (RET opcode)

Generators can be implemented in terms of existing jq language functionality. For example, the new range(init;upto;by) builtin is a jq-coded function, using only:

  • the comma operator for generation
  • tail-call optimized recursion for iteration
  • sub-functions (yes, jq functions can define sub-functions!)

Deep dive

Programmers familiar with the internals of Schemes, LISPs, or more esoteric languages like Prolog or Icon, will probably understand easily how jq generators and backtracking work.

Programmers familiar with GCC's local function extension to the C language will also easily understand how jq generators and backtracking work.

For everyone else the key to understanding jq generators and backtracking is as follows:

  • Calling a function pushes a new call frame on the jq interpreted stack, much as one would expect in a typical C implementation
  • Returning from a function, however, is actually like a yield in Python and other such languages:
    • The RET instruction calls frame_pop(), which one would expect…
    • …but the frame_pop() function often does not pop anything from the stack, instead it changes the frame pointer to point to the previous frame.
    • The RET instruction then pushes a backtrack point onto the stack with stack_push(). The ”return“ v

The effect of this is that returning from a jq function leaves the callee's frame on the stack and pushes a backtrack point. Eventually the callee will be restarted by backtracking to its RET instruction, which will restore the current frame pointer to be the callee's. If the caller being yielded to needs to push more things onto the stack, these get pushed past the callee's frame. At some point the callee will have no more values to output, in which case it will backtrack. Backtracking is more similar to returning in C than is yielding.

Note that frames are back-linked in a way that corresponds to the jq program's pipes. The compiler counts the nesting level (number of pipes to cross) to get to a symbol such as a variable reference, and records this as a “level” in the bytecode produced for, e.g., a LOADV instruction. The interpreter then traverses that many frames to find the frame in which the binder for that symbol is to be found. Note that frames of callees are not in the caller's back-link list! This is how lexical binding works in jq then, and it is akin to deep binding in that it requires iteration to find symbols, with global-most symbol references requiring the most iteration to resolve. (The run-time could use something more like shallow binding, but the cost would be to copy, at frame-push time, all bound references that might be needed by the new frame's bytecode.)

A compilation to C with local functions would map the jq output/yield/return operation to a function call of a continuation, where a continuation is the function pointer of a local function that closes over local variables of its parent function. Backtracking would map to a normal C function return, or perhaps to calling a computed goto (another GCC extension to C).

Of course, a better compilation to low-level code would avoid using extended C local functions as those require a system call to permit execution of the stack page where the local function object is allocated. (The GCC local function implementation creates executable function objects on the stack in order to take their address; these are known as ”trampolines“. In principle C could have larger pointers for local function pointers such that calling such a function pointer does not require an executable trampoline on the stack, but in practice the type name for such function pointers would have to indicate that the pointer is a closure pointer, and C has not been extended to make this possible, not even by GCC, therefore trampolines are unavoidably necessary.) Instead a standard C function + callback argument could be used, or compilation to the LLVM IR could be used to bypass C altogether to produce even more optimized code.