Internals: block representation - troyp/jq GitHub Wiki

The parser (src/parser.y) emits a "block" representation of the parsed program. The compiler mutates this representation and then generates bytecode from it. The block representation of a jq program closely resembles the bytecode shown by the --debug-dump-disasm option to the jq executable, but it has more structure and carries more information. Work in progress may add a --debug-dump-block option to the jq executable that prints a JSON representation of a program's parsed and compiled block form, which could then be useful for tasks such as writing syntax highlighters.

The block representation of a jq program is a tree structure resembling an AST, but some syntactic information is lost. In general, every place in src/parser.y where gen_noop() is used to generate a block loses information. For example, . becomes a noop block. Note also that gen_noop() does not return a block containing a single NOOP instruction; it returns an empty block. That is why parser rule actions that produce a gen_noop() lose information: there is no inst object (see below) in which to preserve it.

The C type for this is block, a typedef of a very simple C struct:

struct inst;
typedef struct inst inst;
typedef struct block {
  inst* first;
  inst* last;
} block;
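
With this struct in hand, the empty block that gen_noop() returns can be pictured as a block whose two pointers are both null. The following is a minimal, self-contained sketch of that idea, not the code from src/compile.c; the names sketch_gen_noop and sketch_block_is_noop are hypothetical stand-ins chosen for illustration.

```c
#include <stddef.h>

/* Simplified stand-ins for the types described on this page. */
typedef struct inst inst;   /* left opaque, as in the declaration above */
typedef struct block {
  inst* first;
  inst* last;
} block;

/* Sketch of an empty block: no inst at all, so nothing to annotate. */
block sketch_gen_noop(void) {
  block b = {NULL, NULL};
  return b;
}

/* Sketch of a check for "this block contains no instructions". */
int sketch_block_is_noop(block b) {
  return b.first == NULL && b.last == NULL;
}
```

Because no inst is ever allocated here, there is nowhere to hang a source location or a symbol, which is the information loss described above.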

A block is a tuple pointing to the first and last instructions of an instruction chain. The struct inst type is opaque, private to src/compile.c, but we can describe it here, with some simplification, as a struct with these fields:

struct inst* next;
struct inst* prev;
opcode op;
struct {
  uint16_t intval;
  struct inst* target;
  jv constant;
  const struct cfunction* cfunc;
} imm;
struct locfile* locfile;
location source;
struct inst* bound_by;
char* symbol;
block subfn;   // used by CLOSURE_CREATE (body of function)
block arglist; // used by CLOSURE_CREATE (formals) and CALL_JQ (arguments)
struct bytecode* compiled;
int bytecode_pos; // position just after this insn

The next and prev fields are used to make the instructions a linked list.
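
To make the linked-list structure concrete, here is a minimal sketch of how two instruction chains might be concatenated and walked. It assumes a stripped-down inst with only the link fields and an integer opcode; the function names (sketch_inst_block, sketch_block_join, sketch_block_length) are hypothetical and the real helpers in src/compile.c may differ.

```c
#include <assert.h>
#include <stddef.h>

typedef struct inst {
  struct inst* next;
  struct inst* prev;
  int op;               /* stand-in for the real opcode type */
} inst;

typedef struct block {
  inst* first;
  inst* last;
} block;

/* Wrap a single, not yet linked instruction in a block. */
block sketch_inst_block(inst* i) {
  assert(i->next == NULL && i->prev == NULL);
  block b = {i, i};
  return b;
}

/* Concatenate two chains; either side may be the empty block. */
block sketch_block_join(block a, block b) {
  if (a.first == NULL) return b;
  if (b.first == NULL) return a;
  a.last->next = b.first;
  b.first->prev = a.last;
  block joined = {a.first, b.last};
  return joined;
}

/* Count instructions by following next pointers from first. */
size_t sketch_block_length(block b) {
  size_t n = 0;
  for (inst* i = b.first; i != NULL; i = i->next) n++;
  return n;
}
```

Keeping both first and last in the block means a join only touches the two instructions at the seam, regardless of how long either chain is.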

The op field is the instruction opcode.

The imm field is the instruction's immediate operand, if it expects one. Instructions with branches (e.g., JUMP, JUMP_F) point to the target of the branch via imm.target. intval is used for various internal purposes, such as recording how many frames back from a $varname reference its binding is found, so that at run-time the interpreter can search backwards through that many frames (note: these are not literally frames on the stack, but frames in a list on the stack). The other fields of imm are self-evident.
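
As an illustration of imm.target, the sketch below wires a branch instruction to a target block. Everything here is an assumption made for the example: the sk_ names and the three-opcode enum are hypothetical, and the choice to point at the last instruction of the target (so that the branch resolves to the position just after it) is this sketch's convention, which may or may not match src/compile.c exactly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of opcodes, for illustration only. */
typedef enum { SK_DUP, SK_JUMP, SK_JUMP_F } sk_opcode;

typedef struct inst {
  struct inst* next;
  struct inst* prev;
  sk_opcode op;
  struct {
    uint16_t intval;
    struct inst* target;
  } imm;
} inst;

typedef struct block {
  inst* first;
  inst* last;
} block;

/* Only branch opcodes are allowed to carry a target. */
int sk_is_branch(sk_opcode op) {
  return op == SK_JUMP || op == SK_JUMP_F;
}

/* Point a single branch instruction at the last instruction of target. */
void sk_set_target(block branch, block target) {
  assert(branch.first != NULL && branch.first == branch.last);
  assert(sk_is_branch(branch.first->op));
  assert(target.last != NULL);
  branch.first->imm.target = target.last;
}
```

Storing a pointer rather than a numeric offset lets the compiler keep rearranging the chain; a concrete address is only needed once positions have been resolved.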

The locfile and source fields refer to a source jq program and start/end byte offsets into that program.

The bound_by field points to an inst that this instruction will refer to. For example, a CALL_JQ will use this to point to the function that should be called.
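
A rough sketch of how such a binding pass could work: walk a body, and point every still-unbound instruction whose symbol matches the binder's symbol at that binder. This is a deliberately flat simplification under stated assumptions. The name sk_bind is hypothetical, and the sketch does not descend into subfn or arglist or check arity, all of which a real pass would presumably need to handle.

```c
#include <stddef.h>
#include <string.h>

typedef struct inst {
  struct inst* next;
  struct inst* prev;
  struct inst* bound_by;   /* NULL while the reference is unbound */
  const char* symbol;      /* NULL if the instruction has no name */
} inst;

typedef struct block {
  inst* first;
  inst* last;
} block;

/* Bind each unbound reference to binder->symbol found in body.
 * Returns the number of references that were bound. */
int sk_bind(inst* binder, block body) {
  int nrefs = 0;
  for (inst* i = body.first; i != NULL; i = i->next) {
    if (i->bound_by == NULL && i->symbol != NULL &&
        strcmp(i->symbol, binder->symbol) == 0) {
      i->bound_by = binder;
      nrefs++;
    }
  }
  return nrefs;
}
```

Skipping instructions that are already bound is what lets an inner definition shadow an outer one of the same name, provided the inner definition is bound first.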

The symbol field holds the name that this instruction implements, if any.

The subfn field is a block representation of a function's body, which includes its sub-functions, hence the field's name. The arglist field holds the formal parameters of a function definition (CLOSURE_CREATE) and the argument closure bodies of a function call (CALL_JQ).

The compiled field points to bytecode once the block has been compiled.

The bytecode_pos field holds the resolved address of the next instruction once this block has been compiled.
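
The "position just after this insn" convention can be sketched as a single pass over the chain that accumulates instruction sizes. The length field and the name sk_assign_positions are hypothetical, introduced only for this example; the real compiler derives each instruction's encoded size from its opcode.

```c
#include <stddef.h>

typedef struct inst {
  struct inst* next;
  struct inst* prev;
  int length;         /* hypothetical: encoded size of this instruction */
  int bytecode_pos;   /* position just after this insn */
} inst;

typedef struct block {
  inst* first;
  inst* last;
} block;

/* Assign positions in one pass; returns the total encoded length. */
int sk_assign_positions(block b) {
  int pos = 0;
  for (inst* i = b.first; i != NULL; i = i->next) {
    pos += i->length;       /* advance past this instruction first */
    i->bytecode_pos = pos;  /* so pos is the address of the next one */
  }
  return pos;
}
```

With this convention, the bytecode_pos of the last instruction in a block equals the address of whatever follows the block.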