Advanced Topics - troyp/jq GitHub Wiki

Scope of Symbols
Closures
Local functions
Tail-Call Optimization
Generators
Lazy Evaluation
Streams
Recursion and Tail Recursion Optimization

Scope of symbols

jq is scoped strictly lexically. Symbols are only visible to entities in their implementation or following them. In this example:

def foo(x): x * 2;
def bar(y): foo(y);
bar(.)  # main

foo is visible to bar and the main program, bar is visible to the main program, x is visible only within the implementation of foo, and y is visible only in the implementation of bar.

This is true of all symbols, including $variables. Functions can refer to themselves (otherwise jq could not support recursion).

$variables are visible only to expressions to the right of the |, and in the same function.

Variables defined with the --arg and --argfile command-line arguments are global, visible to all functions and the main program as long as they are not shadowed.

Shadowing is allowed. There is no operator by which to refer to a shadowed binding. This only matters when referring to --arg and --argfile globals or when referring to functions defined in other modules (in the module system introduced in jq 1.5).

There are no "dynamic" variables. There is no assignment to variables.

In LISP terms, jq is a LISP2 because it has separate namespaces for functions and values, but there is no need for FUNCALL/APPLY because there are no function values.

Closures

All functions in jq are closures. Top-level functions capture their lexical namespace, which consists of prevously defined functions only. Local functions capture their lexical namespace, which consists of all "visible" bindings (e.g, $foo) and all "visible" function names.

All function arguments in jq are actually functions whose bodies are the expressions used as the actual arguments. I.e., function arguments are closures; referring to a function argument calls the corresponding closure.

There are no closure values. For example, one can't have an array of closures, and . can never be a closure. Closures can only be passed to functions as function arguments (but not inputs).

Conversely, function arguments are not values.

Every expression is an invocation of a closure. E.g., 1+2 is an expression invoking + (_plus) on the outputs of the expressions 1 and 2 (which are, respectively, 1 and 2).

For example:

def foo(x): .a as $a | x.b * 2;
{"a":{"a":3, "b":4}} | .a as $a | foo($a)

This should output 8.

Here x is a closure, and in the one invocation of foo(x) the body of x is $a, which is the $a that was visible at the call site of foo. The $a binding made within foo is not visible in the body of the argument closure x.

Local Functions

jq supports local functions -- functions defined within functions and visible only within them (to functions and expressions following them).

Local functions are often appropriate when a function requires a private helper function.

Mutual recursion is only possible using local functions:

# Fails to compile:
def foo: bar;
def bar: foo;
foo

# This compiles:
def foo:
  def bar: foo;
  bar;
foo

Tail-call optimization

Tail-call optimization (TCO) is an optimization that allows some instances of recursion to not grow the execution stack. TCO makes it possible to build practical and efficient iterators using recursion.

Currently (jq 1.5rc1 and up) only functions with no arguments are amenable to tail-call optimization (TCO), which makes local functions useful for taking advantage of TCO.

Here is an example of TCO from jq's builtins (in jq 1.5rc1):

def range(init; upto; by):
  def _range:
    if (by > 0 and . < upto) or (by < 0 and . > upto) then ., ((.+by)|_range)
    else .
    end;
  if by == 0 then init
  else init|_range
  end
  | select((by > 0 and . < upto) or (by < 0 and . > upto)) ;

The local helper _range is needed to take advantage of TCO.

Notes

Function bodies, like the top-level program, consist of a single "statement" following function definitions.
Therefore any local function definitions must be placed ahead of the body of the function.
Local functions may themselves contain local functions.
The function arguments (which are always closures) of a function are visible to its local functions.
Shadowing of symbols is permitted.

Generators

All jq expressions consume an input value and may produce zero, one, or more output values. empty produces no values, for example. 1,2 produces two values. range(0;10) produces ten values.

Some expressions can only produce one value per input. For example, literals (e.g., 1, "hello"), and also ..

Other expressions can produce any number of values. For example, the range functions can produce many outputs.

Each output is passed along to the next pipeline stage as its input, or if the generating expression is the last expression, then it is printed on the standard output stream.

When jq outputs a value, it backtracks to the previous expression in the pipeline.

When an expression in a pipeline completes, it backtracks to the previous expression in the pipeline.

Repeat until a complete pipeline as applied to one input value is exhausted. Then the next input is read and the pipeline is applied to that next input.

Primitive generator expressions:

The comma operator: 1,2 produces two values, here 1 then 2. The expressions separated by , can each produce zero, one, or many values.
The .[] operator produces all the values in . (which can be an array or object).
The range builtin (enumerates integers in a range).
The path builtin (enumerates paths in . matching the argument expression).
Function calls, which via recursion can produce indefinitely many outputs.

Some noteworthy points about jq generators:

Generators generate all their values to completion. Generally speaking, generators can only be interrupted by throwing an error with error or by breaking out. But see below.
Function arguments, being closures, are like promises. An expression like f(g) does not first cause the outputs of g to be produced; instead, f is called with a closure whose body is g, and when f refers to its argument, it calls g (since that is what the argument expression does) to produce one or more outputs.
It's very easy to get unintended cartesian product behavior if one isn't careful.

Consider:

    range(0; 2)
    0
    1

    {"a": 1} | has("a")
    true

    {"a": 1} | has("a", "a")
    true
    true

In the second invocation of has, the argument is a closure with "a", "a" as its body, which produces "a" twice, so has() outputs twice as to whether the given string is a key in the input value.

Lazy Evaluation

Expressions are evaluated lazily in jq, but in jq 1.4 and earlier, all outputs of an expression must be generated (even if they are subsequently filtered out). That is, a first(exp) function in jq 1.4 can produce the first output of exp, as expected, but exp's outputs have to all be produced (and all but the first will be ignored). In jq 1.5rc1 one can define a first(exp) function (and jq does have such a builtin) that produces only the first output of exp and which does not need to internally also consume all other outputs of exp.

In jq1.5rc1+ there are builtins and new syntax (foreach, break) that allow one to interrupt generators. For example, the limit/2 builtin outputs only the first N outputs of a given expression:

limit(2; range(0;10))
0
1

limit/2 achieves this power courtesy of the special construct foreach. The range(0;10) expression outputs only twice before its evaluation is terminated. That is, jq expressions are evaluated lazily.

In jq 1.5rc1+, there are other constructs useful for harnessing generators, notably foreach and while, and utility functions like first(exp) and limit(n; exp).

This:

limit(1,2; range(0;10))

may be surprising. Its outputs are 0, 0, and 1.

Streams

jq and jq filters are designed to process streams of JSON entities. A stream, however, is itself not a JSON entity, nor is it a jq value type.

Expressions can output zero, one, or more values for each input. The right-most/outer-most expression's output values are encoded and printed on stdout, followed by a newline. Those outputs are a stream. Inputs are a stream as well.

If the input to a jq filter is a stream of entities, then each is presented to the next filter in the pipeline separately.

jq also has a streaming JSON parser that outputs a stream of path/value pairs that can be used to reconstruct a value that would have been produced by the normal JSON parser. The streaming parser is not what this section is referring to.

The "," operator

The "," operator can be thought of as concatenating streams. For example, the expression

[ range(0;10), range (10; 20) ]

produces the same array as [ range(0;20) ].

"Regular" functions

Consider a function, say f/1 (a function with arity 1). If (null | f(S)) emits the same stream as S | f(.), then we say that "f is regular as to its first argument".

In jq1.4 and earlier one can define such functions like this:

def f(g): g as $g | ...; # only referring to $g in ...

In jq1.5rc1 a "regular" function can be defined with convenient new syntax:

def f($g): ...; # only referring to $g in ...

Examples:

Short-circuit semantics: and is regular in the first input but irregular in the second.
If S is a stream with n items, then S | (false and .) is a stream of n values, but null | (false and S) simply yields false.

Conversion between streams and arrays

The [ ... ] operator can be used to collect the expression's stream of outputs into an array. For example, [0,1] is an expression collecting the stream of the two specified values (0 and 1) into an array (internally such a constant expression is constant-folded), while [range(.;.*2)] collects the integers from . up to but excluding twice that.

Arrays can be converted into streams using the postfix [] operator, or | .[], e.g. [1,2][] outputs a sequence of values: first 1, then 2.

"Cartesian product" behavior of some expressions

Some expressions have a kind of "Cartesian product" behavior. For example, range(0,1;3,4) is the same as range(0;3), range(0;4), range(1;3), range(1;4). Such behavior can be surprising, so beware. The visual similarity of , and ; can cause confusion, especially when a function is defined with several arities. For example, range comes in 1-, 2-, and 3-arity formulations, and therefore range(0,10) and range(0;10) look similar, but do very different things.

For another example, the expression "\( (true, false) ) and \( (true, false) )" produces 4 strings, whereas the expression (true, false) and (true, false) produces only three boolean values, because 'false and (true, false)' evaluates simply to false as the generator is never given the chance to start.

Recursion and Tail Recursion Optimization

In general, the use of recursion is limited by stack depth, and deep recursion is slow. But in jq 1.5rc1+ an important class of tail-recursive functions are automatically optimized to avoid this. In particular, arity-0 tail-recursive functions are automatically optimized.

It's often possible to rewrite a recursive function of any arity into a tail-recursive 0-arity function using a local 0-arity function.

Consider, for example, the standard recursive definition of the factorial function:

def fact(n):
  if n <= 1 then n
  else n * fact(n-1)
  end;

This is nearly tail-recursive, but not quite. To make it tail-recursive, one can introduce an accumulator. To do this and to avoid the recursion-depth penalty, we will introduce a 0-arity tail-recursive local function:

def fact(n):
  def _fact:
    # Input: [accumulator, counter]
    if .[1] <= 1 then .
    else [.[0] * .[1], .[1] - 1]|  _fact
    end; 
  # Extract the accumulated value from the output of _fact:
  [1, n] | _fact | .[0] ;

The trick is to use a local function, and to set things up so that all the information it needs at each step is available either from its inputs or from the lexical environment captured by the local 0-arity function.

An efficient way to compute factorials is to use reduce:

def fact: reduce range(1; .+1) as $i (1; . * $i);