Multidimensional array comprehensions - BenoitKnecht/julia GitHub Wiki

Julia introduces a new programming construct called a multidimensional array comprehension or just an array comprehension for short. Array comprehensions are similar to the notion of list comprehensions in other languages such as Python, Haskell and OCaml, but extend this notion to manipulate and generate arrays of arbitrary dimension. Each dummy variable used in the array comprehension corresponds to a dimension of the output array; if the comprehension expression value itself has non-zero dimension then the total dimension of the output is the number of dummy variables plus the dimension of the value. If we view n-d arrays as rank-n tensors, and thus see 1-d arrays as vectors and 2-d arrays as matrices, many common vector, matrix and tensor operations can be expressed concisely and clearly using array comprehensions.

Array comprehensions are envisioned to be a key construct of the julia language that will differentiate it from other languages. On one hand, Matlab is pretty fast, and it is almost impossible to beat Matlab at what it does. On the other hand, Matlab codes when rewritten in C often run 10 times faster. What gives? The key idea behind array comprehensions is to express almost all operations at a higher level, close to their mathematical definitions. Optimizations can then be applied to this high level to produce heavily optimized low-level code, for multiple comprehensions and compounded comprehensions. This scheme should allow the generation of optimized low-level code from a high level specification that matches the structure of optimized C code written by a good C programmer.

Table of Contents

TODO: UPDATES

Update what follows with the following modifications:

  1. allow placing expression dimensions explicitly with ...
  2. ... defaults to the beginning; i.e. expression dimensions come first, not last
  3. remove/fix uses of ... in examples below
  4. fix various ... identities

Array Comprehension Notations

The following sections give three increasingly succinct syntaxes for writing array comprehensions. Through the forms, we follow the basic example of how to express the row sum and column sums of a matrix. We first present the most general, verbose notation for array comprehensions together with its meaning, and then present two increasingly abbreviated but equivalent forms.

Explicit range

This is the most general form for writing array comprehensions. The general form is:

A = [ F(x,y,...) | x=rx, y=ry, ... ]
The meaning of this form is that F(x,y,...) is evaluated with the variables x, y, etc. taking on each value in their given list of values. Values can be specified as any iterable object, but will commonly be ranges like 1:n or 2:(n-1), or explicit arrays of values like [1.2, 3.4, 5.7]. The result is a tensor object with dimensions that are the concatenation of the dimensions of the dummy variable ranges rx, ry, etc. and the dimension of the result of each F(x,y,...) evaluations—which must all have identical shape. See below for details on array comprehensions of non-scalar expression.

To clarify, let's consider the concrete example of computing row and column sums of a matrix, A. For a single dummy variable, the array comprehension notation with explicit variable ranges is similar to list comprehensions in other languages:

row_sums = [ sum(A[i,:]) | i=1:dim(A)[1] ]
col_sums = [ sum(A[:,j]) | j=1:dim(A)[2] ]
Here the sum function is assumed to produce a single scalar value for an array of any dimension giving the total of all entries in the array. The dim function gives a list of the sizes of the dimensions of a given multi-dimensional array. The output values are vectors (1-d arrays) containing the row and column sums of A, respectively.

Implicit range

Since the range of dummy variables will often be the entire set of possible index values, there is a more terse notation, where the range of values is implicit:

row_sums = [ sum(A[i,:]) | i ]
col_sums = [ sum(A[:,j]) | j ]
This is completely equivalent to the explicit form except that the range of values for dummy variables are automatically determined from the shape of the arrays they are used to index into. The following rules determine when a dummy variable may be used with an implicit range:
  1. The variable must be used at least once as a direct index into some object. The variable's implicit range will be the range of index values for that indexed object in the indexed dimension.
  2. Every instance of the variable being used to directly index into an object must have the same implicit range.
  3. The variable may not be used as an index via a compound expression. For example, for the comprehension expression A[i+1,:], the variable i cannot have an implicit range.
Variables with implicit and explicit ranges may be freely mixed in the same array comprehension: only variables without explicit ranges need to satisfy these requirements. It is acceptable to use a variable outside of indexing expressions as long as the above rules are satisfied. For example, the following are acceptable:
u = [ v[i] + i | i ]
w = [ f(i, v[i]) | i ]
The following are not acceptable:
[ f(i) | i ]                                 # WRONG: i is never used as a direct index
[ v[i+1] | i ]                               # WRONG: no indexing via compound expressions
v = [1,2]; u = [1,2,3]; [ v[i] + u[i] | i ]  # WRONG: conflicting implicit ranges for i

Underscore

Note: This notation is still under consideration and may not be deemed acceptable. The main concern is that the question of how much of the surrounding expression the _ "consumes". It seems that any rule used here means that substitution of expressions using _ as subexpression in larger context can change the meaning of the subexpression in context. This may be deemed too problematic.

A third and even more terse form may be used. In this form, the pseudo-variable _ is used instead of dummy variables and the [|] surrounding the expression is omitted entirely. Each instance of _ is assumed to be a new dummy variable, ranging over its full implicit range. This allows us to simplify the row and column sum example even further:

row_sums = sum(A[_,:])
col_sums = sum(A[:,_])
This is completely equivalent to the above expressions but bears much greater superficial resemblance to Matlab's notations sum(A,1) and sum(A,2), but of course with much greater generality. Also note that an expression with no _ may be considered as a special case of this notation with zero dummy variables. This observation motivates the omission of the outer []'s.

One immediate question regarding the underscore notation is how much of the surrounding expression do underscores "consume"? The rule is that _ expands to consume the surrounding expression but stops at containing comprehensions. Thus the expression

sum([ A[_,j].*B[j,_] | j ])
disambiguates as
sum([ [ A[i,j].*B[j,k] | i,k] | j ])
rather than
[ sum([ A[i,j].*B[j,k] | j ]) | i,k ]
The latter expression computes the matrix product, A*B, whereas the former computes a scalar value. (With some effort, it can be seen that this value is the sum of the values of entries of the product A*B.) The motivation for this rule is that an expression like
A[_,j].*B[j,_]
should have the same meaning by itself or when used inside a comprehension. Equivalently, we want the parser to be able to interpret [ ... | j ] without having to parse all of the inner "..." expression first. There is however some danger that remains: if such an expression is embedded as a subexpression of a larger expression, its meaning may change:
sum(A[_,j].*B[j,_])
means
[ sum(A[i,j].*B[j,k]) | i,k ]
rather than
sum([ A[i,j].*B[j,k] | i,k ])
This seems to be an unavoidable consequence of not providing explicit scope indicators on _ expressions.

Array Comprehensions of Non-Scalar Expressions

If the computed expression of an array comprehension is not a scalar, then all the values produced must be arrays of identical shape. The dummy variable vectorization dimensions come before the expression dimensions making the total dimension of the output array the number of dummy variables plus the dimension of the inner expression. This convention can be expressed by the following identity. If F is a function mapping scalars to vectors, the following holds:

[ F(i) | i=ri ] == [ F(i)[j] | i=ri, j=rj ]
For example, when applied to a 2-d array, A, this yields
A[_,:] == [ A[i,:] | i ] == [ A[i,j] | i,j ] == A
On the other hand
A[:,_] == [ A[:,j] | j ] == [ A[i,j] | j,i ] == A.'
That is, A[:,_] is the transpose of A. In general, when applied to a n-d array, the output is an n-d array is related to the input array by permutation of its indices; in particular, those "sliced" by _ are moved to the front, while those sliced by : are moved to the back. If A is a 3-d array, this gives the following equivalences:
A[:,_,:] == [ A[i,j,k] | j,i,k ]
A[:,:,_] == [ A[i,j,k] | k,i,j ]
A[_,:,_] == [ A[i,j,k] | i,k,j ]
A[:,_,_] == [ A[i,j,k] | j,k,i ]
Not all dimension permutations can be expressed using _. In this case, the permutation [A[i,j,k]] cannot be expressed.

Using this convention, we may convert expressions with multiple dummy variables into nested comprehensions with a single dummy variable each:

[ F(i,j) | i=ri, j=rj ] == [ [ F(i,j) | j=rj ] | i=ri ]
Note the order in which the variables occur is the same as the order in which their comprehensions are nested—but the order in which variables appear after the main expression when nested is reversed. Since the dimensions added by array comprehension variables are at the front of the expression dimensions, the outermost variable corresponds to the first dimension of the result value; the reversal afterwards is only an artifact of "popping the variables off the stack".

Examples

Matrix multiplication

Assume that A and B are matrices with dimensions [a,b] and [b,c]. The following computes their product:

C = [ sum([ A[i,j].*B[j,k] | j ]) | i,k ]
This example demonstrates a situation where the requirement that all uses of a dummy variable with implicit range would cause an error if used in multiple locations with different range: if A and B had different inner dimensions, then this would produce an error. Since .* is a vector operation, we can simplify this expression by omitting the inner comprehension, using a vector product of slices instead:
C = [ sum(A[i,:].*B[:,k]) | i,k ]
This can be simplified further using underscore notation:
C = sum(A[_,:].*B[:,_])
Matrix multiplication using an opaque inner product function may be expressed just as simply:
C = inner_product(A[_,:],B[:,_])

Matrix transpose

Assuming that A is an [a,b] matrix, the following expresses its transpose:

A[:,_] == [ A[i,j] | j,i ] == A.'
See above for a discussion of why A[:,_] produces the transpose of A.

Matrix diagonals

The main diagonal of a square matrix may be taken like this:

d = [ A[i,i] | i ]
If A is not square, an explicit range is required:
d = [ A[i,i] | i=1:min(dim(A)) ]
Other diagonals also require explicit ranges:
d = [ A[i,i-1] | i=2:min(dim(A)) ]
The diag function could be defined thus:
function diag(A,k=0)
  return [ A[i,i+k] | i=(1+max([0,k])):min(dim(A)) ]
end

Tensor slice summation

Assume T is a 3-tensor with dimension [a,b,c]. This computes a vector of the sums of matrices at each value of the second dimension:

sum(T[:,_,:])
The output has dimension [b]. A matrix of sums of the second dimension:
sum(T[_,:,_])
The output has dimension [a,c].

Comprehension Elimination

One basic observation allows for a broad class of computational simplifications and optimizations. The observation is that indexing into a comprehension with a scalar value is equivalent to eliminating the corresponding dummy variable from the comprehension altogether and simply replacing it with the scalar value:

[ F(i) | i=ri ][j] == F(ri[j])
Note that in the case where ri has the form 1:n, we have ri[j] == j for all j from 1 to n. The scalar value may itself be a dummy variable in an outer comprehension expression:
[ [ F(i) | i=ri ][j] | j=rj ] == [ F(ri[j]) | j=rj ]
However, this simplification only applies to indices that are indexed by a scalar so range indices may have to be expanded into comprehensions before they can be eliminated or simplified again:
[ F(i,j) | i=ri, j=rj ][k,:]
 == [ [ F(i,j) | i=ri, j=rj ][k,h] | h ]
 == [ F(ri[k],ri[h]) | h ]
Fortunately the results are fairly intuitive:
[ F(i,j) | i=ri, j=rj ][:,h]
 == [ [ F(i,j) | i=ri, j=rj ][k,h] | k ]
 == [ F(ri[k],rj[h]) | k ]
If functions defined in Julia using array comprehension are inlined, many expressions become nested comprehensions where the expression of the outer comprehension indexes into the result of the inner comprehension. When this occurs, comprehension elimination may be applied and can produce drastic reductions in memory usage and even computational complexity without sacrificing the clarity and readability of the high-level code. The following sections give two specific examples.

Optimization example: diag(A*B)

Suppose we have the following code:

d = diag(A*B)
Here the matrix product is defined as in the previous section and diag returns the main diagonal of a matrix, also as defined in the previous section. If these functions are written in Julia, they can be inlined, resulting in the following computation:
C = [ sum(A[i,:].*B[:,k]) | i,k ]
d = [ C[j,j] | j ]
These can be combined into a single assignment
d = [ [ sum(A[i,:].*B[:,k]) | i,k ][j,j] | j ]
which can be further simplified via comprehension elimination into
d = [ sum(A[j,:].*B[:,j]) | j ]
thereby avoiding computation of the entire temporary product matrix, C. Not only is the intermediate matrix product never stored in memory, but most of it is never computed at all. This transformation changes what would be an O(n^3) computation in Matlab requiring O(n^2) memory into an O(n^2) computation requiring O(n) memory. While the programmer could make such optimizations manually, most never do, and even if they did, the loss of high-level clarity would be substantial. This demonstrates that allowing operations like matrix product and diagonal to be written easily in native Julia constructs like multidimensional array comprehensions gives a potentially huge advantage for optimizing away unnecessary computations that are found everywhere in typical numerical codes.

Optimization example: sum((A*B)[_,:])

Consider the computation

s = sum((A*B)[_,:])
This computes the row sums of the matrix product of A and B. This may be expanded as
C = [ sum(A[i,:].*B[:,k]) | i,k ]
s = [ sum(C[j,:]) | j ]
which may be combined into a single expression as
s = [ sum([ sum(A[i,:].*B[:,k]) | i,k ][j,:]) | j ]
Comprehension elimination simplifies this to
s = [ sum([ sum(A[j,:].*B[:,k]) | k ]) | j ]
While the computational simplification is not as drastic as the diag(A*B) example, there is still a significant savings since the temporary array, C, is never stored so the algorithm requires only O(n) memory instead of O(n^2) as it would in Matlab. In practice, this reduced memory requirement translates into a significant increase in both performance and scalability.

Optimization example: diag(A*A.')

Consider:

diag(A*A.')
Since this is similar to the diag(A*B) reduction above here are the steps:
diag(A*A.')
 == diag([ sum(A[i,:] .* A.'[:,j]) | i,j ])
 == diag([ sum(A[i,:] .* A[j,:]) | i,j ])
 == [ [ sum(A[i,:] .* A[j,:]) | i,j ][k,k] | k ]
 == [ sum(A[k,:] .* A[k,:]) | k ]
 == [ sum(A[k,:].^2) | k ]

Optimizations on multiple and compound comprehensions

The following describes the process by which indexing and comprehensions are transformed so as to eliminate unnecessary computations. The assumption here is that expressions have no side-effects, which may not be a generally valid assumption; these transformations apply, however, so long as that assumptions is valid.

  1. type inference & inlining
  2. slicification: write all non-scalar expressions as trivial slices:
    A => A[:,:]
  3. deslicification: rewrite all slices as comprehensions:
    A[i,:,k] => [ A[i,j,k] | j ]
    All comprehension expressions are now scalars and will remain so for the rest of the transformation process.
  4. comprehension unnesting: collapse nested comprehensions into a single comprehension:
    [ [ <i>expr</i> | j ] | i ] => [ <i>expr</i> | i, j ]
  5. comprehension elimination: when an expression indexes into the result of a comprehension, eliminate the comprehension:
    [ <i>expr(i)</i> | i=ri ][j] => <i>expr(ri[j])</i>

Generation of for loops from comprehension expressions

Consider this simple comprehension for adding two matrices, C = A + B

C = [ A[i,j] + B[i,j] | i,j ]
This is the code that should get generated from this comprehension:
[m,n] = size(A);
C = zeros(m,n)
for i=1:m
  for j=1:n
    C[i,j] = A[i,j] + B[i,j]
  end
end
Alternatively, a flat loop could have been directly generated from the comprehension, or by using polyhedral transformations:
for k = 1:m*n
  C[k] = A[k] + B[k]
end
With type inference, the ref/set operations to be fully optimized away. If the types of X and Y are the same, even simpler code could be generated:
for k = 1:m*n
  C.buffer[k] = A.buffer[k] + B.buffer[k]
end

Detailed transformation example

Suppose the +(Tensor(t,r),Tensor(t,r)):Tensor(t,r) function is defined in Julia as:

function +(A:Tensor(t,r),B:Tensor(t,r)):Tensor(t,r)
  if r == 0
    return A[] + B[]
  else
    return [ slice(A,r,i) + slice(B,r,i) | i ]
  end
end
The base case is a call to the scalar addition +(t,t):t, where t is the underlying scalar type. The notation A[] accesses the unique value in the 0-tensor, A. The slice() function is assumed here to slice a tensor in a single dimension given by the second argument with the slice given by the third argument. This notation is not yet completely decided.

Now let us expand and transform the expression A + B as the Julia compiler might for a 2-tensor, assuming partial evaluation of the recursion at compile time:

A + B
 => [ slice(A,2,j) + slice(B,2,j) | j ]
 => [ [ slice(slice(A,2,j),1,i) + slice(slice(B,2,j),1,i) | j ] | i ]
 => [ [ slice(slice(A,2,j),1,i)[] + slice(slice(B,2,j),1,i)[] | j ] | i ]
 => [ [ A[i,j] + B[i,j] | j ] | i ]
 => [ A[i,j] + B[i,j] | i,j ]
This could be transformed into a for loop notation as shown above. Thus far, the transformation is implementation independent and could be applied to any Tensor class. For a given implementation of Tensor, such as Array, we can directly operate on the buffer of values, possibly even optimizing the double loop into a single loop. Optimizing that single loop further would likely be left to LLVM.

More complex examples

The above example does not give much sense of the power of this technique because the result is exactly what Matlab would do to compute A + B. Therefore, we'll walk through a few more complex examples here to see the real benefit.

A + B + C
Matlab would compute A + B first, store it in a temporary array, D, and then compute D + C. Let's see what Julia does:
A + B + C
 ... => [ A[i,j] + B[i,j] | i,j ] + C
 ... => [ [ A[i,j] + B[i,j] | i,j ][k,l] + C[k,l] | k,l ]
 ... => [ A[k,l] + B[k,l] + C[k,l] | k,l ]
Where Matlab needs to create a temporary matrix and iterate the summation loop twice, Julia is able to generate code that loops a single time and creates no temporary matrix.

Consider a still more complex expression:

(A + B') * C
Matlab likely has to create a temporary matrix to hold B', then maybe use the same temporary to store A + B' and finally compute (A + B') * C in another pass. In Julia, the above is transformed as follows:
(A + B') * C
 ... => [ A[i,j] + B[j,i] | i,j ] * C
 ... => [ sum([ [ A[i,j] + B[j,i] | i,j ][k,h] * C[h,l] | h ]) | k,l ]
 ... => [ sum([ (A[k,h] + B[h,k]) * C[h,l] | h ]) | k,l ]
Again, no temporaries need be computed, and the operation as a whole requires the same set of loops as just a matrix product.

Parallelization

As long as expressions are free of side-effects, parallelization is easy. For the multicore shared memory case, parallelization is almost trivial, with different threads doing different parts of the comprehension. One question to think about is where in the compiler passes should the decision for parallelization occur? Should it occur after the comprehensions have been simplified and converted into loops? Should it happen after the comprehension simplification but before the conversion into loops? What kind of a parallel runtime is necessary to implement this quickly with low overhead?

The distributed memory case brings a few more challenges. Consider the simple case of the matrix transpose. In distributed memory, this can entail a lot of communication, and clearly, communicating one scalar at a time, although massively parallel, is also massively slow. For distributed memory, perhaps an additional compiler pass may be necessary to deal with locality and distribution information, to schedule all the communication. In any case, a basic Global Address Space (GAS) with non-blocking send/recv is the minimal infrastructure that is necessary for something like this.

References

⚠️ **GitHub.com Fallback** ⚠️