Shape - bobjervis/parasol GitHub Wiki

Overview

Parasol incorporates collections as first-class objects. For languages like C or Java, collections are primarily manipulated one element at a time or through libraries of similarly written code.

In Parasol, one of the primary means for expressing parallelism in a program is through the use of shape. Formally, a shape is a mapping from an N-tuple of keys (K0, K1, ..., Kn) to a value space (V). A shape is implemented by a template class, parameterized by the key and value types. A shape class is responsible describing how the key space is arranged in memory so that the compiler can make intelligent decisions about generating code that manipulates objects with that shape.

Currently, the only shapes defined are vectors and maps. A vector is a contiguous array of key values from 0 to N. A map is a sparse array of any sort of key value (as long as the key type implements some kind of comparator and a hash method). So far, these classes have minimal implementations and are not related in any way. Whatever special recognition they receive is entirely hard-coded into the compiler.

The current phase of the Parasol project finally entails getting real about how to build out shapes.

Shaped Expressions

An expression in a language like C or Java one typically expresses the computation of a single scalar value. In order to take advantage of parallelism in algorithms, we look for ways to express algorithms that captures such intent. One way to do this is to continue to write scalar expressions and manually construct a threading model, or use some distributed computational framework like an MR (Map-Reduce). While it is possible to write this style of code, Parasol incorporates vectors and other shaped collections of objects to provide a more direct expression of the intent, without having to fixate on the details.

It should be noted that some of these design ideas originate in a language called APL. Created in the early 1960's, APL has been kicking around universities and hobbyists ever since, but has never caught on. The fatal flaw of APL is that it uses a unique font to write programs and only a handful of IBM Selectric typewriter-based terminals were ever outfitted with both a keyboard and type-ball for the language. Every other piece of computer equipment requires cumbersome and annoying conversion. The language is also not strongly typed nor does it possess a substantial array of modern language features.

It does have the basic design notion that in any expression where you would combine scalar numeric values, you can substitute vector values and the operators will simply be applied element-by-element across the vector. It also included a variety of useful operators as well as the powerful concept of a reduction.

More recently in the early 1990's Thinking Machines, Inc. created a dialect of C called C* that included built-in vector computations that were extended to include distributed arrays. Being a C dialect, C* also had the advantage of many of the features of C missing from APL. Unfortunately, Thinking Machines did not survive and their technology was abandoned.

Today, any self respecting data center has the computational horsepower of a Connection Machine, but we still use scalar programming languages.

The Parasol Design

The basic feature that Parasol recognizes is the notion that an expression that would otherwise include scalar values can use shaped values instead. Because vectors are a kind of shape, this concept is a generalization.of the rules APL employed. The designers of C* were constrained by the need for C compatibility, where Parasol is not constrained by such needs.

So, in Parasol, all basic arithmetic and logical operators, function calls and control flow can take either scalar or shaped operands. If a scalar operand confronts a shaped operand across a binary operator, the scalar operand is converted, by replication, to a compatibly shaped collection. If two shaped operands confront one another, if they have the same shape, the operator is applied element-wise and the result of the expression is a shaped value. If two shaped operands do not have the same shape, then the operation is disallowed.

This last point is worth noting. As the number any complexity of shapes broadens, it may become desirable to convert, for example, from simple single-machine vectors to distributed vectors and back in order to carry out some larger computation. By restricting the manner of mutating data shape at least for now, I can focus the feature set and complete a useful facility.

Shape Conformance

One of the key design issues with arrays going back to the days of Pascal and beyond is the question of how well libraries can be built to handle various sizes and shapes of array. Pascal made the wrong choice by insisting that for arrays to be conformant (for Pascal meant that they could be passed as parameters to a function) they had to have not only the same element type, but the same size as well. C had such a weak notion of array as an entity that it was easy to write a conformant library. The only real problem is that the array wasn't self-describing so the programmer had to communicate the array size explicitly as a different parameter. And C still had the problem that for multi-dimensional arrays, you only got a free pass on the first one.

Earlier languages such as Algol had gone to great lengths to record array dimension information at runtime to allow libraries to process arrays of arbitrary dimension and size. Of course, the array indexing code was not that fast because the runtime information had to be consulted at runtime. Such is the trade-off involved in dynamic array element calculation.