Problems With Existing Languages - Zistack/program-modeling-language GitHub Wiki

Why a new language, rather than a new analysis over an existing language? There are a few major problems with existing languages.

Encoded data is the same as unencoded data

Most languages don't bat an eye at embedding an encoded datastructure into the program. This turns out to be a really big problem.

It's necessary to be able to manipulate encoded data. If we ever want to talk to the real world (through sensors/actuators/network), we're probably going to be encoding/decoding some data packets/streams.

However, embedding encoded data directly into a program is another story. Strictly speaking, this never needs to be done. All data in a program should be represented directly in terms of a language's data structure primitives, and not encoded as strings or some other encoded format.

We don't want to allow this because it lets the programmer embed programs written in an unanalyzable language into their code. It's one thing when we're taking those encoded programs as input from somewhere else (this will be how compilers, interpreters, and language analysis tools work), but they should never be hard-coded into another program.

As such, any data that passes through an encoding/decoding barrier needs to be treated especially carefully. We should generally expect that encoded data could be anything that the encoder/decoder could produce/digest. This way, we will not attempt to reason about what a specific piece of encoded data will do to a program's behavior.
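
As a rough illustration (a Python sketch, not the language being proposed here; the payload shape and function name are made up), a decoder is exactly the point where "could be anything the decoder digests" collapses into the one unencoded shape the rest of the program is allowed to assume:

```python
import json

def decode_point(blob: str) -> tuple[float, float]:
    """Cross the encoding barrier: opaque text in, plain structure out."""
    raw = json.loads(blob)  # raw could be *any* value json.loads can produce
    # Validation collapses the whole space of possible decoder outputs
    # into the single shape we are willing to reason about downstream.
    if not (isinstance(raw, list) and len(raw) == 2
            and all(isinstance(v, (int, float)) for v in raw)):
        raise ValueError("payload is not a point")
    return (float(raw[0]), float(raw[1]))
```

Before the `json.loads` call, no analysis should pretend to know what the data means; after validation, it is ordinary unencoded structure.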

To do this, we need to have a clear idea of the distinction between 'encoded' and 'unencoded' data, which existing languages don't even attempt to reason about.

Pointer arithmetic

First, pointer arithmetic makes it practically impossible to reason about the structure of data in general. While functional programming languages do not support memory accessed in this way, it is trivial in a functional programming language to construct such an abstraction using a map from integers to whatever. The problem is that arithmetic itself is too opaque to be used as the backbone of defining relationships between data elements. It is an implementation of semantic ideas, so the semantics have to be reverse-engineered from the implementation. This reverse-engineering is computationally expensive, especially when the implementation is wrong and the nice properties that it should have aren't actually present.
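
To make the point concrete, here is a minimal Python sketch (names hypothetical) of that map-from-integers abstraction. The "pointer arithmetic" lives entirely in arithmetic on keys, and the relationships it implements are invisible to anything inspecting the map itself:

```python
# A "memory" is just a map from integer addresses to values; pointer
# arithmetic is then ordinary arithmetic on the keys.
def alloc_string(memory: dict[int, str], base: int, s: str) -> None:
    for offset, ch in enumerate(s):
        memory[base + offset] = ch

memory: dict[int, str] = {}
alloc_string(memory, 100, "hi")

# "Dereferencing" base + 1 recovers the second character, but nothing in
# the map records that addresses 100 and 101 are related. That structure
# exists only in the arithmetic, and an analysis would have to
# reverse-engineer it from the implementation.
assert memory[100 + 1] == "i"
```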

We have to remove pointer arithmetic, which also means removing arrays. By doing this, we force any attempt at reconstructing pointer arithmetic to rely on crossing the data encoding barrier, as data whose structure is represented by its layout in some memory datastructure is necessarily encoded. By forcing such an activity to use that feature, we essentially give up on any hope of analyzing the exact nature of what is being done, and instead reason in terms of generalities about the overall space of possibilities. The programmer will be discouraged from using such a feature for their internal datastructures, since the compiler will not be able to understand what they are doing. This works so long as we provide an alternative method for defining datastructures which is sufficiently powerful that nothing else is ever needed. More on that in the next section.

Structure as defined by implementation rather than as defined by meaning

Even if all of our indirect references are instead represented by opaque pointers (that is, datastructures are just graph structures), we still cannot properly reason about their shape. Graph grammars don't work (determining equivalence of two context-free/context-sensitive (or stronger) graph grammars is undecidable), and separation logic is too expensive to use and is full of potholes into the void (more decidability issues) that you can fall into.

K-dimensional grids are a particularly interesting example. Representing them at all requires graph grammars that are too powerful to reason about, yet they're fundamentally a very simple structure. When implementing grids, we typically rely on arrays (pointer arithmetic) to implicitly define for us the invertible relationships that touch the adjacent cells in each dimension, as well as the various invariants that must hold about paths through the grid. We need to make those relationships explicit in the program in order to talk about them in the absence of pointer arithmetic. We also need some other way of talking about the path invariants, because global reasoning about paths through structure is very difficult.
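
A small Python sketch of the usual array-backed grid (a flat list standing in for raw memory) shows how implicit those relationships are. The neighbour relations and the path invariants are never stated; they are consequences of index arithmetic that happen to hold:

```python
WIDTH, HEIGHT = 4, 3
cells = list(range(WIDTH * HEIGHT))  # flat array standing in for memory

def index(x: int, y: int) -> int:
    return y * WIDTH + x  # the layout *is* the structure

def right(i: int) -> int:
    return i + 1          # implicit, invertible "east neighbour" relation

def down(i: int) -> int:
    return i + WIDTH      # implicit "south neighbour" relation

# The path invariant right-then-down == down-then-right holds only
# because of how the arithmetic happens to compose; nothing in the
# program states it explicitly, so an analysis has to rediscover it.
i = index(1, 1)
assert right(down(i)) == down(right(i))
```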

Well, what if we got rid of the idea of paths in general? Let's think about how you draw a grid on a piece of paper (left-to-right, top-down). You start with a line. Then another identical line next to it. Then another. Eventually, you have enough identical lines, and you draw another line perpendicular to these lines. Then you draw another line identical to that one below the first one, and another, and another until you have enough for your grid.

We can make this analogy in programming. Each line is just a trace of a total order across one dimension of the cells. Moreover, parallel lines all refer to the same total order. A grid is just a geometric product of two total orders.
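
One way to sketch that idea (a Python toy, not a real implementation of the proposed language): build the grid as the product of two explicit total orders, and derive adjacency from each order's successor relation rather than from arithmetic.

```python
import itertools

# Two independent total orders, one per dimension.
xs = ["x0", "x1", "x2"]
ys = ["y0", "y1"]

# The grid is the product of the two orders: its cells are pairs, and its
# only primitive relations are the successor relations of each order.
cells = set(itertools.product(xs, ys))

def east(cell):
    x, y = cell
    i = xs.index(x)
    return (xs[i + 1], y) if i + 1 < len(xs) else None

def south(cell):
    x, y = cell
    j = ys.index(y)
    return (x, ys[j + 1]) if j + 1 < len(ys) else None

# east and south commute by construction: each moves along its own
# order and leaves the other coordinate untouched.
assert east(south(("x0", "y0"))) == south(east(("x0", "y0"))) == ("x1", "y1")
```

Here the commutativity of the two moves follows from the construction itself, rather than being an arithmetic coincidence to be reverse-engineered.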

This description of a grid is about the meaning rather than the implementation. Our implementation of a grid under the hood might well still be pointer arithmetic, or the equivalent of a 2-dimensional doubly-linked list, or whatever. Regardless, the relationships represented by all of those structures are the same - the geometric product of two total orders.

But we could think of other basic relationships than total order, and we can think of other compositions than geometric product. It turns out that this is the simplest and most flexible way to reason about the shapes and meanings of datastructures - but it completely invalidates the way that we even talk about data in existing programming languages. You cannot operate this way and define what an individual node looks like. Instead, you describe what a datastructure looks like by describing the geometric composition of some fundamental relationships between nodes.

The nesting of one datastructure inside of the nodes of another is a composition that we use all of the time.

We also have datastructures that don't represent total orders, but instead represent partial orders, or even arbitrary graphs.

Match expressions

Separately, the use of things like match expressions to enumerate low-level logic functions ends up having a problem similar to pointer arithmetic. They don't directly interfere with reasoning about structure, but they make it very easy to encode (NP-)hard problems for the compiler to solve. We want to avoid enumerative approaches to defining functions in all cases, so match expressions, any of their derivatives (e.g. if/else, switch), and any way of constructing them (e.g. higher-order functions with dictionary types) cannot be present in the language.
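
For concreteness, this is the enumerative style being ruled out: a logic function given as a bare table of cases (a Python dict standing in for a match expression here). The compiler sees only an opaque list of input/output pairs, and reasoning about compositions of such tables quickly becomes a constraint-solving problem:

```python
# An enumerative definition of XOR: every input case listed explicitly.
# Nothing here tells an analysis *why* the outputs are what they are.
XOR_TABLE = {
    (False, False): False,
    (False, True):  True,
    (True,  False): True,
    (True,  True):  False,
}

def xor_enumerated(a: bool, b: bool) -> bool:
    return XOR_TABLE[(a, b)]
```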

Instead, we need to reason structurally about the relationships between our logic values (which are generally provided by enumeration types). The logic values define a value space, and that value space generally has some sort of algebraic structure (think like rings and stuff) that provides some basic relationships between the different values in the value space. Relationships between compositions of value spaces are defined in terms of the relations provided by the algebraic structures of those spaces.

This way, all of our reasoning is structural rather than enumerative, both in the program and in the compiler. This is much more tractable. We just need to ensure that we provide a sufficiently powerful set of algebraic structures and composition rules so that the programmer never finds themselves unable to specify an appropriate model for their problem domain.
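
A small sketch of the contrast, assuming the Booleans are modeled as the two-element field GF(2) (XOR as addition, AND as multiplication; the Python names are illustrative). Functions are then defined by folding the algebra's operations over structure, not by listing cases:

```python
from functools import reduce

# The Booleans form the two-element field GF(2): XOR is its addition
# and AND is its multiplication.
def xor(a: bool, b: bool) -> bool:
    return a != b  # addition in GF(2)

def parity(bits: list[bool]) -> bool:
    # A structural definition: fold the ring's addition over the list.
    # No case enumeration anywhere; the algebra carries the meaning.
    return reduce(xor, bits, False)

assert parity([True, False, True, True]) is True
```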

Non-canonical feature sets

Lastly, there is the issue of having too many different ways to write the same piece of code. This particular issue can't really be pinned down to any particular feature or family of features, but rather to the philosophy that we typically use when devising the overall set of features. We really need to ensure that there is fundamentally only one way to represent any given type of relationship.

Since encoded programs are already handled in a special way, we don't have to worry about the fact that our language will be powerful enough to build any language feature from any equivalently powerful language.

Instead, we just have to worry about instances where we provide two primitives (or sets of primitives) that allow us to perform the same fundamental computation. For example, looping and recursion are both features that can be used to encode the same sorts of computation. We must ensure that only one such feature can be used in any given situation.
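
A minimal Python illustration of that redundancy: the same computation written twice, once with the looping primitive and once with recursion. A canonical feature set would admit only one of these forms for any given situation:

```python
# Two primitives, one computation: summing a list with a loop...
def sum_loop(xs: list[int]) -> int:
    total = 0
    for x in xs:
        total += x
    return total

# ...and with recursion. Both encode the same fold.
def sum_rec(xs: list[int]) -> int:
    return 0 if not xs else xs[0] + sum_rec(xs[1:])

assert sum_loop([1, 2, 3]) == sum_rec([1, 2, 3]) == 6
```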