PaperDraft - nponeccop/HNC GitHub Wiki

Can humans maintain generated code?

Abstract

We develop an intermediate high-level language based on macros, type inference and an effect system to generate human-grade C code by source-to-source translation. There are three intended application areas: creation of lightweight compilers, software re-engineering and fighting language conservatism.

Lightweight compilers

Compilers tend to be very large in lines of code. To reduce size, one approach is to reuse an existing compiler by translating the new language into an old language. However, the generated code is usually non-idiomatic and we conjecture that compilers don’t optimize non-idiomatic code well.

Our goal is to generate idiomatic high-quality programs so the underlying compiler does its best.

Software re-engineering

As software decays, old systems are rewritten using new languages. Incremental approaches, when the system is rewritten piece by piece, have been shown to have advantages over non-incremental ones. Incremental re-engineering of legacy systems suffers from slow or coarse-grained interoperation between old and new languages. We conjecture that the best way of language interoperation is through source translation from a new language into idiomatic code in old language.

We plan to implement a modern language suitable for re-engineering legacy C code. Many previous attempts have largely failed, but we hope that idiomatic maintainable output C code was the key component those brave people missed.

Language conservatism

Adoption of advanced languages takes decades. We conjecture that an ability to develop using a new language behind the scene while pretending that an old language was used will help the industry adopt new languages, especially in such conservative communities as C developers.

We believe that an ability to show generated idiomatic code in an old language is crucial for this pretending to be practical.

Semantic gap cannot be big

We observe that if a compiler does simple transformations such as changes in type system or syntax, it’s easy to generate idiomatic code.

However, if major changes are to be done such as a CPS transform or replacement of garbage collection with manual memory management, it becomes unclear how to get anything that resembles code written by a human.

Optimization

Our experience shows that a shallow almost purely syntactic translation gives poor results, which can be significantly improved by an optimizer.

First, C optimizers are designed to deal primarily with first-order code with few control abstractions. So feeding them with a straightforward translation of functional code gives poor performance.

Second, control abstractions bloat C code. It's the reason why they are not used by humans in the first place.

So a separate optimization step is necessary. We develop a novel optimizer which preserves identifier and improves readability of output code by making it more idiomatic.

Intermediate languages

We take an incremental bottom-up approach to build a language which can be idiomatically translated into C. We plan to develop a sequence of intermediate languages. So far only a prototype of the simplest of the languages is built.

The bottom language we are working on is just a slight modification of C. We added a new syntax, a new type system with polymorphism and type inference, weak hygienic macros for control abstraction and are investigating an effect system for a slightly better memory and effect management than C has. Note that we cannot change how memory is allocated or freed as it’s a non-trivial step, and we cannot use a conservative tracing garbage collector or reference counting as they are not idiomatic in C.

Scripting subset of C++

While studying an aged legacy system written in C++, we noticed that most of the recent maintenance changes are made in glue code – the code that just calls functions but neither defines its own data types nor deals with low-level bit twiddling.

So we found a pretty small subset of C++ suitable for reimplementation as a new language. Our language has the following constructs: RAII-style local binding with let-style lexical scope, non-recursive function definition (higher-order functions are implemented as hygienic macros), while loop, assignment, if statement, compound statement. We support lexical closures and high-order use of functions, but they are completely eliminated in the output code.

Imperative list folding example

We noticed that function inlining (which can be seen as macro expansion) preserves human maintainability If special care is given to generate meaningful identifiers at the call site. Here is an example in our language as we see it:

fold f x0 l =
	x = x0
	p = l
	while *p != NULL
		x := f x *p
		p := p->next
	x		

x = fold (\x y -> x + y + 1) 0 l

We plan to expand the code into this C fragment:

int x = 0;
list_int *p = l;
while (*p != NULL)
{
	x += y + 1;
	p = p->next;
}

Prototype implementation

There is an incomplete prototype at http://github.com/kayuri/HNC/wiki written in 3000 LOC in Haskell and an attribute grammar DSL. We believe that when complete, the implementation of the first language will take no more than 6000 LOC.

To achieve this brevity, we used Haskell, UUAG attribute grammar preprocessor from Utrecht Haskell Compiler, HOOPL optimization library from GHC, Parsec parser library and Control.Unification library for generic first-order structural unification.

Why C++?

We reused some C++ features

Implementation overview

TBD

Future work

TBD

Selected related works

PureScript

Rust