Qi Meeting Apr 11 2025 - drym-org/qi GitHub Wiki
Two Directions in Deforestation Architecture
Qi Meeting Apr 11 2025
Adjacent meetings: Previous | Up | Next
Summary
We discussed next steps on deforestation architecture, specifically two aspects to achieving a faithful representation of stream fusion in the core language. We also discussed ways that we might "tap" into the flow and see its holistic operation for debugging purposes, and a "divide" with independent compilation representing a challenge in multi-language settings.
Background
We've recently been debugging a thorny performance issue, but we decided to set that aside for the moment and return to working on deforestation. We could always return later to continue the "fantastic voyage," with more intrepid explorers who may be interested in the debugging journey. If you'd like to sign up for this bold and perilous adventure, let us know! In the meantime, we must return to main project directions.
Observation: Deforestation in Qi is Opaque
The way our deforestation pipeline works right now is, it first expands surface level list operations like map, filter and foldl to a core form called #%deforestable. This core form actually gets deforested in the deforestation pass of the compiler, which translates a sequence of such operations into a single fused operation. That fused operation is expressed as an escaped lambda, making it, effectively, Racket code. Ultimately, in the final stage of Qi compilation, code generation to Racket is trivially performed by simply unwrapping the esc.
We noted that with this approach, once the fused operation is produced, the representation is totally opaque. It doesn't convey any clues regarding the sequence of operations to be performed, and the code itself is copious and dense continuation-passing spaghetti that isn't really human-readable. At a minimum, this poses challenges for debugging, but we wondered whether there might be other reasons to seek an alternative approach here.
Representing Fused Streams in the Core Language
We considered the possible addition of a new form, #%deforested (not deforestable but deforested).
This form would essentially model the composition that's already defined on fusable stream components. That is, map and filter are fusable streams, and they may be composed --- under certain conditions (in this case, satisfied) --- to yield another fusable stream. Instead of performing the composition and code generation in a single step (as we do today), we considered that we might separate it into two steps: abstract stream fusion of this kind, followed (much later) by code generation.
This might resemble, in the deforestation pass:
(~> (range 5) (map sqr) (foldl + 0))
(after expansion) being translated to:
(~> (#%deforested
(#%deforestable range info [expr 5])
(#%deforestable map info [floe sqr])
(#%deforestable foldl info [floe +] [expr 0])))
… where #%deforestable is the core form to which each stream component expands, and info is the compile-time metadata carrying the semantics of each specific operation being performed (e.g., the semantics of map), which we may ignore for the moment. This translation would be performed using the usual find-and-map/qi utility which traverses the syntax tree to apply optimizations regardless of depth — so the manner of optimization remains the same, just that we hold off on code generation.
A benefit is that, as #%deforested is a core language form, subsequent stages of compilation would still find it amenable and could leverage it or rewrite it. It would also be more transparent for debugging purposes, either manually or by using a specialized debugging tool (discussed later, below).
Finally, at the code generation stage, we would compile #%deforested to a Racket macro which would generate the continuation-passing based deforested runtime using the existing generate-fused-operation function. In other words, the deforestation pass produces #%deforested forms, and the actual deforested runtime is produced from these at the code generation stage.
We noted that, in fact, the qi0->racket macro which facilitates code generation is already a Racket macro, so it would just be a matter of extending it to include the deforested runtime. As a comment in the code on the subject notes:
;; note: this does not return compiled code but instead,
;; syntax whose expansion compiles the code
(define (compile-flow stx)
(run-passes stx)))
Reflecting on the course the discussion was taking, Dominik quipped that this comment expresses our mission statement for the next phase of deforestation!
Representing Stream Components in the Core Language
For symmetry, we also considered the idea to broaden the cryptic #%deforestable, which represents fusable streams, into an explicit encoding of such fusable streams in the core language as #%producer, #%transformer and #%consumer.
Together with #%deforested, this would allow the deforestation pass to be expressed as a composition of such forms within the core language.
In this connection, it's useful to recall Michael's advice from some time ago, that the API that allows extension of deforestation (i.e., define-deforestable) should fully specify the semantics of the new operation, both naive and deforested. Today, we only specify the naive semantics via this form (by providing an explicit Racket implementation), and, essentially, hardcode awareness of standard operations in the deforestation pass by matching on their names (e.g., filter) in order to generate the corresponding deforested runtime.
In taking the next step, we may rely on a property that our implementation of deforestation to some extent already encodes: any sequence operation may be expressed as a parametrization of an underlying stream runtime. So far we have discerned precisely three of these core runtimes --- producers, transformers, and consumers --- and we have discerned abstract composition rules on these types of streams. We could broaden define-deforestable to allow specifying these parameters together with the appropriate stream type (e.g., producer), to achieve the full semantic specification we seek.
Yet, if the set of parameters together with the stream type is sufficient to define deforested semantics, then it's likely that it must be sufficient to define naive semantics, too. So instead of having two separate encodings of semantics in the define-deforestable form, we could just have one: the set of parameters + the stream type. We could further simplify this by introducing three defining forms, define-producer, define-transformer, and define-consumer, which implicitly encode the stream type, so that only the set of parameters relevant for that type needs to be specified. These could either expand to the existing #%deforestable form, or could be one-to-one with new core forms, #%producer, #%transformer, and #%consumer. The latter seems simpler, especially if we intend to retain these as distinct core language concepts all the way through code generation.
Putting It Together
These two directions seem to bring things into focus. We are engaged in identifying sequence operations that we can model as streams, a standard concept in Scheme (though in this case unrelated to the standard, lazy implementation using stream-cons and empty-stream). Such streams are "fusable," that is, they are composable in certain abstract ways. These streams are increasingly becoming a core part of Qi's semantics, and seem to warrant a faithful representation in the core language.
Perhaps we should name the proposed form for fused sequences simply #%stream, to reflect this.
With these proposed changes, the earlier expansion for range→map→foldl:
(~> (#%deforested
(#%deforestable range info [expr 5])
(#%deforestable map info [floe sqr])
(#%deforestable foldl info [floe +] [expr 0])))
… would instead be:
(~> (#%stream
(#%producer info [expr 5])
(#%transformer info [floe sqr])
(#%consumer info [floe +] [expr 0])))
We no longer need to retain the name of the form (e.g., range) in the expansion, as we are planning to remove the hardcoded identification by name from the deforestation pass. The info data structure would no longer contain a naive runtime, but instead, a unique parametrization defining the operation. This parametrization would determine both deforested as well as naive semantics, handled via the distinct compilations of #%stream, #%producer, #%transformer, and #%consumer — all at the code generation stage.
When we initially implemented forwarding of compilation metadata via the info data structure, we had assumed that the alternative would involve introducing a large number of core forms, or perhaps encoding runtimes explicitly (i.e., actual Racket code) in the core language. This was undesirable aesthetically as well as from a performance standpoint as it would bloat the size of generated code.
But if the unique parametrizations that we are now converging on are small enough, it may be feasible to encode them in the IR after all, as a final possibility:
(~> (#%stream
(#%producer (param ...) [expr 5])
(#%transformer (param ...) [floe sqr])
(#%consumer (param ...) [floe +] [expr 0])))
Of course, this is feasible if our assumption that a parametrization uniquely determines both deforested as well as naive semantics is true. Otherwise, encoding full runtimes in the info struct would remain our best option.
Small Steps
We felt that these directions were very promising, and could greatly reduce complexity in the compiler by clearly separating concerns, but we also wanted to decompose these directions into bounded, and ideally, small, steps.
We discussed some ways to do that. The basic components seem to be:
- Specify the deforested (in addition to naive) runtime using
define-deforestable. - Translate
define-deforestableintodefine-producer,define-transformer, anddefine-consumer(initially, just write these as macros expanding todefine-deforestable) - Don't rely on the name of the form in the deforestation pass to select a specific, hardcoded runtime (e.g.,
list-ref-cstream), and instead, generate it from metadata captured fromdefine-deforestable. - Decompose
#%deforestableinto#%producer,#%transformerand#%consumer. - Introduce
#%streamas the output of the deforestation pass, from rules composing sequences of#%producer,#%transformer, and#%consumer. - Extract the generation of the deforested runtime from the deforestation pass as ordinary code generation (i.e.,
qi0->racket) for the new#%streamform. - Add corresponding code generation for
#%producer,#%transformer, and#%consumerthat leverage the same metadata to generate a naive runtime.
All together, it's sizable, but we discussed that writing standalone code for each aspect before integration could help reduce the complexity, and there may be ways to parallelize the effort.
Probe ... and "Tap"?
One reason this even came up is that Dominik has very high standards and has always looked upon the probe tool with disdain.
Now, this tool is of course handy for triangulating issues with flows by moving the probe to different points in the flow and analyzing the values there. But certain critics say, why even go through the trouble of pinpointing the problem this way? The structure of the flow is already explicit, and, given the inputs, the values at every point are already determined.
It's like looking at tiny rivulets and vortices individually, when we could be looking instead at the river.
It would be nice if we could allow the ability to "tap" into different points in the flow and then just show the overall structure, together with the values at the various points of interest (or even all points). This should ideally even reveal details of compiled structure which may differ from surface structure in the case of deforestation, which causes the "Schrodinger" phenomenon afflicting probe.
To do this, it might be necessary to write "tap" as a compiler pass (using the internal compiler registry API) which constructs a data structure (such as "a DAG, with leaves that may be stacks"). This structure would contain the runtime data encoded along with the structure of the flow. If the structure is retained by our deforestation pass (as per our foregoing discussion) rather than compiled away, then we would be in a position to work with and show this structure to the user.
If the taps are disabled, then they would potentially just get compiled out, as they would be effects on top of multi-valued identity flows (although, the effect form may have special handling in the compiler to honor explicit effects even if they are tied to identity flows --- this is still TBD and depends on our theory of effects). When the taps are enabled, they would produce the desired data structure for analysis and presentation.
In this way, we could have a low-level, holistic, debugging tool that is aware of compiled structure and thus free of "quantum" effects.
The Independent Compilation Divide
The "tap" debugger would especially shine in larger flows. But typically we encourage writing and composing small flows.
In fact, there is a deeper problem with this compositional approach, which is that independently compiled Qi flows cannot be mutually optimized. For example:
#lang racket
(require qi)
(define flowA (flow (~> (filter odd?) (map sqr))))
(define flowB (flow (~> (take 10) (foldl + 0))))
(define flowC (flow (~> flowA flowB)))
(define-flow flow1 (~> (filter odd?) (map sqr)))
(define-flow flow2 (~> (take 10) (foldl + 0)))
(define-flow flow3 (~> flow1 flow2))
Here, flow3 is effectively (~> (filter odd?) (map sqr) (take 10) (foldl + 0)), which would, if it actually appeared this way, fuse into a single stream for the greater good. But because flow1 and flow2 are separately defined and thus independently compiled, their representations within flow3 are opaque and they are not fused.
Compiling Multi-languages?
Hypothetically, in a #lang qi, we would have nonlocal control over the entire processing of source code. In principle, this could allow us to do some static analysis after the read stage on the whole module, and perhaps inline flows so that they can be compiled together, which would resolve the issue in the above example. It feels a little hacky to do this in an ad hoc way, however, as any such rules that we write would be "extrinsic" to the languages they operate on, unlike traditional compilers. So maybe it would make sense, in the context of language oriented programming, to define a formal "translation" stage as part of the parsing lifecycle, in between reading and expansion, which would be able to reason formally about the code in a multi-language setting for the purpose of smoothing over such emergent cases.
Using Compile-time Metadata
A more tangible option we considered that could achieve something like this is to study the approach taken by Laurent Orseau's define2 library.
In this example:
#lang racket
(define (my-func a #:asdf (asdf 1234))
(displayln a))
(my-func 1 #:unknown-key 321)
The use of my-func leads to a runtime error, as there is nothing syntactically wrong with the code.
But adding a simple require at the top:
(require define2)
… causes it to become a compile-time error. This is achieved by employing advanced #%app and #%datum macros that convey compile-time information about the definition to the use site. It's possible that studying the approach taken here would allow us to modify define-flow to similarly propagate information nonlocally in a source module, which we could perhaps employ to achieve such fused compilation. In this case, only flows within the scope of a define-flow would be able to achieve this kind of nonlocal compilation, and not flows defined in any other way such using define or even inline with other flows.
Other Topics of Discussion
We also discussed our feeling that there's a real niche for a toplevel Scheme-based flow-oriented language, which could seamlessly employ the various flow backends (functions, futures, CUDA, AWS Lambda, etc. — and even allow users to easily add new backends) in a single language.
We also considered Qi's application to study neural networks (but not necessarily for practical implementation, as matrix-based brute force GPU computations are the most performant paradigm here, and that doesn't necessarily suggest flow-orientation), writing CUDA primitives that Qi could compile to, and discussed many other interesting things.
Next Steps
(Some of these are carried over from last time)
- Take the next small steps on deforestation architecture.
- Define the
define-producer,define-transformer, anddefine-consumerinterface for extending deforestation, and re-implement existing operations using it. - Write more reliable nonlocal benchmarks (in
vlibench) - Undertake a "fantastic voyage" to get to the bottom of the performance puzzle in using
compose-with-values. [see also: PR #191] - Implement more fusable stream components like
drop,append, andmember. [issue #118] - Why is
range-map-carslower against Racket following the Qi 5 release? - Ensure (eventually) that both codegen and deforested runtimes are included in the info struct and not in the IR.
- Resolve the issue with bindings being prematurely evaluated. [issue syntax-spec#61]
- Fix the bug in using bindings in deforestable forms like
range[issue #195] - Write a proof-of-concept compiling Qi to another backend (such as threads or futures), and document the recipe for doing this.
- Review Cover's methodology for checking coverage (e.g. wrt. phases) [related issue: #189]
- Improve unit testing infrastructure for deforestation.
- Decide on appropriate reference implementations to use for comparison in the new benchmarks report and add them.
- Decide on whether there will be any deforestation in the Qi core, upon
(require qi)(without(require qi/list))
Attendees
Dominik, Sid