Into the Depths

Qi Meeting Mar 21 2025

Adjacent meetings: Previous | Up | Next

Summary

We continued investigating last week's performance puzzle and found few answers and more questions.

Background

Last week, we discovered that using call-with-values in the Qi compiler vs using it in a standalone script produced subtly different "linkets" when compiled, but that they eventually converged at the CP0 stage of compilation (a later stage). This suggested that, in principle, the results seen in standalone benchmarks should also be seen in Qi. Specifically, we should expect to see the performance improvement over using compose. We did not see this, and instead, the crude benchmark we were using for the purpose appeared to show worse performance.

Picking Up the Scent

First, we summarized the factors affecting performance of composing using Qi's ~>:

Qi supports variadic runtime arguments and not a fixed number.

A fixed number would perform better, at the cost of flexibility.

The composition of functions done by ~> is evaluated each time before being applied.

Instead, if we could evaluate the composed flow just once and then apply it many times, that would also improve the benchmark result.

But ultimately, we agreed that these concerns are orthogonal to the issue we observed last time, where both of these factors are controlled, kept constant between the standalone and the Qi implementation, and yet, the observed performance is different in the two cases.

So what exactly is going on there?

Establishing the Facts

We could continue benchmarking two totally different implementations by modifying the code and re-running our crude benchmark each time, but that would take forever and give us the same dubious results we've been working with. We needed reliable data now. So Dominik had the brilliant idea to use our super secret internal "API" for overriding the compiler, the compiler registry, for this purpose.

The compiler registry is a simple compile-time list of Qi's compiler passes, where each member is an instance of a data type that represents a compiler pass. Each pass has an associated macro that fulfills the code transformation, along with a priority — a simple number — which determines its position in the order of compiler passes. This registry is mutable, which means that we can change the order of compiler passes, or insert new passes, at the user level, without touching the compiler code, simply by using (define-and-register-compiler-pass my-pass-name my-priority rewrite-rules ...)!

The same flexibility that makes it risky to expose externally (which is why it's "secret") is the very thing that makes it a powerful tool for Qi developers, as it can greatly simplify working with the compiler. Which is exactly what we needed to do right now.

We wanted to rigorously benchmark the two versions of the code — compose vs call-with-values — as they would be used in the compiler itself, using the vlibench benchmarking tool. This tool performs a large number of runs with a large number of input values. It is statistically rigorous and also accounts for ambient factors like garbage collection that may affect performance. We modified qi-sdk/benchmarks/competitive/report.scrbl (the entrypoint to vlibench) to include just two modules. One of them simply defined a flow like:

(define-flow comp (~> add1 sqr sub1))

While the other did the same, but also used define-and-register-compiler-pass to insert a new compiler pass prior to the final, code-generation ("Qi0 → Racket"), stage of the Qi compiler (which produces Racket code). This ad hoc pass simply matched syntax like (~> f ...) and rewrote it to the call-with-values implementation instead of the compose implementation that would otherwise be done at the Qi0 stage.

These were the results:

(full results --- ignore the title of the page)

This confirmed that, indeed, instead of being 2x faster, the call-with-values implementation is markedly slower than the compose version, when used in the Qi compiler.

Handholds on Reality

Now that we'd established that this result was legitimate, we decided to dig deeper on the compilation output, to see whether there is anything clever happening in the compose version that might account for its superior performance. We used this module:

#lang racket/base

(require qi racket/math)

(define test-threading (flow (~> add1 sqr sub1)))

(test-threading 3)

And boy, did we go deep. Not content to look at CP0 output, we used the PLT_LINKLET_SHOW_PASSES=all flag to see the entire sequence of the stages of the code right down to the machine level code produced by Chez Scheme!

We noted that the function application was the last line of the source module, and as it was applied to the argument 3, we were (that is, Dominik was) able to track the function application through many levels of unreadable nests of parentheses and increasingly austere identifiers. At some point, the number 3, our one handle on reality, suddenly disappeared without a trace.

Man and Machine

We fumbled around looking for purchase, but all we could find was the mysterious number 24. Wait, was this actually the output of the program?? Maybe it was inferring the result at compile time to avoid doing it at runtime? That would be clever, indeed.

But we realized that, in fact, (3+1)²-1 is 15, not 24. It would have been 24 if the input were 4, but that wasn't the case.

Then it occurred to us that this was an instance of "tagging," where the compiler is using the first few bits of a value as a "tag" of some kind, and in order to do that, it left-shifts the value in a bitwise way by a certain number of places. In this case, that's 3 places. 3 (00011) left-shifted by 3 (11000) is 24! So the mysterious 24 that had appeared was just our original 3 in another guise.

This became our new handle on reality, as we descended further and further, into ever-deeper levels of the Racket and then Chez Scheme compiler. One compiler pass we came across was named np-unbox-fp-vars, which was added by Matthew Flatt to overcome a particular challenge with floating point numbers in the transition to Chez Scheme. We continued going deeper. Lambdas turned to lets, anonymous functions gained names, closures came and went, "ruined paths" were bound and redeemed, attachments were recognized and unbound, case-lambdas flattened out of existence. At one point, we encountered a single line that was 22,000 characters long and full of mystical symbols. But we persisted, down through all the strange, inscrutable levels, until we came at last to something that looked like assembly code.

At this point, Dominik exclaimed, "Yes, show me this code!" Adding, "... ideally in a more readable form." In these loads and stores and register references, Dominik seemed to see function calls and answers. There, he seemed to find meaning, written clear as jczowy6yjfz400ntojb.

He summarily announced, "Without a doubt, there is nothing clever going on here!"

Evidently, after all was said and done, it was composing the functions and applying them to variadic input arguments, essentially as written in the source module.

So if compose is faster, it isn't because of anything fancy going on here. Instead, it must be because of something unclever happening in the call-with-values compilation in the way we are using it. What might it be? We don't know. The full compilation output of the module using call-with-values was something like 52 MB, and we weren't feeling up to diving into it just yet. Perhaps next time!

Next Steps

(Some of these are carried over from last time)

Write more reliable nonlocal benchmarks (in vlibench)
Get to the bottom of the performance puzzle in using compose-with-values.
Define the define-producer, define-transformer, and define-consumer interface for extending deforestation, and re-implement existing operations using it.
Implement more fusable stream components like drop, append, and member. [issue #118]
Why is range-map-car slower against Racket following the Qi 5 release?
Why is the benefit of call-with-values in isolation not manifesting when incorporated into the Qi compiler? [PR #191]
Ensure (eventually) that both codegen and deforested runtimes are included in the info struct and not in the IR.
Resolve the issue with bindings being prematurely evaluated. [issue syntax-spec#61]
Fix the bug in using bindings in deforestable forms like range [issue #195]
Write a proof-of-concept compiling Qi to another backend (such as threads or futures), and document the recipe for doing this.
Review Cover's methodology for checking coverage (e.g. wrt. phases) [related issue: #189]
Improve unit testing infrastructure for deforestation.
Decide on appropriate reference implementations to use for comparison in the new benchmarks report and add them.
Decide on whether there will be any deforestation in the Qi core, upon (require qi) (without (require qi/list))

Attendees

Dominik, Hunan, Sid, Stephen

Qi Meeting Mar 21 2025 - drym-org/qi GitHub Wiki

Into the Depths

Summary

Background

Picking Up the Scent

Establishing the Facts

Handholds on Reality

Man and Machine

Next Steps

Attendees

⚠️ GitHub.com Fallback ⚠️

Qi Meeting Mar 21 2025 - drym-org/qi GitHub Wiki

Into the Depths

Summary

Background

Picking Up the Scent

Establishing the Facts

Handholds on Reality

Man and Machine

Next Steps

Attendees

⚠️ **GitHub.com Fallback** ⚠️

⚠️ GitHub.com Fallback ⚠️