# Qi Meeting Mar 14 2025
We investigated a performance puzzle where compiling the threading form ~> to call-with-values instead of compose wasn't showing the results we expected, and came upon some insights. We also discussed some ideas for integrating Qi and Rhombus.
Some time ago, in discussion on Discord, Sam TH suggested that compiling ~> to a chain of call-with-values would perform better than using compose. He provided some standalone benchmarks that demonstrated the predicted improvement. But when Sid incorporated it into the Qi compiler, the basic "smoke" benchmarks in the Qi SDK did not show the expected improvement, and we didn't have any obvious suspects.
We've recently been talking about language facilities like list comprehensions and deforested higher-order functional sequences, and how Qi could provide the latter for Racket-based languages, including Rhombus.
We've also previously discussed the possibility of using the flow abstraction in different settings so that it is backed by different (but isomorphic) runtimes: for instance, flows as functions in ordinary settings (usual Qi) vs. as futures in multi-process settings (perhaps signaled syntactically, by use of a different form).
Many of these are well-bounded and seemingly orthogonal directions, yet they could also be harmoniously woven together by a single toplevel #lang. Equally, a toplevel #lang could profitably integrate specific semantics in certain settings without a "full" integration. For instance, it could use just deforested list or sequence semantics ("transducers"), or just future-backed flows, in specific contexts.
We considered that perhaps Matthew and the Rhombus team would be interested in hearing about these ideas as that project gets closer to launch, in case an integration seems desirable and any attendant considerations may be relevant at this stage.
Guide to reading this section: Many of the code samples below are long, but usually right after them, there is a summarizing observation pointing out the important bits we noticed.
Regarding the performance puzzle mentioned earlier, Sam had originally provided the following benchmarking script to prove the predicted improvement from using call-with-values:
```racket
(define (add1 x) (+ x 1))
(define (sub1 x) (- x 1))

(define (f1 x)
  (add1 (sub1 x)))

(define (f2 . x)
  (call-with-values
   (lambda ()
     (call-with-values
      (lambda ()
        (apply values x))
      sub1))
   add1))

(define (f3 x)
  ((compose add1 sub1) x))

(define (f4 x)
  ((compose1 add1 sub1) x))

(define N 100000000)

(collect-garbage)

'app
(time (for/fold ([k 0]) ([i (in-range N)])
        (f1 i)))
'call-with-values
(time (for/fold ([k 0]) ([i (in-range N)])
        (f2 i)))
'compose
(time (for/fold ([k 0]) ([i (in-range N)])
        (f3 i)))
'compose1
(time (for/fold ([k 0]) ([i (in-range N)])
        (f4 i)))
```
The results at the time (on Racket 8.6, I believe) were:
```
'app
cpu time: 1764 real time: 1835 gc time: 9
99999999
'call-with-values
cpu time: 2373 real time: 2472 gc time: 33
99999999
'compose
cpu time: 9486 real time: 11408 gc time: 25
99999999
'compose1
cpu time: 7579 real time: 8128 gc time: 24
99999999
```
[NOTE: the 2373 value was actually obtained on a version of the code that assumed a single input value. Handling arbitrary number of input arguments (as f2 does above) brought the time up to something like 3 or 4 seconds]
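For reference, here is a minimal sketch of what that single-input variant would presumably have looked like (our reconstruction, assuming the obvious specialization of f2, not Sam's original code):

```racket
;; Assumed reconstruction of the single-input variant of f2 mentioned in
;; the note above: no rest argument and no apply, just nested
;; call-with-values.
(define (f2-single x)
  (call-with-values
   (lambda ()
     (call-with-values
      (lambda () (values x))
      sub1))
   add1))

(f2-single 5) ; → 5
```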
Sid added a compose-with-values macro to the benchmarks that just rewrites an apparent composition to the call-with-values (f2) version:
```racket
(define-syntax-parser compose-with-values
  [(_ args)
   #'(apply values args)]
  [(_ args f g ...)
   #'(call-with-values
      (λ ()
        (compose-with-values args g ...))
      f)])

(define-syntax-parser comp
  [(_ f ...) #'(λ args
                 (compose-with-values args
                                      f ...))])

(define (f5 x)
  ((comp add1 sub1) x))
```
And this achieved comparable performance to the hand-written call-with-values.
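Tracing the macros by hand (per the rewrite rules above), `(comp add1 sub1)` expands to essentially the hand-written `f2`. The name `comp-expansion` below is ours, for illustration:

```racket
;; Hand-traced expansion of (comp add1 sub1): a variadic lambda that
;; threads its arguments through sub1 and then add1 via call-with-values.
(define comp-expansion
  (λ args
    (call-with-values
     (λ ()
       (call-with-values
        (λ () (apply values args))
        sub1))
     add1)))

(comp-expansion 5) ; → 5
```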
Encouraged, Sid promptly made this change in the Qi compiler, expecting to see a similar speedup in Qi's benchmarks. But running:
```
$ make benchmark-nonlocal

Running nonlocal benchmarks...
[...]
composition:
12 ms
```
... before and after the change, showed a slight slowdown instead, to something like 19 ms for the above benchmark. Of course, the numbers are so small and the variance is high enough that it's hard to know for sure, but this benchmark appeared to consistently show a 2x slowdown instead of a 2x speedup!
Investigating the CP0 output showed some differences, but we didn't see any obvious culprits at the time.
We returned to it this time to see if we could get to the bottom of it.
Since the last time we ran these benchmarks, Sid has upgraded to Racket 8.16 (from Racket 8.6!). Right away, we noticed some changes:
```
'app
cpu time: 1429 real time: 1474 gc time: 1
99999999
'call-with-values
cpu time: 3861 real time: 4050 gc time: 44
99999999
'compose
cpu time: 5919 real time: 6123 gc time: 20
99999999
'compose1
cpu time: 6101 real time: 6332 gc time: 22
99999999
'compose-with-values
cpu time: 3722 real time: 4005 gc time: 38
99999999
```
First, the numbers were generally significantly lower (i.e. faster). But also, compose was now performing as well as compose1! Perhaps, for such simple cases, the Racket compiler is now able to rewrite compose into compose1. In any case, the important point is that call-with-values is still almost 2x as fast, so we would still like to take advantage of it in the Qi compiler.
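As a refresher on the difference between the two (and on why `compose1` can be cheaper): `compose` threads arbitrarily many values between stages, while `compose1` restricts every stage to a single value.

```racket
;; compose allows multiple values to flow between stages:
((compose (λ (x y) (+ x y))
          (λ (v) (values v 10)))
 5)
;; → 15

;; compose1 is the single-value-only variant:
((compose1 add1 sub1) 5)
;; → 5
```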
Our goal was to understand: why does using compose-with-values in the Qi compiler not gain the performance benefits we see in the standalone benchmarks?
Our first guess was that it might have to do with the fact that the standalone benchmarks have all the definitions in a single module, whereas in the Qi codebase, definitions are required across module boundaries, which might potentially interfere with optimizations like inlining. So we tried putting the standalone implementation inside a submodule:
```racket
(module submod1 racket/base
  (provide compose-with-values
           comp
           f5)

  (require syntax/parse/define
           (for-syntax racket/base))

  (define-syntax-parser compose-with-values
    [(_ args)
     #'(apply values args)]
    [(_ args f g ...)
     #'(call-with-values
        (λ ()
          (compose-with-values args g ...))
        f)])

  (define-syntax-parser comp
    [(_ f ...) #'(λ args
                   (compose-with-values args
                                        f ...))])

  (define (f5 x)
    ((comp add1 sub1) x)))
```
But, weirdly enough, instead of showing degradation, the benchmark now seemed to show a marked improvement, by almost 50%.
```
'compose-with-values
cpu time: 2961 real time: 3104 gc time: 27
```
This was interesting, but not what we were trying to understand at the moment, so we continued trying other things.
Next, to continue exploring the performance space, we guessed that the functions in the compose implementation are probably being composed every time during invocation, prior to their application, and maybe avoiding that could help. We tried redefining the compose and compose1 versions to "lift" the composition out so that it's done just once and then reused on each invocation.
```racket
(define f3-helper (compose add1 sub1))

(define (f3 x)
  (f3-helper x))

(define f4-helper (compose1 add1 sub1))

(define (f4 x)
  (f4-helper x))
```
This showed:
```
'compose
cpu time: 1478 real time: 1544 gc time: 0
'compose1
cpu time: 1451 real time: 1517 gc time: 0
```
That is, compose and compose1 were now as fast as the hand-written, manual composition.
This "lifted composition" reaffirmed the upper bound on performance. We considered whether we could "lift" compositions out of flows in the Qi compiler, the same way that we lift bindings out, using "lifting lets." But we realized that we couldn't do that, because it wouldn't be guaranteed to always work. For instance, it would likely break in something like this where one of the functions in the composition is mutated within the flow:
```racket
(~> (add1) (as f) (esc (λ () (set! f sub1))) f sub1 sqr)
```
A pathological case, to be sure, but still.
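The hazard is visible even in plain Racket (an illustrative sketch, not Qi compiler output): a lifted composition captures the composed functions' values once, so a later `set!` goes unnoticed.

```racket
(define f add1)

(define lifted (compose f))            ; composed once; captures f's current value
(define (unlifted x) ((compose f) x))  ; re-composed on every call

(set! f sub1)

(lifted 5)   ; → 6, still uses the original add1
(unlifted 5) ; → 4, observes the mutation
```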
We returned again to compose-with-values. Why doesn't it perform as well as expected in the Qi compiler?
It was time to start looking at the actual compiled output produced by the standalone implementation (i.e. f5 in the benchmarks above) vs the Qi compiler implementation, to see what was different.
We used the following Qi code:
```racket
(require qi)

(define (qcompose x)
  (~> (x) sub1 add1))
```
We obtained the compiled output in the usual way, by setting the PLT_LINKLET_SHOW_CP0 environment variable:
```
$ PLT_LINKLET_SHOW_CP0=1 raco make scratch.rkt 2>&1 | less
```
We noticed that the standalone version compiled to the linklet:
```
;; compile-linklet: step: linklet
;; ---------------------
(linklet ((.get-syntax-literal!) (.set-transformer!)) (qcompose)
  (void)
  (define-values (qcompose)
    (#%name
     qcompose
     (lambda (x_32)
       ((lambda args_33
          (call-with-values
           (lambda ()
             (call-with-values (lambda () (apply values args_33)) sub1))
           add1))
        x_32))))
  (void))
```
... whereas the Qi compiler version compiled to:
```
;; compile-linklet: step: linklet
;; ---------------------
(linklet ((.get-syntax-literal!) (.set-transformer!))
         (qcompose lifted/2.1.unreadable:lifted/2.1) (void)
  (define-values (lifted/2.1.unreadable:lifted/2.1) (void))
  (define-values (qcompose)
    (#%name
     qcompose
     (lambda (x_1)
       ((let-values ()
          (lambda args_3
            (call-with-values
             (lambda ()
               (call-with-values (lambda () (apply values args_3)) sub1))
             add1)))
        x_1))))
  (void))
```
We noticed that there was an extra let-values in the Qi version, but the two otherwise appeared to be the same.
Next, at the "schemified" stage, the standalone version:
```
;; compile-linklet: step: schemified
;; ---------------------
(lambda (instance-variable-reference .get-syntax-literal!1
         .set-transformer!2 qcompose3)
  (define qcompose
    (#%name
     qcompose
     (lambda (x_32)
       (let ([args_33 (list x_32)])
         (#%call-with-values
          (#%name
           \x5B;...37ca8a1/scratch.rkt:10:6
           (lambda ()
             (#%call-with-values
              (#%name
               \x5B;...37ca8a1/scratch.rkt:10:6
               (lambda () (apply values args_33)))
              sub1)))
          add1)))))
```
And the Qi version:
```
;; compile-linklet: step: schemified
;; ---------------------
(lambda (instance-variable-reference .get-syntax-literal!1
         .set-transformer!2 qcompose3 lifted/2.14)
  (define lifted/2.1.unreadable:lifted/2.1 '#<void>)
  (define qcompose
    (#%name
     qcompose
     (lambda (x_1)
       ((#%name
         \x5B;...mpiler/1000-qi0.rkt:171:9
         (lambda args_3
           (#%call-with-values
            (#%name
             \x5B;...ow/core/runtime.rkt:42:6
             (lambda ()
               (#%call-with-values
                (#%name
                 \x5B;...ow/core/runtime.rkt:42:6
                 (lambda () (apply values args_3)))
                sub1)))
            add1)))
        x_1))))
```
We noted the presence of the (let ([args_33 (list x_32)]) in the former, and took it to mean that the compiler was inferring a single argument in this case, which is more efficient than using a variadic implementation. Since the only difference at the linklet stage was the empty let-values form, we assumed that it was somehow blocking this lower-level compiler optimization. To trace the origin of the let-values form, we stepped through the expansion stages of the Qi expression in the macro stepper.
We discovered that it was introduced at this step:
```racket
(define-values (qcompose)
  (lambda (x)
    (#%app
     (let ()
       (qi0->racket
        (thread (esc (#%host-expression sub1))
                (esc (#%host-expression add1)))))
     x)))
```

→ [Macro transformation]

```racket
(define-values (qcompose)
  (lambda (x)
    (#%app
     (let-values ()
       (qi0->racket
        (thread (esc (#%host-expression sub1))
                (esc (#%host-expression add1)))))
     x)))
```
At this point, we realized that the original let here is introduced as part of our handling of bindings in the compiler by "lifting lets" out of the flow, which we were just talking about earlier in a different context.
As there are no bindings in the code here, we felt we could make a change to conditionally apply the bindings transformation only where bindings are actually present, and avoid it otherwise. After making this change, the expansion of the two expressions was now the same.
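In spirit, the change looks something like the following sketch (hypothetical: the names `contains-bindings?` and `maybe-lift-bindings`, and the recognition of just the `as` form, are ours for illustration and are not the actual Qi compiler identifiers):

```racket
(require syntax/parse)

;; Hypothetical sketch: detect binding forms (here, just `as`) anywhere in
;; a flow expression, and only then wrap it in the (let () ...) used for
;; lifting bindings; otherwise leave the flow expression untouched.
(define (contains-bindings? stx)
  (syntax-parse stx
    #:datum-literals (as)
    [(as _ ...) #t]
    [(e ...) (ormap contains-bindings? (syntax->list #'(e ...)))]
    [_ #f]))

(define (maybe-lift-bindings flow-stx)
  (if (contains-bindings? flow-stx)
      #`(let () #,flow-stx)  ; bindings present: apply the lifting transformation
      flow-stx))             ; no bindings: leave the expansion alone
```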
We tried running the Qi benchmark again:
```
$ make benchmark-nonlocal

Running nonlocal benchmarks...
[...]
composition:
19 ms
```
Hmm, it still didn't show the improvement we were looking for. But we felt that this benchmark was too crude to give us a reliable result, and that we should instead write a proper benchmark to probe this further in vlibench. We were optimistic that this would show us the result we were looking for.
But looking at it again now in preparing these meeting notes, it appears that even though the schemified stages differ between the standalone and Qi versions above, the CP0 versions (a later stage in compilation) are equivalent!
Standalone:
```
;; compile-linklet: step: cp0
;; ---------------------
(lambda (instance-variable-reference.47
         .get-syntax-literal!1.48 .set-transformer!2.49 qcompose3.50)
  (letrec ([qcompose.51 (lambda (x_32.52)
                          (let ([args_33.53 (#2%list x_32.52)])
                            (#%call-with-values
                             (lambda ()
                               (#%call-with-values
                                (lambda () (apply #2%values args_33.53))
                                #2%sub1))
                             #2%add1)))])
```
Qi:
```
;; compile-linklet: step: cp0
;; ---------------------
(lambda (instance-variable-reference.92
         .get-syntax-literal!1.93 .set-transformer!2.94 qcompose3.95
         lifted/2.14.96)
  (letrec ([qcompose.97 (lambda (x_1.98)
                          (let ([args_3.99 (#3%list x_1.98)])
                            (#%call-with-values
                             (lambda ()
                               (#%call-with-values
                                (lambda () (apply #2%values args_3.99))
                                #2%sub1))
                             #2%add1)))])
```
So, perhaps, this was not the cause of the difference in benchmarking results after all! Yet, if the expansion is the same, why wouldn't the performance be the same? Is there anything about the benchmark itself that might account for the difference? Is call-with-values actually faster than naive compose, or isn't it?
The plot thickens.
## Next Steps

(Some of these are carried over from last time)

- Write more reliable nonlocal benchmarks (in vlibench).
- Get to the bottom of the performance puzzle in using `compose-with-values`.
- Define the `define-producer`, `define-transformer`, and `define-consumer` interface for extending deforestation, and re-implement existing operations using it.
- Implement more fusable stream components like `drop`, `append`, and `member`. [issue #118]
- Why is `range-map-car` slower against Racket following the Qi 5 release?
- Why is the benefit of `call-with-values` in isolation not manifesting when incorporated into the Qi compiler? [PR #191]
- Ensure (eventually) that both codegen and deforested runtimes are included in the info struct and not in the IR.
- Resolve the issue with bindings being prematurely evaluated. [issue syntax-spec#61]
- Fix the bug in using bindings in deforestable forms like `range`. [issue #195]
- Write a proof-of-concept compiling Qi to another backend (such as threads or futures), and document the recipe for doing this.
- Review Cover's methodology for checking coverage (e.g. wrt. phases). [related issue: #189]
- Improve unit testing infrastructure for deforestation.
- Decide on appropriate reference implementations to use for comparison in the new benchmarks report, and add them.
- Decide on whether there will be any deforestation in the Qi core, upon `(require qi)` (without `(require qi/list)`).
## Attendees

Dominik, Sid, Stephen