chacha_simd_bench - xero/leviathan-crypto GitHub Wiki

ChaCha20 SIMD 4-Wide Benchmark Results

Measured throughput for the 4-wide inter-block SIMD implementation (chachaEncryptChunk_simd) across Chromium, Firefox, WebKit, and Bun. See chacha_audit.md for algorithm correctness verification.


Environment

4-wide inter-block SIMD (chachaEncryptChunk_simd): each v128 register holds state word w from four different blocks, one per lane (counters ctr, ctr+1, ctr+2, ctr+3). This is the same parallelism model used in Serpent CTR-4.
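The lane layout described above can be illustrated with a minimal sketch in plain TypeScript. It does not use WASM SIMD and is not taken from the project sources: a 4-entry Uint32Array stands in for each v128 register, and every name (initState, scalarBlock, fourBlocks, QR_INDICES) is hypothetical. It only shows that running the round function lane-wise over four counters gives the same words as four independent scalar blocks.

```typescript
// Sketch only: plain TypeScript standing in for WASM SIMD. All names are
// hypothetical and are not taken from the leviathan-crypto sources.
type Lanes = Uint32Array; // 4 entries, stand-in for one v128 (i32x4)

const rotl = (x: number, n: number): number => ((x << n) | (x >>> (32 - n))) >>> 0;

// Initial 16-word ChaCha20 state (RFC 8439): constants, key, counter, nonce.
function initState(key: Uint32Array, ctr: number, nonce: Uint32Array): Uint32Array {
  const s = new Uint32Array(16);
  s.set([0x61707865, 0x3320646e, 0x79622d32, 0x6b206574], 0);
  s.set(key, 4);
  s[12] = ctr;
  s.set(nonce, 13);
  return s;
}

// The eight quarter-rounds of one double-round, as state-word indices.
const QR_INDICES: number[][] = [
  [0, 4, 8, 12], [1, 5, 9, 13], [2, 6, 10, 14], [3, 7, 11, 15], // columns
  [0, 5, 10, 15], [1, 6, 11, 12], [2, 7, 8, 13], [3, 4, 9, 14], // diagonals
];

// Scalar reference: one 64-byte block as 16 little-endian words.
function scalarBlock(key: Uint32Array, ctr: number, nonce: Uint32Array): Uint32Array {
  const init = initState(key, ctr, nonce);
  const s = new Uint32Array(init);
  for (let r = 0; r < 10; r++) {
    for (const [a, b, c, d] of QR_INDICES) {
      s[a] += s[b]; s[d] = rotl(s[d] ^ s[a], 16);
      s[c] += s[d]; s[b] = rotl(s[b] ^ s[c], 12);
      s[a] += s[b]; s[d] = rotl(s[d] ^ s[a], 8);
      s[c] += s[d]; s[b] = rotl(s[b] ^ s[c], 7);
    }
  }
  for (let w = 0; w < 16; w++) s[w] += init[w]; // feed-forward
  return s;
}

// Lane-wise helpers: each call models vector work applied to all four blocks.
function add4(x: Lanes, y: Lanes): void {
  for (let j = 0; j < 4; j++) x[j] += y[j];
}
function xorRotl4(x: Lanes, y: Lanes, n: number): void {
  for (let j = 0; j < 4; j++) x[j] = rotl(x[j] ^ y[j], n);
}

// Four blocks at once: v[w] holds word w of blocks ctr, ctr+1, ctr+2, ctr+3.
function fourBlocks(key: Uint32Array, ctr: number, nonce: Uint32Array): Uint32Array[] {
  const base = initState(key, ctr, nonce);
  const init: Lanes[] = [];
  for (let w = 0; w < 16; w++) {
    // Splat word w into all lanes; only the counter word differs per lane.
    init.push(w === 12
      ? new Uint32Array([ctr, ctr + 1, ctr + 2, ctr + 3])
      : new Uint32Array([base[w], base[w], base[w], base[w]]));
  }
  const v = init.map((lanes) => new Uint32Array(lanes));
  for (let r = 0; r < 10; r++) {
    for (const [a, b, c, d] of QR_INDICES) {
      add4(v[a], v[b]); xorRotl4(v[d], v[a], 16);
      add4(v[c], v[d]); xorRotl4(v[b], v[c], 12);
      add4(v[a], v[b]); xorRotl4(v[d], v[a], 8);
      add4(v[c], v[d]); xorRotl4(v[b], v[c], 7);
    }
  }
  // Feed-forward, then de-interleave lanes back into four 16-word blocks.
  const out: Uint32Array[] = [];
  for (let j = 0; j < 4; j++) {
    const block = new Uint32Array(16);
    for (let w = 0; w < 16; w++) block[w] = v[w][j] + init[w][j];
    out.push(block);
  }
  return out;
}
```

In this sketch the diagonal quarter-rounds only change which registers are used (the index tuples in QR_INDICES); no data moves between lanes, because each lane belongs to a different block.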

  • Date: 2026-03-27
  • Hardware: Apple Silicon (arm64)
  • Bun version: measured via bun run test
  • Browsers: Playwright; Chromium, Firefox, WebKit
  • Benchmark: test/e2e/chacha20_simd_bench.spec.ts
    • 50 warmup iterations, then 200–5000 timed trials per chunk size
    • Key: RFC 8439 §2.4.2 all-zero-sequential, Nonce: SWEEP_NONCE
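The warmup-then-timed-trials procedure can be sketched roughly as follows. This is not the code in test/e2e/chacha20_simd_bench.spec.ts; the function names, the injectable clock, and the convention that 1 MB = 10^6 bytes are all assumptions made for illustration.

```typescript
// Hypothetical harness; names and the MB = 10^6 bytes convention are assumptions.
type Clock = () => number; // returns a timestamp in milliseconds

function measureMBps(
  fn: () => void,         // one encryption of a chunk
  chunkBytes: number,     // bytes processed per call
  warmup: number,         // untimed iterations, e.g. 50
  trials: number,         // timed iterations, e.g. 200
  now: Clock = Date.now,  // swap in a high-resolution timer where available
): number {
  for (let i = 0; i < warmup; i++) fn(); // let the JIT tier up before timing
  const t0 = now();
  for (let i = 0; i < trials; i++) fn();
  const elapsedSeconds = (now() - t0) / 1000;
  return (chunkBytes * trials) / elapsedSeconds / 1e6;
}

// Speedup as reported in the tables: SIMD throughput divided by scalar throughput.
function speedup(scalarMBps: number, simdMBps: number): number {
  return simdMBps / scalarMBps;
}
```

Timing the whole batch of trials rather than each call keeps timer resolution from dominating at small chunk sizes such as 256 B.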

Browser throughput

Single thread.

Chromium (V8)

| Chunk size | Scalar (MB/s) | SIMD (MB/s) | Speedup |
|---|---|---|---|
| 65,536 B | 506.1 | 1285.0 | 2.54× |
| 16,384 B | 512.0 | 1204.7 | 2.35× |
| 256 B | 328.2 | 711.1 | 2.17× |

Firefox (SpiderMonkey)

| Chunk size | Scalar (MB/s) | SIMD (MB/s) | Speedup |
|---|---|---|---|
| 65,536 B | 24.9 | 60.1 | 2.42× |
| 16,384 B | 23.4 | 56.9 | 2.43× |
| 256 B | 22.5 | 53.3 | 2.38× |

WebKit (JSC)

| Chunk size | Scalar (MB/s) | SIMD (MB/s) | Speedup |
|---|---|---|---|
| 65,536 B | 409.6 | 1191.6 | 2.91× |
| 16,384 B | 431.2 | 1365.3 | 3.17× |
| 256 B | 256.0 | 426.7 | 1.67× |

Bun

JavaScriptCore-based; measured via extended benchmark in test/unit/chacha20/chacha20_simd_4x_gate.test.ts (50 warmup, 200 trials):

| Chunk size | Scalar (MB/s) | SIMD (MB/s) | Speedup |
|---|---|---|---|
| 65,536 B | ~310–330 | ~970–1030 | ~3.11× |
| 16,384 B | ~310–330 | ~980–1050 | ~3.17× |

Analysis

Inter-block SIMD delivers roughly 2–3× gains on every tested runtime (measured range: 1.67× to 3.17×).

Firefox (SpiderMonkey) has significantly lower absolute throughput (~22–60 MB/s vs ~250–1365 MB/s on V8/JSC) for both scalar and SIMD paths. This is a known SpiderMonkey characteristic for tight WASM inner loops with many fixed-address loads; SpiderMonkey does not perform the same alias-analysis-based register promotion that V8 applies. Despite the lower absolute numbers, the speedup ratio is consistent (2.38–2.43×); SpiderMonkey benefits from SIMD proportionally.

SIMD recovers Firefox throughput relative to scalar. The scalar path relies on fixed-address loads to the state matrix (CHACHA_STATE_OFFSET). V8/JSC recognise these as loop-invariant and register-promote them. SpiderMonkey does not, paying memory traffic on every iteration. The SIMD path loads all 16 state words once into v128 locals before the round loop, making the loop-invariant promotion explicit in the code, so SpiderMonkey sees the same working set as V8/JSC.

256-byte inputs (minimum SIMD threshold, exactly one 4-block group) show a smaller gain on WebKit (1.67×) and a larger gain on Firefox (2.38×). At this size the loop-body overhead is proportionally larger; the larger Firefox gain follows from the v128 local benefit described above.
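The relationship between chunk size and 4-block groups is simple arithmetic: a ChaCha20 block is 64 bytes, so one group is 256 bytes. The sketch below works this out for a given chunk length. The function name, the return shape, and the assumption that any tail shorter than 256 bytes falls back to scalar blocks are hypothetical and are not confirmed by the source.

```typescript
// Hypothetical planning helper; tail handling is an assumption, not the library's behaviour.
const BLOCK_BYTES = 64;                   // one ChaCha20 block
const SIMD_GROUP_BYTES = 4 * BLOCK_BYTES; // four blocks per SIMD group = 256 bytes

interface ChunkPlan {
  simdGroups: number;   // number of full 4-block groups
  scalarBlocks: number; // blocks (possibly partial) left for a scalar tail
}

function planChunk(byteLength: number): ChunkPlan {
  const simdGroups = Math.floor(byteLength / SIMD_GROUP_BYTES);
  const tailBytes = byteLength - simdGroups * SIMD_GROUP_BYTES;
  const scalarBlocks = Math.ceil(tailBytes / BLOCK_BYTES);
  return { simdGroups, scalarBlocks };
}
```

Under these assumptions a 256-byte input is a single group, while the 16,384-byte and 65,536-byte inputs are 64 and 256 groups, so fixed per-call overhead is amortised over far more work at the larger sizes.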


Negative result: intra-block SIMD

A prior attempt at intra-block SIMD (a single block held in v128 registers, with shuffles) was benchmarked across four attempts and measured 0.60×, 0.72×, 0.71×, and 0.70× of scalar throughput, uniformly slower across all runtimes. Root causes:

1. No i32x4.rotl in WASM SIMD; 3× rotation cost.

WASM SIMD has no rotate-left instruction for v128. Each rotation requires three instructions: i32x4.shl + i32x4.shr_u + v128.or. ChaCha20 performs 4 rotations per quarter-round × 8 quarter-rounds per double-round × 10 double-rounds = 320 word rotations per block. The 3× cost triples the instruction count for one of the three core operations of the cipher (add, xor, rotate).
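The three-instruction rotation can be sketched lane by lane in plain TypeScript. A 4-entry Uint32Array stands in for a v128 value, and the helper names merely mirror the WASM mnemonics; they are illustrative stand-ins, not bindings to real SIMD instructions.

```typescript
// Stand-in for a v128 holding four 32-bit lanes (illustrative only).
type V128 = Uint32Array;

// Models i32x4.shl: shift every lane left by n bits.
function i32x4Shl(v: V128, n: number): V128 {
  const r = new Uint32Array(4);
  for (let j = 0; j < 4; j++) r[j] = v[j] << n;
  return r;
}

// Models i32x4.shr_u: logical (unsigned) right shift of every lane.
function i32x4ShrU(v: V128, n: number): V128 {
  const r = new Uint32Array(4);
  for (let j = 0; j < 4; j++) r[j] = v[j] >>> n;
  return r;
}

// Models v128.or: bitwise OR of two vectors.
function v128Or(a: V128, b: V128): V128 {
  const r = new Uint32Array(4);
  for (let j = 0; j < 4; j++) r[j] = a[j] | b[j];
  return r;
}

// Rotate-left built from the three operations above (valid for 0 < n < 32).
function i32x4Rotl(v: V128, n: number): V128 {
  return v128Or(i32x4Shl(v, n), i32x4ShrU(v, 32 - n));
}
```

For example, rotating a lane holding 0x12345678 left by 8 gives 0x34567812. A scalar i32.rotl does the same in one instruction.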

2. 6 cross-lane shuffles per double-round.

The diagonal quarter-rounds require word indices to be realigned across v128 lanes after the column rounds. Each realignment costs an i8x16.shuffle instruction. That is six shuffles per double-round with no scalar equivalent, which is pure overhead.
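One common way to arrive at six shuffles per double-round is to rotate rows 1, 2, and 3 of the state by 1, 2, and 3 lanes before the diagonal round and rotate them back afterwards. The sketch below assumes that scheme (the source does not spell out the exact shuffle pattern) and emulates it in plain TypeScript, with a 4-entry Uint32Array standing in for each v128 row. All names are hypothetical.

```typescript
type Row = Uint32Array; // 4 lanes, stand-in for one v128 holding one state row

const rotl32 = (x: number, n: number): number => ((x << n) | (x >>> (32 - n))) >>> 0;

// Lane rotation: output lane i takes input lane (i + k) mod 4.
// Each call stands in for one i8x16.shuffle.
function rotateLanes(r: Row, k: number): Row {
  const out = new Uint32Array(4);
  for (let i = 0; i < 4; i++) out[i] = r[(i + k) & 3];
  return out;
}

// Four quarter-rounds in parallel: lane i of (a, b, c, d) forms one quarter-round.
// Lanes are independent, so looping per lane gives the same result as vector ops.
function quarterRoundRows(a: Row, b: Row, c: Row, d: Row): void {
  for (let i = 0; i < 4; i++) {
    a[i] += b[i]; d[i] = rotl32(d[i] ^ a[i], 16);
    c[i] += d[i]; b[i] = rotl32(b[i] ^ c[i], 12);
    a[i] += b[i]; d[i] = rotl32(d[i] ^ a[i], 8);
    c[i] += d[i]; b[i] = rotl32(b[i] ^ c[i], 7);
  }
}

// One double-round on a single block held as four rows.
function doubleRoundRows(state: Uint32Array): Uint32Array {
  const a = new Uint32Array(state.subarray(0, 4));
  let b = new Uint32Array(state.subarray(4, 8));
  let c = new Uint32Array(state.subarray(8, 12));
  let d = new Uint32Array(state.subarray(12, 16));
  quarterRoundRows(a, b, c, d);                                      // column round
  b = rotateLanes(b, 1); c = rotateLanes(c, 2); d = rotateLanes(d, 3); // 3 shuffles
  quarterRoundRows(a, b, c, d);                                      // diagonal round
  b = rotateLanes(b, 3); c = rotateLanes(c, 2); d = rotateLanes(d, 1); // 3 shuffles
  const out = new Uint32Array(16);
  out.set(a, 0); out.set(b, 4); out.set(c, 8); out.set(d, 12);
  return out;
}

// Scalar reference double-round, used to check the row-based version.
function doubleRoundScalar(state: Uint32Array): Uint32Array {
  const s = new Uint32Array(state);
  const qr = (a: number, b: number, c: number, d: number): void => {
    s[a] += s[b]; s[d] = rotl32(s[d] ^ s[a], 16);
    s[c] += s[d]; s[b] = rotl32(s[b] ^ s[c], 12);
    s[a] += s[b]; s[d] = rotl32(s[d] ^ s[a], 8);
    s[c] += s[d]; s[b] = rotl32(s[b] ^ s[c], 7);
  };
  qr(0, 4, 8, 12); qr(1, 5, 9, 13); qr(2, 6, 10, 14); qr(3, 7, 11, 15);
  qr(0, 5, 10, 15); qr(1, 6, 11, 12); qr(2, 7, 8, 13); qr(3, 4, 9, 14);
  return s;
}
```

The scalar version reaches the diagonals simply by changing indices, which is why the six rotateLanes calls have no scalar counterpart.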

3. V8/JSC register-promotion neutralises the memory traffic advantage.

The expected win from SIMD was eliminating repeated loads of the 16-word state matrix (CHACHA_STATE_OFFSET words 0–15, fixed constant addresses). V8 and JSC already apply register promotion to these fixed-address loads, keeping all 16 words in scalar registers across the round loop. SIMD was supposed to load them once into v128 locals, but V8/JSC already do the equivalent. The advantage does not materialise.

The inter-block 4-wide approach (chacha20_simd_4x.ts) avoids all three issues: it processes 4 independent blocks simultaneously, so each SIMD instruction does 4× the useful work. Rotation cost per block is identical to scalar but 4 blocks complete in the same time. No diagonal alignment is needed (independent blocks require no shuffles). And the v128 local loads are genuinely beneficial since the 4-block working set does not fit in scalar registers.


Cross-References

| Document | Description |
|---|---|
| index | Project Documentation index |
| asm_chacha | WASM API reference including SIMD exports |
| chacha20 | TypeScript wrapper classes |
| serpent_simd_bench | Serpent-256 SIMD benchmark (same inter-block model) |
| chacha_audit.md | XChaCha20-Poly1305 implementation audit |