ChaCha20 SIMD 4-Wide Benchmark Results
Measured throughput for the 4-wide inter-block SIMD implementation (chachaEncryptChunk_simd) across Chromium, Firefox, WebKit, and Bun. See chacha_audit.md for algorithm correctness verification.
Environment
4-wide inter-block SIMD (chachaEncryptChunk_simd): each v128 register lane
holds word w from a different block (counters ctr, ctr+1, ctr+2, ctr+3).
This is the same parallelism model used in Serpent CTR-4.
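The lane layout described above can be sketched in plain TypeScript, with a 4-element Uint32Array standing in for one v128 register. This is only an illustration of the data layout, assuming the RFC 8439 state arrangement (constants in words 0–3, key in words 4–11, counter in word 12, nonce in words 13–15). The helper name buildLaneState is hypothetical and is not an export of the library.

```typescript
// Illustrative only: each Uint32Array(4) emulates one v128 (i32x4) register.
// buildLaneState is a hypothetical helper, not part of the library's API.
const SIGMA = [0x61707865, 0x3320646e, 0x79622d32, 0x6b206574]; // "expand 32-byte k"

function buildLaneState(key: Uint32Array, nonce: Uint32Array, ctr: number): Uint32Array[] {
  // Scalar RFC 8439 state for the first block.
  const base = new Uint32Array(16);
  base.set(SIGMA, 0);
  base.set(key, 4);    // 8 key words
  base[12] = ctr;      // 32-bit block counter
  base.set(nonce, 13); // 3 nonce words

  // Transpose: vector w holds word w of blocks ctr, ctr+1, ctr+2, ctr+3.
  const lanes: Uint32Array[] = [];
  for (let w = 0; w < 16; w++) {
    const v = new Uint32Array(4).fill(base[w]); // same word in every lane
    if (w === 12) {
      for (let b = 0; b < 4; b++) v[b] = ctr + b; // only the counter differs per lane
    }
    lanes.push(v);
  }
  return lanes;
}

const laneState = buildLaneState(new Uint32Array(8), new Uint32Array(3), 7);
```

In this layout 15 of the 16 vectors are plain broadcasts; only the counter vector carries distinct values per lane.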
- Date: 2026-03-27
- Hardware: Apple Silicon (arm64)
- Bun version: measured via bun run test
- Browsers: Playwright; Chromium, Firefox, WebKit
- Benchmark: test/e2e/chacha20_simd_bench.spec.ts
- 50 warmup iterations, then 200–5000 timed trials per chunk size
- Key: RFC 8439 §2.4.2 all-zero-sequential, Nonce: SWEEP_NONCE
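The warmup-then-timed-trials procedure listed above can be sketched as follows. This is a minimal illustration, not the code in test/e2e/chacha20_simd_bench.spec.ts: the names measureMBps, speedup, and xorPass are hypothetical, the workload is a stand-in for the cipher, and 1 MB is assumed to mean 10^6 bytes.

```typescript
// Hypothetical harness: names are illustrative and are not taken from
// test/e2e/chacha20_simd_bench.spec.ts.
type ChunkFn = (chunk: Uint8Array) => void;

// Throughput in MB/s, assuming 1 MB = 1e6 bytes.
function measureMBps(fn: ChunkFn, chunkBytes: number, trials: number, warmup = 50): number {
  const chunk = new Uint8Array(chunkBytes);
  for (let i = 0; i < warmup; i++) fn(chunk); // let the JIT / WASM tiers settle
  const t0 = performance.now();
  for (let i = 0; i < trials; i++) fn(chunk);
  const seconds = (performance.now() - t0) / 1000;
  return (chunkBytes * trials) / 1e6 / seconds;
}

// Speedup column: SIMD throughput divided by scalar throughput.
const speedup = (scalarMBps: number, simdMBps: number): number => simdMBps / scalarMBps;

// Stand-in workload so the sketch runs without the real cipher.
const xorPass: ChunkFn = (chunk) => {
  for (let i = 0; i < chunk.length; i++) chunk[i] ^= 0x5a;
};
const demoMBps = measureMBps(xorPass, 65536, 200);
```

Small chunks need more trials than large ones to rise above timer resolution, which is one plausible reason for a trial count that varies with chunk size.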
Browser throughput
Single thread.
Chromium (V8)
| Chunk size | Scalar (MB/s) | SIMD (MB/s) | Speedup |
|---|---|---|---|
| 65,536 B | 506.1 | 1285.0 | 2.54× |
| 16,384 B | 512.0 | 1204.7 | 2.35× |
| 256 B | 328.2 | 711.1 | 2.17× |
Firefox (SpiderMonkey)
| Chunk size | Scalar (MB/s) | SIMD (MB/s) | Speedup |
|---|---|---|---|
| 65,536 B | 24.9 | 60.1 | 2.42× |
| 16,384 B | 23.4 | 56.9 | 2.43× |
| 256 B | 22.5 | 53.3 | 2.38× |
WebKit (JSC)
| Chunk size | Scalar (MB/s) | SIMD (MB/s) | Speedup |
|---|---|---|---|
| 65,536 B | 409.6 | 1191.6 | 2.91× |
| 16,384 B | 431.2 | 1365.3 | 3.17× |
| 256 B | 256.0 | 426.7 | 1.67× |
Bun
JavaScriptCore-based; measured via extended benchmark in test/unit/chacha20/chacha20_simd_4x_gate.test.ts
(50 warmup, 200 trials):
| Chunk size | Scalar (MB/s) | SIMD (MB/s) | Speedup |
|---|---|---|---|
| 65,536 B | ~310–330 | ~970–1030 | ~3.11× |
| 16,384 B | ~310–330 | ~980–1050 | ~3.17× |
Analysis
Inter-block SIMD delivers roughly 2–3× gains across all tested runtimes, from 1.67× (WebKit, 256 B) to 3.17× (WebKit and Bun, 16,384 B).
Firefox (SpiderMonkey) has significantly lower absolute throughput (~22–60 MB/s vs ~250–1365 MB/s on V8/JSC) for both scalar and SIMD paths. This is a known SpiderMonkey characteristic for tight WASM inner loops with many fixed-address loads; SpiderMonkey does not perform the same alias-analysis-based register promotion that V8 applies. Despite the lower absolute numbers, the speedup ratio is consistent (2.38–2.43×); SpiderMonkey benefits from SIMD proportionally.
SIMD recovers Firefox throughput relative to scalar.
The scalar path relies on fixed-address loads to the state matrix
(CHACHA_STATE_OFFSET). V8/JSC recognise these as loop-invariant and
register-promote them. SpiderMonkey does not, paying memory traffic on every
iteration. The SIMD path loads all 16 state words once into v128 locals before
the round loop, making the loop-invariant promotion explicit in the code, so
SpiderMonkey sees the same working set as V8/JSC.
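The difference between relying on the engine's register promotion and making it explicit can be illustrated with a scalar analogy. This is a sketch only: the real SIMD path uses v128 locals in WASM, while the code below uses plain numbers in TypeScript. The value of STATE_OFFSET and both function names are hypothetical, and a single ChaCha quarter-round stands in for the full round loop.

```typescript
// Illustrative only: STATE_OFFSET's value and both function names are hypothetical.
const STATE_OFFSET = 16;
const rotl32 = (x: number, n: number): number => ((x << n) | (x >>> (32 - n))) >>> 0;

// Variant 1: every operand is read from and written back to a fixed address.
// This is fast only if the engine promotes these slots to registers itself.
function roundsInMemory(mem: Uint32Array, rounds: number): void {
  const a = STATE_OFFSET, b = a + 1, c = a + 2, d = a + 3;
  for (let r = 0; r < rounds; r++) {
    mem[a] += mem[b]; mem[d] = rotl32(mem[d] ^ mem[a], 16);
    mem[c] += mem[d]; mem[b] = rotl32(mem[b] ^ mem[c], 12);
    mem[a] += mem[b]; mem[d] = rotl32(mem[d] ^ mem[a], 8);
    mem[c] += mem[d]; mem[b] = rotl32(mem[b] ^ mem[c], 7);
  }
}

// Variant 2: load once into locals, loop, store once. The promotion is
// explicit, so it does not depend on the engine's alias analysis.
function roundsInLocals(mem: Uint32Array, rounds: number): void {
  let a = mem[STATE_OFFSET], b = mem[STATE_OFFSET + 1];
  let c = mem[STATE_OFFSET + 2], d = mem[STATE_OFFSET + 3];
  for (let r = 0; r < rounds; r++) {
    a = (a + b) >>> 0; d = rotl32(d ^ a, 16);
    c = (c + d) >>> 0; b = rotl32(b ^ c, 12);
    a = (a + b) >>> 0; d = rotl32(d ^ a, 8);
    c = (c + d) >>> 0; b = rotl32(b ^ c, 7);
  }
  mem.set([a, b, c, d], STATE_OFFSET);
}
```

Both variants compute the same result; they differ only in how much memory traffic the loop body implies when the engine does not optimise it away.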
256-byte inputs (minimum SIMD threshold, exactly one 4-block group) show a smaller gain on WebKit (1.67×) and a larger gain on Firefox (2.38×). At this size the loop-body overhead is proportionally larger; the larger Firefox gain follows from the v128 local benefit described above.
Negative result: intra-block SIMD
A prior attempt at intra-block SIMD (one block using v128 with shuffles) was benchmarked across 4 attempts and measured 0.60×, 0.72×, 0.71×, and 0.70× of scalar throughput, uniformly slower across all runtimes. Root causes:
1. No i32x4.rotl in WASM SIMD; 3× rotation cost.
WASM SIMD has no rotate-left instruction for v128. Each rotation requires three
instructions: i32x4.shl + i32x4.shr_u + v128.or. ChaCha20 performs 4
rotations per quarter-round × 8 quarter-rounds per double-round × 10
double-rounds = 320 rotations per block. The 3× cost triples the instruction
count of every rotation, one of the most frequent operations in the cipher.
2. 6 cross-lane shuffles per double-round.
The diagonal quarter-rounds require word indices to be realigned across v128
lanes after the column rounds. Each realignment costs an i8x16.shuffle
instruction. The six shuffles per double-round have no scalar equivalent and are
pure overhead.
3. V8/JSC register-promotion neutralises the memory traffic advantage.
The expected win from SIMD was eliminating repeated loads of the 16-word state
matrix (CHACHA_STATE_OFFSET words 0–15, fixed constant addresses). V8 and JSC
already apply register promotion to these fixed-address loads, keeping all 16
words in scalar registers across the round loop. SIMD was supposed to load them
once into v128 locals, but V8/JSC already do the equivalent. The advantage does
not materialise.
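The first two costs can be made concrete with a small emulation of the intra-block layout, where each row of the 4×4 state lives in one v128. This is a sketch under assumptions, not the benchmarked code: a Uint32Array(4) stands in for a v128, a lane rotation stands in for one i8x16.shuffle, and all names are hypothetical.

```typescript
// Illustrative emulation: a Uint32Array(4) stands in for one v128 (i32x4).
type V128 = Uint32Array;
const ops = { rotlInstrs: 0, shuffles: 0 };

const add = (a: V128, b: V128): V128 => a.map((x, i) => x + b[i]); // stores wrap mod 2^32
const xor = (a: V128, b: V128): V128 => a.map((x, i) => x ^ b[i]);
// No i32x4.rotl: emulate with i32x4.shl + i32x4.shr_u + v128.or (3 instructions).
const rotl = (a: V128, n: number): V128 => {
  ops.rotlInstrs += 3;
  return a.map((x) => (x << n) | (x >>> (32 - n)));
};
// Lane rotation, standing in for one i8x16.shuffle: lane i takes lane (i + k) mod 4.
const rotLanes = (a: V128, k: number): V128 => {
  ops.shuffles += 1;
  return a.map((_, i) => a[(i + k) & 3]);
};

// One quarter-round applied to all four columns (or diagonals) at once.
function quarterRounds([a, b, c, d]: V128[]): V128[] {
  a = add(a, b); d = rotl(xor(d, a), 16);
  c = add(c, d); b = rotl(xor(b, c), 12);
  a = add(a, b); d = rotl(xor(d, a), 8);
  c = add(c, d); b = rotl(xor(b, c), 7);
  return [a, b, c, d];
}

// rows[r] holds state words 4r..4r+3 of a single block.
function doubleRound(rows: V128[]): V128[] {
  let [a, b, c, d] = quarterRounds(rows);                       // column round
  b = rotLanes(b, 1); c = rotLanes(c, 2); d = rotLanes(d, 3);   // diagonalize: 3 shuffles
  [a, b, c, d] = quarterRounds([a, b, c, d]);                   // diagonal round
  b = rotLanes(b, 3); c = rotLanes(c, 2); d = rotLanes(d, 1);   // restore: 3 shuffles
  return [a, b, c, d];
}

const demoRows = doubleRound([0, 4, 8, 12].map((r) => new Uint32Array([r, r + 1, r + 2, r + 3])));
// ops.shuffles is now 6 and ops.rotlInstrs is 24 (8 vector rotations x 3 instructions).
```

The counters track only the emulated rotation and shuffle instructions; they say nothing about how a given engine schedules them.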
The inter-block 4-wide approach (chacha20_simd_4x.ts) avoids all three issues:
it processes 4 independent blocks simultaneously, so each SIMD instruction does
4× the useful work. Rotation cost per block is identical to scalar but 4 blocks
complete in the same time. No diagonal alignment is needed (independent blocks
require no shuffles). And the v128 local loads are genuinely beneficial since the
4-block working set does not fit in scalar registers.
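For contrast, the inter-block model can be emulated the same way. In the sketch below, v[w] is a Uint32Array(4) holding word w of four independent blocks, so the quarter-round keeps the scalar index pattern and the diagonal round needs no lane movement. The names are hypothetical and this is not the code in chacha20_simd_4x.ts; the usage lines apply the RFC 8439 §2.1.1 quarter-round test vector to every lane.

```typescript
// Illustrative emulation: v[w] is a Uint32Array(4) holding word w of 4 blocks.
type Lanes = Uint32Array;
const rotl4 = (x: Lanes, n: number): Lanes => x.map((w) => (w << n) | (w >>> (32 - n)));
const add4 = (x: Lanes, y: Lanes): Lanes => x.map((w, i) => w + y[i]); // wraps mod 2^32
const xor4 = (x: Lanes, y: Lanes): Lanes => x.map((w, i) => w ^ y[i]);

// Same index pattern as the scalar quarter-round; each op advances 4 blocks.
function quarterRound4(v: Lanes[], a: number, b: number, c: number, d: number): void {
  v[a] = add4(v[a], v[b]); v[d] = rotl4(xor4(v[d], v[a]), 16);
  v[c] = add4(v[c], v[d]); v[b] = rotl4(xor4(v[b], v[c]), 12);
  v[a] = add4(v[a], v[b]); v[d] = rotl4(xor4(v[d], v[a]), 8);
  v[c] = add4(v[c], v[d]); v[b] = rotl4(xor4(v[b], v[c]), 7);
}

function doubleRound4(v: Lanes[]): void {
  // Column round.
  quarterRound4(v, 0, 4, 8, 12); quarterRound4(v, 1, 5, 9, 13);
  quarterRound4(v, 2, 6, 10, 14); quarterRound4(v, 3, 7, 11, 15);
  // Diagonal round: only the word indices change, no lane shuffles.
  quarterRound4(v, 0, 5, 10, 15); quarterRound4(v, 1, 6, 11, 12);
  quarterRound4(v, 2, 7, 8, 13); quarterRound4(v, 3, 4, 9, 14);
}

// Usage: the RFC 8439 section 2.1.1 quarter-round test vector in all four lanes.
const qv = [0x11111111, 0x01020304, 0x9b8d6f43, 0x01234567].map((w) => new Uint32Array(4).fill(w));
quarterRound4(qv, 0, 1, 2, 3);
// Every lane of qv now holds 0xea2a92f4, 0xcb1cf8ce, 0x4581472e, 0x5881c4bb.
```

Because lanes never interact, the diagonal round is expressed purely by choosing different vectors, which is why this model needs no shuffle instructions.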
Cross-References
| Document | Description |
|---|---|
| index | Project Documentation index |
| asm_chacha | WASM API reference including SIMD exports |
| chacha20 | TypeScript wrapper classes |
| serpent_simd_bench | Serpent-256 SIMD benchmark (same inter-block model) |
| chacha_audit.md | XChaCha20-Poly1305 implementation audit |