serpent_simd_bench - xero/leviathan-crypto GitHub Wiki

Serpent-256 SIMD Benchmark Results

Measured throughput for 4-wide inter-block SIMD (encryptChunk_simd) across Chromium, Firefox, and WebKit on Apple Silicon. See Serpent implementation audit for algorithm correctness verifications.

Table of Contents


4-wide inter-block SIMD (encryptChunk_simd): each v128 register lane holds word w from a different block (counters ctr, ctr+1, ctr+2, ctr+3). Same parallelism model as ChaCha20 CTR-4.

Environment

  • Date: 2026-03-27
  • Hardware: Apple Silicon (arm64)
  • Browsers: Playwright — Chromium, Firefox, WebKit
  • Benchmark: test/e2e/serpent_simd_bench.spec.ts
    • 50-100 warmup iterations, then 200-2000 timed trials per chunk size
    • Key: 32-byte sequential (0x00..0x1f), Nonce: 16-byte sequential (0x00..0x0f)

Browser throughput, single thread

Chromium (V8)

Chunk size Scalar (MB/s) SIMD (MB/s) Speedup
65,536 B 15.0 38.9 2.59×
16,384 B 15.2 39.1 2.58×
1,024 B 14.7 37.6 2.55×

Firefox (SpiderMonkey)

Chunk size Scalar (MB/s) SIMD (MB/s) Speedup
65,536 B 7.1 15.8 2.22×
16,384 B 7.4 15.7 2.11×
1,024 B 7.0 14.8 2.12×

WebKit (JSC)

Chunk size Scalar (MB/s) SIMD (MB/s) Speedup
65,536 B 33.7 43.5 1.29×
16,384 B 34.7 43.6 1.26×
1,024 B 32.5 40.2 1.24×

Analysis

Inter-block SIMD delivers 1.2-2.6× gains across all tested runtimes.

Chromium (V8) and Firefox (SpiderMonkey) see the largest gains (2.1-2.6×). WebKit (JSC) shows a smaller but consistent gain (1.24-1.29×); JSC's scalar JIT is already more aggressive for this workload, leaving less headroom for SIMD.

Firefox absolute throughput is lower (~7 MB/s scalar vs ~15-35 MB/s on V8/JSC) for the same reason as ChaCha20: SpiderMonkey does not apply the same alias-analysis-based register promotion that V8/JSC use for fixed-address loads. The speedup ratio is consistent (2.11-2.22×) despite lower absolute numbers.

The 1,024-byte chunk size (Serpent CTR-4 SIMD threshold is 64 bytes; a single 4-block group is 64 bytes) shows speedup essentially equal to large chunks. Unlike ChaCha20 where the minimum SIMD threshold (256 bytes) affects small-chunk ratios, Serpent's smaller block size means SIMD benefits appear at much smaller inputs.


CBC decrypt, single thread

CBC encryption is not parallelizable (sequential dependency: CT[n] = encrypt(PT[n] XOR CT[n-1])). Only the decrypt path benefits from SIMD.

CBC decrypt SIMD benchmarks are not yet measured. Use the above CTR numbers as a proxy; the SIMD model is identical (4-wide inter-block parallelism on independent blocks), and CBC decrypt is structurally identical to CTR encryption for the purpose of SIMD throughput.


Cross-References

Document Description
index Project Documentation index
asm_serpent WASM API reference including SIMD exports
serpent TypeScript wrapper classes
chacha_simd_bench ChaCha20 SIMD benchmark (same inter-block model)
serpent_audit.md Serpent-256 implementation audit