# Performance Benchmarks

*Steel-SecAdv-LLC/AMA-Cryptography GitHub Wiki*
Authoritative sources:

- `BENCHMARKS.md` (generated locally by running `python benchmark_suite.py`; not checked into version control) is the authoritative Python-API benchmark document.
- `benchmark-results.json` / `benchmark-report.md` (generated by `python benchmarks/benchmark_runner.py`) anchor the CI regression gate.
- `build/bin/benchmark_c_raw --json` reports raw C throughput without any ctypes overhead.

The tables on this wiki page are a snapshot refreshed alongside the repository; they will shift by ±20% on a different host.
Benchmark results for AMA Cryptography on Linux x86-64. All measurements use the native C library via Python/ctypes unless noted.
Platform: Linux x86-64 | CPU: 16 logical cores (AVX-512F/BW/DQ/VL/VBMI + VAES + VPCLMULQDQ) | Python: 3.11.15 | Date: 2026-04-25 | ML-DSA-65 Backend: native C (no OpenSSL, no liboqs)
## Summary Dashboard
| Operation | Mean (ms) | Ops/sec |
|---|---|---|
| SHA3-256 (32 B) | 0.001 | 1,002,079 |
| HMAC-SHA3-256 auth (32 B) | 0.004 | 231,090 |
| HMAC-SHA3-256 verify (32 B) | 0.005 | 183,619 |
| HKDF-SHA3-256 (96 B output) | 0.059 | 16,898 |
| Ed25519 keygen | 0.030 | 33,073 |
| Ed25519 sign (240 B) | 0.020 | 50,805 |
| Ed25519 verify (240 B) | 0.049 | 20,559 |
| ML-DSA-65 keygen | 0.280 | 3,574 |
| ML-DSA-65 sign | 0.338 | 2,958 |
| ML-DSA-65 verify | 0.137 | 7,309 |
| KMS generation | 0.425 | 2,353 |
| Package creation (multi-layer) | 0.277 | 3,612 |
| Package verification | 0.230 | 4,348 |
(Output of python benchmark_suite.py — all numbers Python/ctypes path on the measurement host. See the notes below for the difference between these figures and the CI-regression-suite baseline.)
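A per-operation table like this is typically produced by timing the operation in a loop and reporting summary statistics. A minimal sketch of such a harness (illustrative only, not the actual `benchmark_suite.py`; the stdlib `hashlib.sha3_256` stands in for the native `ama_sha3_256`, so the numbers will differ):

```python
import hashlib
import statistics
import time

def bench(fn, iterations=1_000):
    """Time fn over `iterations` runs; report mean/median/stdev (ms) and ops/sec."""
    samples = []
    for _ in range(iterations):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1000.0)  # milliseconds
    mean_ms = statistics.fmean(samples)
    return {
        "mean_ms": mean_ms,
        "median_ms": statistics.median(samples),
        "stdev_ms": statistics.stdev(samples),
        "ops_per_sec": 1000.0 / mean_ms,
    }

# Stand-in workload: SHA3-256 over a 32-byte message via the stdlib.
msg = b"\x00" * 32
stats = bench(lambda: hashlib.sha3_256(msg).digest())
print(f"{stats['mean_ms']:.4f} ms mean, {stats['ops_per_sec']:,.0f} ops/sec")
```

Warm-up iterations and pinning to a quiet core reduce the stdev column; the figures above were taken on an otherwise idle host.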
## Key Generation
| Operation | Mean (ms) | Median (ms) | Std Dev (ms) | Ops/sec | Iterations |
|---|---|---|---|---|---|
| master_secret | 0.0049 | 0.0043 | 0.0028 | 202,241 | 10,000 |
| hkdf_derivation | 0.0592 | 0.0535 | 0.0156 | 16,898 | 1,000 |
| ed25519_keygen | 0.0302 | 0.0285 | 0.0123 | 33,073 | 1,000 |
| dilithium_keygen | 0.2798 | 0.2755 | 0.0255 | 3,574 | 100 |
| kms_generation | 0.4250 | 0.4011 | 0.0743 | 2,353 | 100 |
## Cryptographic Operations
| Operation | Mean (ms) | Median (ms) | Std Dev (ms) | Ops/sec | Iterations |
|---|---|---|---|---|---|
| sha3_256 | 0.0010 | 0.0009 | 0.0004 | 1,002,079 | 10,000 |
| hmac_auth | 0.0043 | 0.0040 | 0.0015 | 231,090 | 10,000 |
| hmac_verify | 0.0054 | 0.0049 | 0.0015 | 183,619 | 10,000 |
| ed25519_sign | 0.0197 | 0.0172 | 0.0044 | 50,805 | 1,000 |
| ed25519_verify | 0.0486 | 0.0446 | 0.0086 | 20,559 | 1,000 |
| dilithium_sign | 0.3381 | 0.3345 | 0.0178 | 2,958 | 100 |
| dilithium_verify | 0.1368 | 0.1341 | 0.0146 | 7,309 | 100 |
## Package Operations (Multi-Layer)
| Operation | Mean (ms) | Median (ms) | Std Dev (ms) | Ops/sec | Iterations |
|---|---|---|---|---|---|
| canonical_encoding | 0.0015 | 0.0014 | 0.0006 | 657,855 | 10,000 |
| code_hash | 0.0154 | 0.0140 | 0.0038 | 65,012 | 10,000 |
| package_creation | 0.2769 | 0.2659 | 0.0975 | 3,612 | 100 |
| package_verification | 0.2300 | 0.2247 | 0.0279 | 4,348 | 100 |
## Ethical Integration Overhead
| Operation | Mean (ms) | Ops/sec |
|---|---|---|
| ethical_context | 0.0046 | 218,867 |
| hkdf_standard | 0.0079 | 126,715 |
| hkdf_with_ethical | 0.0218 | 45,951 |
Ethical-context overhead: 0.0218 − 0.0079 = 0.0139 ms (13.9 µs) per derivation, about 2.8× the standard-HKDF latency but still well under a millisecond; at 45,951 ops/sec the per-operation cost is negligible.
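The overhead figure follows directly from the two HKDF rows above:

```python
# Overhead of the ethical-context HKDF variant, from the table above (ms).
hkdf_standard_ms = 0.0079
hkdf_with_ethical_ms = 0.0218

overhead_ms = hkdf_with_ethical_ms - hkdf_standard_ms
overhead_us = overhead_ms * 1000.0
slowdown = hkdf_with_ethical_ms / hkdf_standard_ms

print(f"overhead: {overhead_ms:.4f} ms ({overhead_us:.1f} us)")  # 0.0139 ms (13.9 us)
print(f"slowdown vs standard HKDF: {slowdown:.1f}x")             # ~2.8x
```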
## Scalability (Package Creation by Input Size)
| Input Scale | Mean (ms) | Ops/sec | Iterations |
|---|---|---|---|
| 1x baseline | 0.3865 | 2,587 | 50 |
| 10x | 0.5393 | 1,854 | 50 |
| 100x | 2.9212 | 342 | 50 |
| 1000x | 94.7441 | 11 | 50 |
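The ops/sec column is just the reciprocal of the mean latency; recomputing it from the table also makes the scaling behaviour explicit:

```python
# Scale factor -> mean latency (ms), from the scalability table above.
rows = {1: 0.3865, 10: 0.5393, 100: 2.9212, 1000: 94.7441}

for scale, mean_ms in rows.items():
    print(f"{scale:>5}x: {1000.0 / mean_ms:,.0f} ops/sec")

# A 10x larger input costs only ~1.4x the baseline latency at small scale
# (fixed per-package overhead dominates), but ~32x going from 100x to 1000x
# on this host, where hashing and memory traffic dominate.
print(f"10x/1x latency ratio:     {rows[10] / rows[1]:.2f}")
print(f"1000x/100x latency ratio: {rows[1000] / rows[100]:.1f}")
```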
## Performance Notes

### Cython Acceleration
When built with Cython (python setup.py build_ext --inplace), mathematical operations in the 3R monitoring engine (Lyapunov stability, helical computations, NTT polynomial operations) show:
- 18–37x speedup over the pure Python mathematical baseline
- NumPy-integrated batch operations
Cython acceleration does not affect C-implemented cryptographic primitives (they are already native). The speedup comparison baseline is pure Python loops — not the native C library.
### Algorithm Comparison
| Algorithm | Sign (ms) | Verify (ms) | Sig Size |
|---|---|---|---|
| Ed25519 | 0.09 | 0.14 | 64 bytes |
| ML-DSA-65 | 0.53 | 0.15 | 3,309 bytes |
| Hybrid (Ed25519 + ML-DSA-65) | ~0.62 | ~0.29 | 3,373 bytes |
| SPHINCS+-SHA2-256f | ~230 | ~5.90 | 49,856 bytes |
ML-DSA-65 is ~6× slower to sign than Ed25519 on this host (pre-SIMD scalar NTT path) but provides NIST category III quantum security. Sign/verify latency shifts substantially with CPU microarchitecture; re-run `benchmark_suite.py` on your deployment host before quoting numbers externally.
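The hybrid row is approximately the sum of its components, since a hybrid signature runs both schemes and concatenates the two signature blobs; a quick check against the table:

```python
# Per-scheme figures from the algorithm-comparison table above.
ed25519 = {"sign_ms": 0.09, "verify_ms": 0.14, "sig_bytes": 64}
ml_dsa_65 = {"sign_ms": 0.53, "verify_ms": 0.15, "sig_bytes": 3309}

# Hybrid = both schemes run back-to-back, signatures concatenated.
hybrid = {k: ed25519[k] + ml_dsa_65[k] for k in ed25519}
print(f"hybrid: sign ~{hybrid['sign_ms']:.2f} ms, "
      f"verify ~{hybrid['verify_ms']:.2f} ms, "
      f"{hybrid['sig_bytes']}-byte signature")
```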
## X25519 Field-Path Selection (3.0.0)
The X25519 Montgomery ladder now picks its field-arithmetic representation deterministically at compile time:
| Toolchain / target | Path | Layout |
|---|---|---|
| x86-64 GCC/Clang + __int128 | fe64 | radix 2^64, 4 limbs of uint64_t |
| Other 64-bit GCC/Clang + __int128 (aarch64, ppc64le, …) | fe51 | radix 2^51, 5 limbs |
| MSVC, clang-cl, 32-bit, no __int128 | gf16 | radix 2^16, 16 limbs of int64_t |
Verify which path the local build picked:

```bash
./build/bin/test_x25519_path
```
The two __int128 paths are byte-for-byte arithmetic equivalent — see
tests/c/test_x25519_field_equiv.c, which runs 1024 random
(scalar, point) vectors through both ladders and asserts every output
matches.
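fe51 and fe64 are two radix representations of the same field element, which is why byte-for-byte equivalence is a meaningful property to test. A minimal sketch of the two limb layouts (illustrative Python, not the library's C code):

```python
def to_limbs(x: int, radix_bits: int, nlimbs: int) -> list[int]:
    """Split x into nlimbs little-endian limbs of radix 2^radix_bits."""
    mask = (1 << radix_bits) - 1
    return [(x >> (i * radix_bits)) & mask for i in range(nlimbs)]

def from_limbs(limbs: list[int], radix_bits: int) -> int:
    """Reassemble the integer from its limbs."""
    return sum(l << (i * radix_bits) for i, l in enumerate(limbs))

p = 2**255 - 19              # the curve25519 field prime
x = pow(3, 2**100, p)        # an arbitrary field element

fe51 = to_limbs(x, 51, 5)    # radix 2^51: 5 limbs cover 255 bits
fe64 = to_limbs(x, 64, 4)    # radix 2^64: 4 limbs cover 256 bits

# Both layouts encode the same integer; test_x25519_field_equiv.c checks the
# analogous property end-to-end through the two Montgomery ladders.
assert from_limbs(fe51, 51) == from_limbs(fe64, 64) == x
```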
On a Sapphire Rapids canonical-host run with benchmark_c_raw,
the previous pure-C fe64 path measured ~11,500 X25519 DH ops/sec
vs ~21,800 for fe51 on the same hardware (Linux, GCC 12,
-O3 -march=native). The radix-2^64 schoolbook trails the
radix-2^51 carry-pipelined layout in pure C because GCC does not
yet generate MULX+ADCX (BMI2+ADX) for the 4×4 schoolbook pattern.
### X25519 fe64 MULX+ADX kernel (3.0.0, PR D)
When CPUID reports both BMI2 (CPUID.(EAX=7,ECX=0):EBX[8]) and
ADX (EBX[19]), the dispatcher promotes the inner ladder's
multiply / square to the in-house MULX+ADCX/ADOX kernel in
src/c/internal/ama_x25519_fe64_mulx.c, compiled with per-file
-mbmi2 -madx -O3 flags. Bundle gate:
ama_cpuid_has_x25519_mulx() (defensive: gates each bit
explicitly even though every shipped Intel Broadwell+ / AMD Zen+
part has both).
On the same canonical-host class with the kernel active, this build's benchmark sandbox measures ~13,168 X25519 DH ops/sec via the Python C-FFI runner (~13,988 ops/sec when the C-raw harness amortises the FFI layer away), a real ~21% improvement over the pure-C fe64 baseline. Byte-equivalence to pure-C fe64 is asserted across 4096/4096 random vectors by tests/c/test_x25519_fe64_mulx_equiv.c (which skips with exit code 77 on hosts whose CPUID lacks the bundle).
The kernel is implemented as GCC/Clang inline assembly with explicit `mulx` (BMI2) plus `adcx` / `adox` (ADX) instructions, not via `_mulx_u64` + `_addcarry_u64` intrinsics. The inline-asm path exists specifically because GCC's `_addcarry_u64` did not lower to ADCX/ADOX even under `-madx`; without the explicit mnemonics the kernel's lo-column and hi-column carry chains would serialise through a single `adc` chain instead of running in parallel.

The kernel also ships a dedicated squaring path that exploits the off-diagonal symmetry of (sum f_i)^2: 10 multiplications (6 cross products doubled + 4 diagonal squares) versus 16 for the full schoolbook.

The active kernel is therefore already a hand-written inline-asm path using BMI2 MULX plus explicit ADX ADCX / ADOX carry-chain interleave behind the same CPUID gate. The remaining gap to the ~25K ops/sec reported by OpenSSL's hand-tuned crypto/ec/asm/x25519-x86_64.pl on the same microarchitecture class reflects broader implementation differences (instruction scheduling, register allocation, reduction shape, and surrounding glue), not reliance on compiler-lowered intrinsics or the absence of a hand-written asm kernel.

fe51 remains available as a fallback by building with `-DAMA_X25519_FORCE_FE51`; the pure-C fe64 schoolbook still runs on hosts whose CPUID lacks BMI2 + ADX (e.g. a KVM guest with the bits masked, a pre-Broadwell host, or any MSVC build; the kernel TU is GCC/Clang only).
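The squaring-count claim (10 multiplications instead of 16) is easy to sanity-check in plain integer arithmetic; a sketch with illustrative 4-limb values (carries ignored, so this demonstrates the product-count identity, not the kernel's reduction):

```python
from itertools import combinations

def schoolbook_square(f):
    """Full 4x4 schoolbook: 16 limb products."""
    acc = [0] * 8
    for i in range(4):
        for j in range(4):
            acc[i + j] += f[i] * f[j]
    return acc

def symmetric_square(f):
    """Squaring path: 6 cross products doubled + 4 diagonal squares = 10."""
    acc = [0] * 8
    for i, j in combinations(range(4), 2):  # 6 pairs with i < j
        acc[i + j] += 2 * f[i] * f[j]
    for i in range(4):                      # 4 diagonals
        acc[2 * i] += f[i] * f[i]
    return acc

f = [0x123456789ABCDEF0, 0xFEDCBA9876543210,
     0x0F1E2D3C4B5A6978, 0x1122334455667788]
assert schoolbook_square(f) == symmetric_square(f)
print("10-mult symmetric square matches 16-mult schoolbook")
```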
### X25519 4-way batch API (3.0.0, currently opt-in)
ama_x25519_scalarmult_batch(out[], scalars[], points[], count) is
a new additive API that exposes a 4-way AVX2 Montgomery-ladder
kernel for batched Diffie-Hellman. ama_print_dispatch_info()
reports its capability row as X25519 4-way: AVX2 (opt-in, off)
whenever the host has AVX2 but the kernel pointer is unwired —
the kernel does not light up automatically. Set
AMA_DISPATCH_USE_X25519_AVX2=1 in the environment to opt in.
Why opt-in: on x86-64 hosts where the scalar X25519 path is fe64 +
MULX/ADX (Broadwell+ Intel, Zen+ AMD), four sequential scalar
ladders are faster than four lanes of the AVX2 donna-32bit
ladder. The kernel uses 32-bit limbs because AVX2 lacks a
64×64→128 lane-wise multiply (that arrived with AVX-512 IFMA's
VPMADD52LUQ / VPMADD52HUQ); donna-32bit's larger
cross-product schedule outpaces the 4× SIMD width on
Skylake-class cores. See the CHANGELOG [3.0.0] Performance
entry for the per-op measurement and the full retention rationale
(constant-time test lane, CI matrix coverage, fe51/gf16 fallback
hosts, and the planned AVX-512 IFMA port that closes the gap).
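The 32-bit-limb constraint can be made concrete: with small limbs, a full column of accumulated cross products still fits in a 64-bit lane, while a single 64-bit-limb product already would not. A sketch assuming a donna-style ~26-bit limb and a 10-product column (illustrative figures, not the kernel's exact schedule):

```python
# vpmuludq multiplies 32-bit lanes into 64-bit results; there is no lane-wise
# 64x64->128 multiply before AVX-512 IFMA. Small limbs keep every product,
# plus the carries accumulated on top of it, inside 64 bits.
limb_bits = 26                      # donna-style limb (illustrative)
max_limb = (1 << limb_bits) - 1
max_product = max_limb * max_limb   # one 26x26-bit cross product: 52 bits

# Roughly 10 such products accumulate per output limb in a 10-limb schedule,
# and the total still fits a 64-bit lane with headroom to spare:
assert 10 * max_product < (1 << 64)

# A full 64-bit-limb product would need 128 bits, overflowing the lane:
assert (2**64 - 1) ** 2 >= (1 << 64)
print("26-bit limbs: column fits in 64 bits; 64-bit limbs would not")
```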
## SIMD Acceleration Paths (3.0.0)
Every SIMD path below is gated on a runtime CPUID check (with the
appropriate XCR0 state-save check where the path uses ZMM or YMM
registers), built with per-file ISA flags via
set_source_files_properties so the rest of the library stays at the
lowest-common-denominator ISA, and verified against a scalar
reference for byte-identity by the matching tests/c/test_* and
tests/c/test_*_equiv.c lanes. Opt-out env vars are honoured at
runtime so an operator can pin to scalar without rebuilding.
| Primitive | Engineered path | Build / runtime gate | Speedup vs scalar | Reference |
|---|---|---|---|---|
| Keccak-f[1600] (SHA3, SHAKE) | AVX-512 4-way (vprolq + vpternlogq, EVEX-encoded YMM, no ZMM in the hot path) | Build: -DAMA_ENABLE_AVX512=ON (default OFF). Runtime: ama_cpuid_has_avx512_keccak() (AVX-512F + VL + BW + DQ + XCR0 5+6+7) | ~1.6× over AVX2 4-way on Sapphire Rapids; falls back cleanly to AVX2 4-way otherwise | docs/AVX512_KECCAK_ADR.md, src/c/avx512/ama_sha3_x4_avx512.c, tests/c/test_sha3_avx512_kat.c |
| Keccak-f[1600] (SHA3, SHAKE) | AVX2 4-way (Keccak-f[1600] across 4 SIMD lanes) | Build: default ON. Runtime: ama_cpuid_has_avx2() | ~3-4× over scalar Keccak | src/c/avx2/ama_sha3_x4_avx2.c |
| AES-256-GCM | VAES + VPCLMULQDQ on YMM (4 blocks per AES round, 4-way GHASH reduction) | Build: default ON. Runtime: ama_cpuid_has_vaes_aesgcm() (VAES + VPCLMULQDQ + AVX2 + XCR0). Opt-out: AMA_DISPATCH_NO_VAES=1 | ~1.5-2× at ≥4 KB messages on Ice Lake+ / Zen 4 | src/c/avx2/ama_aes_gcm_vaes.c, tests/c/test_aes_gcm_vaes_equiv.c |
| AES-256-GCM (S-box) | Bitsliced (tower field GF((2^4)^2)) — constant-time default | Build: -DAMA_AES_CONSTTIME=ON (default ON). Hardware fallback also available where AES-NI is present | n/a (correctness — eliminates cache-timing channel) | src/c/ama_aes_bitsliced.c |
| ChaCha20-Poly1305 | 8-way AVX2 ChaCha20 block function (512 B keystream per kernel invocation) | Runtime: ama_cpuid_has_avx2(). Opt-out: AMA_DISPATCH_NO_CHACHA_AVX2=1 | 2.11× at 1 KB, 2.24× at 4 KB, 2.29× at 64 KB; messages < 512 B stay on scalar | src/c/avx2/ama_chacha20_x8_avx2.c, tests/c/test_chacha20poly1305.c |
| Argon2id | 4-way BlaMka G AVX2 (_mm256_mul_epu32 for the multiplication-hardened add; _mm256_permute4x64_epi64 for the diagonal pass) | Runtime: ama_cpuid_has_avx2(). Opt-out: AMA_DISPATCH_NO_ARGON2_AVX2=1 | 1.31× at m=64 KiB, 1.34× at m=1 MiB | src/c/avx2/ama_argon2_g_avx2.c, tests/c/test_argon2id.c |
| Ed25519 sign | Base-point comb table (radix-2^51 fe51 field arithmetic) | Default ON for x86-64 GCC/Clang (fe51.h) | Sign ~5× faster vs the previous scalar path on this host class | src/c/ama_ed25519.c (PR #261) |
| Ed25519 verify | Width-5 wNAF + Shamir's trick (double-scalar-mult variable-time on public-only inputs) | Default ON for x86-64 GCC/Clang | Verify ~2× faster on this host class | src/c/ama_ed25519.c (PR #265) |
| X25519 scalar-mult | fe64 schoolbook + MULX/ADCX/ADOX in-house inline assembly (4-limb radix-2^64 with dual-carry-chain interleave) | Build: per-file -mbmi2 -madx. Runtime: ama_cpuid_has_x25519_mulx() (BMI2 + ADX). Pure-C fe64.h is the fallback | ~21% over pure-C fe64 on the local sandbox; literature 1.8-2.2× on uncontended Skylake+ / Zen+ | src/c/internal/ama_x25519_fe64_mulx.c, tests/c/test_x25519_fe64_mulx_equiv.c |
| X25519 batch-4 | AVX2 4-way Montgomery ladder (donna-32bit field, OPT-IN) | Runtime: only when AMA_DISPATCH_USE_X25519_AVX2=1 (default OFF — scalar fe64 is faster on MULX/ADX hosts) | Off by design on MULX/ADX hosts; reserved for fe51/gf16 fallback hosts and the future AVX-512 IFMA port | src/c/avx2/ama_x25519_avx2.c, tests/test_x25519_dispatch_policy.py |
| ML-DSA-65 / ML-KEM-1024 sampling | 4-way SHAKE128 / SHAKE256 across 4 SIMD lanes; CBD2 noise sampling AVX2-vectorised | Runtime: ama_cpuid_has_avx2() | Throughput-bound by the SHAKE rounds; sign / encaps ~3× faster than the scalar reference on this host | src/c/avx2/ama_*_avx2.c (PR #260) |
| Dispatch auto-tune | Best-of-5 hysteresis (10% reversion threshold) for SHA-3 SIMD vs scalar selection | Opt-out: AMA_DISPATCH_NO_AUTOTUNE=1 | Eliminates AVX2/NEON Keccak revert-to-scalar flakes on shared CI runners | src/c/dispatch/ama_dispatch.c |
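The best-of-5 / 10% hysteresis policy in the last row can be sketched as follows (hypothetical Python rendering of the idea; the real selection logic lives in C in src/c/dispatch/ama_dispatch.c):

```python
import time

def best_of_5_ns(fn):
    """Best (minimum) of 5 timed runs; the min is robust to scheduler noise."""
    best = None
    for _ in range(5):
        t0 = time.perf_counter_ns()
        fn()
        dt = time.perf_counter_ns() - t0
        best = dt if best is None or dt < best else best
    return best

def pick_backend(current, candidate, fn_current, fn_candidate, threshold=0.10):
    """Switch away from `current` only if `candidate` wins by > threshold.

    The 10% hysteresis band prevents flapping between SIMD and scalar when
    the two measurements sit within noise of each other.
    """
    t_cur = best_of_5_ns(fn_current)
    t_cand = best_of_5_ns(fn_candidate)
    return candidate if t_cand < t_cur * (1.0 - threshold) else current

# Toy demonstration: a clearly slower "scalar" stand-in loses to a fast one.
slow = lambda: sum(i * i for i in range(20_000))
fast = lambda: None
print(pick_backend("scalar", "simd", slow, fast))  # simd
```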
Verbose dispatch table at startup: set `AMA_DISPATCH_VERBOSE=1` to print every selected backend (with `(opt-in, off)` annotations on opt-in paths that the runtime advertised but did not select) to stderr on the first crypto call.
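The opt-out variables are read when dispatch first runs, so they must be in the environment before the first crypto call. A sketch of pinning a Python process to scalar paths (the commented-out import name is hypothetical; the env var names are the ones listed in the table above):

```python
import os

# Pin this process to scalar paths. The opt-out variables are honoured at
# runtime, so no rebuild is needed, but they must be set before the first
# crypto call triggers backend selection.
os.environ["AMA_DISPATCH_NO_VAES"] = "1"
os.environ["AMA_DISPATCH_NO_CHACHA_AVX2"] = "1"
os.environ["AMA_DISPATCH_NO_ARGON2_AVX2"] = "1"
os.environ["AMA_DISPATCH_NO_AUTOTUNE"] = "1"
os.environ["AMA_DISPATCH_VERBOSE"] = "1"   # print the selected backends

# import ama_cryptography  # hypothetical module name; import AFTER setting env
print("dispatch pinned to scalar for this process")
```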
## 3R Monitoring Overhead
- Monitoring overhead: < 2% on typical workloads
- Anomaly detection runs asynchronously in the background
- FFT computations use NumPy for batch processing when available
## Reproducing Benchmarks

```bash
# Install dependencies
pip install -e ".[dev,monitoring]"

# Build native library
cmake -B build -DAMA_USE_NATIVE_PQC=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build

# Run benchmark suite
python3 benchmark_suite.py

# Or run the regression runner
python3 benchmarks/benchmark_runner.py -v
```
Results are saved to benchmark_results.json, BENCHMARKS.md, and benchmarks/regression_results.json.
* HMAC-SHA3-256 uses the Cython binding when built (python setup.py build_ext --inplace) — zero marshaling overhead calling native C ama_hmac_sha3_256. Falls back to ctypes when the extension is absent.
Why HMAC numbers look different across paths. Three measurement paths produce three different figures for the same primitive:

- Cython microbenchmark on a 32 B message: ~250k ops/sec on this host (the `benchmark_suite.py` "hmac_auth" row above).
- Pure ctypes on a 1 KB message: ~130k ops/sec (`benchmarks/benchmark_runner.py` → `benchmark-results.json`, baseline 76,215).
- Shared GitHub Actions runner under CI: ~12k ops/sec (much slower, noisier hardware). The `benchmarks/baseline.json` value is set for the CI host and is not a statement about the primitive's performance in general.

All three are measurements of `ama_hmac_sha3_256`. The right number to quote depends on which environment the reader cares about; cite the measurement command alongside the number.
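Much of the spread between these paths is per-call marshaling, not the primitive itself. The effect is easy to reproduce with any cheap C function; here libm's `sqrt` stands in for `ama_hmac_sha3_256` (illustrative only, not the AMA library):

```python
import ctypes
import ctypes.util
import math
import time

# Load libm and declare sqrt's signature (double -> double).
libm = ctypes.CDLL(ctypes.util.find_library("m") or None)
libm.sqrt.argtypes = [ctypes.c_double]
libm.sqrt.restype = ctypes.c_double

def time_per_call_ns(fn, n=100_000):
    """Average wall time per call in nanoseconds."""
    t0 = time.perf_counter_ns()
    for _ in range(n):
        fn(2.0)
    return (time.perf_counter_ns() - t0) / n

native = time_per_call_ns(math.sqrt)   # C-accelerated builtin, minimal marshaling
ffi = time_per_call_ns(libm.sqrt)      # ctypes: argument conversion on every call
print(f"math.sqrt:   {native:.0f} ns/call")
print(f"ctypes sqrt: {ffi:.0f} ns/call")
```

On short messages the per-call overhead dominates, which is why the 32 B Cython figure, the 1 KB ctypes figure, and the CI figure diverge so widely for the same C function.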
## Performance — canonical-host throughput vs. regression floor
The headline ops/sec figures below are the canonical-host measurements
written by benchmarks/benchmark_runner.py --output benchmark-results.json
(the same command CI runs in the "Benchmark Regression Detection" job)
and read from benchmark-results.json by tools/update_docs.py. The
Regression floor column is the value enforced by
benchmarks/baseline.json; CI fails the run when measured throughput
drops more than tolerance_percent below the floor. Both columns are
shown so reviewers see the headline and the safety net side-by-side.
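The gate reduces to a one-line comparison; a hypothetical sketch of the policy (the actual check lives in `benchmarks/benchmark_runner.py` against `benchmarks/baseline.json`):

```python
def regression_gate(measured_ops: float, floor_ops: float,
                    tolerance_percent: float) -> bool:
    """Return True (pass) unless measured throughput drops more than
    tolerance_percent below the regression floor.

    Hypothetical rendering of the policy described above; not the
    runner's actual code.
    """
    return measured_ops >= floor_ops * (1.0 - tolerance_percent / 100.0)

# Ed25519 Sign row: 51,046 ops/sec measured vs a 10,430 floor at ±35%.
print(regression_gate(51_046, 10_430, 35))   # True  (plenty of headroom)
print(regression_gate(6_000, 10_430, 35))    # False (6,000 < 10,430 * 0.65)
```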
To refresh after a benchmark run on the canonical host:

```bash
LD_LIBRARY_PATH=build/lib python3 benchmarks/benchmark_runner.py \
    --output benchmark-results.json \
    --markdown benchmark-report.md
python3 tools/update_docs.py   # regenerates the table below
```
Headline source: `benchmark-results.json` (run 2026-04-27). Regression floor: `benchmarks/baseline.json`. CI fails when measured throughput drops more than the tolerance percentage below the floor; both columns are shown so reviewers can sanity-check the headroom.
| Benchmark | Throughput (ops/sec) | Regression floor (ops/sec) | Tolerance | Tier |
|---|---|---|---|---|
| Ama Sha3 256 Hash | 230,244 | 31,000 | ±35% | microbenchmark |
| Hmac Sha3 256 | 148,565 | 19,500 | ±40% | microbenchmark |
| Ed25519 Keygen | 48,134 | 10,560 | ±35% | microbenchmark |
| Ed25519 Sign | 51,046 | 10,430 | ±35% | microbenchmark |
| Ed25519 Verify | 21,097 | 5,113 | ±35% | microbenchmark |
| Hkdf Derive | 95,433 | 12,500 | ±35% | microbenchmark |
| Full Package Create | 3,813.1 | 200 | ±70% | complex_operation |
| Full Package Verify | 4,055.4 | 700 | ±50% | complex_operation |
| Dilithium Keygen (optional) | 3,331.0 | 1,943 | ±40% | microbenchmark |
| Dilithium Sign (optional) | 1,103.7 | 130 | ±50% | microbenchmark |
| Dilithium Verify (optional) | 7,215.7 | 900 | ±40% | microbenchmark |
| Kyber Keygen (optional) | 5,346.1 | 2,200 | ±40% | microbenchmark |
| Kyber Encapsulate (optional) | 11,688 | 2,400 | ±40% | microbenchmark |
| Aes 256 Gcm Encrypt (optional) | 276,778 | 150,000 | ±40% | microbenchmark |
| Chacha20Poly1305 Encrypt (optional) | 215,256 | 32,000 | ±40% | microbenchmark |
| X25519 Scalarmult (optional) | 17,560 | 13,000 | ±40% | microbenchmark |
| X25519 Scalarmult Batch4 (optional) | 4,112.2 | 2,600 | ±40% | microbenchmark |
See Cryptography Algorithms for algorithm key sizes, or Architecture for the multi-language performance architecture.
## Standards Compliance Note
This library implements algorithms specified in FIPS 203 (ML-KEM), FIPS 204 (ML-DSA), FIPS 205 (SLH-DSA), and FIPS 202 (SHA-3). This implementation has NOT been submitted for CMVP validation and is NOT FIPS 140-3 certified. See CSRC_STANDARDS.md for detailed compliance status.