Performance Benchmarks - Steel-SecAdv-LLC/AMA-Cryptography GitHub Wiki

Authoritative sources: BENCHMARKS.md (generated locally by running python benchmark_suite.py; not checked into version control) is the canonical Python-API benchmark document; benchmark-results.json / benchmark-report.md (generated by python benchmarks/benchmark_runner.py) anchor the CI regression gate; and build/bin/benchmark_c_raw --json reports raw C throughput without any ctypes overhead. The tables on this wiki page are a snapshot refreshed alongside the repository; expect them to shift by roughly ±20% on a different host.

Benchmark results for AMA Cryptography on Linux x86-64. All measurements use the native C library via Python/ctypes unless noted.

Platform: Linux x86-64 | CPU: 16 logical cores (AVX-512F/BW/DQ/VL/VBMI + VAES + VPCLMULQDQ) | Python: 3.11.15 | Date: 2026-04-25 | ML-DSA-65 Backend: native C (no OpenSSL, no liboqs)


Summary Dashboard

| Operation | Mean (ms) | Ops/sec |
|---|---:|---:|
| SHA3-256 (32 B) | 0.001 | 1,002,079 |
| HMAC-SHA3-256 auth (32 B) | 0.004 | 231,090 |
| HMAC-SHA3-256 verify (32 B) | 0.005 | 183,619 |
| HKDF-SHA3-256 (96 B output) | 0.059 | 16,898 |
| Ed25519 keygen | 0.030 | 33,073 |
| Ed25519 sign (240 B) | 0.020 | 50,805 |
| Ed25519 verify (240 B) | 0.049 | 20,559 |
| ML-DSA-65 keygen | 0.280 | 3,574 |
| ML-DSA-65 sign | 0.338 | 2,958 |
| ML-DSA-65 verify | 0.137 | 7,309 |
| KMS generation | 0.425 | 2,353 |
| Package creation (multi-layer) | 0.277 | 3,612 |
| Package verification | 0.230 | 4,348 |

(Output of python benchmark_suite.py — all numbers Python/ctypes path on the measurement host. See the notes below for the difference between these figures and the CI-regression-suite baseline.)
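For orientation, each row's statistics reduce to a small timing loop of roughly this shape (an illustrative sketch, not the actual benchmark_suite.py code; the lambda stands in for whichever ctypes-bound operation is under test):

```python
import statistics
import time

def bench(fn, iterations):
    """Time fn() `iterations` times and report the stats shown in the tables."""
    samples_ms = []
    for _ in range(iterations):
        t0 = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - t0) * 1000.0)
    mean = statistics.mean(samples_ms)
    return {
        "mean_ms": mean,
        "median_ms": statistics.median(samples_ms),
        "stdev_ms": statistics.stdev(samples_ms),
        "ops_per_sec": 1000.0 / mean,
    }

# Replace the lambda with the operation under test, e.g. a SHA3-256 call.
stats = bench(lambda: sum(range(100)), 1000)
```

Ops/sec here is derived from the mean, so a handful of slow outliers pulls the headline figure down even when the median stays flat.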


Key Generation

| Operation | Mean (ms) | Median (ms) | Std Dev (ms) | Ops/sec | Iterations |
|---|---:|---:|---:|---:|---:|
| master_secret | 0.0049 | 0.0043 | 0.0028 | 202,241 | 10,000 |
| hkdf_derivation | 0.0592 | 0.0535 | 0.0156 | 16,898 | 1,000 |
| ed25519_keygen | 0.0302 | 0.0285 | 0.0123 | 33,073 | 1,000 |
| dilithium_keygen | 0.2798 | 0.2755 | 0.0255 | 3,574 | 100 |
| kms_generation | 0.4250 | 0.4011 | 0.0743 | 2,353 | 100 |

Cryptographic Operations

| Operation | Mean (ms) | Median (ms) | Std Dev (ms) | Ops/sec | Iterations |
|---|---:|---:|---:|---:|---:|
| sha3_256 | 0.0010 | 0.0009 | 0.0004 | 1,002,079 | 10,000 |
| hmac_auth | 0.0043 | 0.0040 | 0.0015 | 231,090 | 10,000 |
| hmac_verify | 0.0054 | 0.0049 | 0.0015 | 183,619 | 10,000 |
| ed25519_sign | 0.0197 | 0.0172 | 0.0044 | 50,805 | 1,000 |
| ed25519_verify | 0.0486 | 0.0446 | 0.0086 | 20,559 | 1,000 |
| dilithium_sign | 0.3381 | 0.3345 | 0.0178 | 2,958 | 100 |
| dilithium_verify | 0.1368 | 0.1341 | 0.0146 | 7,309 | 100 |

Package Operations (Multi-Layer)

| Operation | Mean (ms) | Median (ms) | Std Dev (ms) | Ops/sec | Iterations |
|---|---:|---:|---:|---:|---:|
| canonical_encoding | 0.0015 | 0.0014 | 0.0006 | 657,855 | 10,000 |
| code_hash | 0.0154 | 0.0140 | 0.0038 | 65,012 | 10,000 |
| package_creation | 0.2769 | 0.2659 | 0.0975 | 3,612 | 100 |
| package_verification | 0.2300 | 0.2247 | 0.0279 | 4,348 | 100 |

Ethical Integration Overhead

| Operation | Mean (ms) | Ops/sec |
|---|---:|---:|
| ethical_context | 0.0046 | 218,867 |
| hkdf_standard | 0.0079 | 126,715 |
| hkdf_with_ethical | 0.0218 | 45,951 |

Ethical context overhead: 0.0139 ms ≈ 13.9 µs wall time per derivation (≈ 2.8× standard-HKDF latency), i.e. well under a millisecond; a negligible per-operation cost at the throughput listed above (45,951 ops/sec).
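The quoted overhead and ratio follow directly from the table's mean latencies (a sanity-check sketch; the two constants are the measured means from the table):

```python
# Mean latencies from the ethical-integration table above, in milliseconds.
hkdf_standard_ms = 0.0079       # HKDF without ethical context
hkdf_with_ethical_ms = 0.0218   # HKDF with ethical context bound in

overhead_ms = hkdf_with_ethical_ms - hkdf_standard_ms  # absolute cost per call
ratio = hkdf_with_ethical_ms / hkdf_standard_ms        # relative slowdown

print(f"{overhead_ms * 1000:.1f} us overhead, {ratio:.1f}x standard HKDF")
# prints: 13.9 us overhead, 2.8x standard HKDF
```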


Scalability (Package Creation by Input Size)

| Input Scale | Mean (ms) | Ops/sec | Iterations |
|---|---:|---:|---:|
| 1x (baseline) | 0.3865 | 2,587 | 50 |
| 10x | 0.5393 | 1,854 | 50 |
| 100x | 2.9212 | 342 | 50 |
| 1000x | 94.7441 | 11 | 50 |

Performance Notes

Cython Acceleration

When built with Cython (python setup.py build_ext --inplace), mathematical operations in the 3R monitoring engine (Lyapunov stability, helical computations, NTT polynomial operations) show:

  • 18–37x speedup over the pure Python mathematical baseline
  • NumPy-integrated batch operations

Cython acceleration does not affect C-implemented cryptographic primitives (they are already native). The speedup comparison baseline is pure Python loops — not the native C library.

Algorithm Comparison

| Algorithm | Sign (ms) | Verify (ms) | Sig Size |
|---|---:|---:|---:|
| Ed25519 | 0.09 | 0.14 | 64 bytes |
| ML-DSA-65 | 0.53 | 0.15 | 3,309 bytes |
| Hybrid (Ed25519 + ML-DSA-65) | ~0.62 | ~0.29 | 3,373 bytes |
| SPHINCS+-SHA2-256f | ~230 | ~5.90 | 49,856 bytes |

ML-DSA-65 is ~6× slower to sign than Ed25519 on this host (pre-SIMD scalar NTT path) but provides NIST category III quantum security. Sign/verify latency shifts substantially with CPU microarchitecture — re-run benchmark_suite.py on your deployment host before quoting numbers externally.
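A hybrid signature runs both component algorithms and concatenates the results, so its table row is, to within measurement rounding, the sum of the two component rows; a quick check on the figures above:

```python
# Component figures from the comparison table above (this host).
ed25519 = {"sign_ms": 0.09, "verify_ms": 0.14, "sig_bytes": 64}
mldsa65 = {"sign_ms": 0.53, "verify_ms": 0.15, "sig_bytes": 3309}

# Hybrid = do both: latencies and signature sizes simply add.
hybrid = {k: ed25519[k] + mldsa65[k] for k in ed25519}
# sign ~0.62 ms, verify ~0.29 ms, signature 3,373 bytes
```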

X25519 Field-Path Selection (3.0.0)

The X25519 Montgomery ladder now picks its field-arithmetic representation deterministically at compile time:

| Toolchain / target | Path | Layout |
|---|---|---|
| x86-64 GCC/Clang + __int128 | fe64 | radix 2^64, 4 limbs of uint64_t |
| Other 64-bit GCC/Clang + __int128 (aarch64, ppc64le, …) | fe51 | radix 2^51, 5 limbs |
| MSVC, clang-cl, 32-bit, no __int128 | gf16 | radix 2^16, 16 limbs of int64_t |

Verify which path the local build picked:

```shell
./build/bin/test_x25519_path
```

The two __int128 paths are byte-for-byte arithmetic equivalent — see tests/c/test_x25519_field_equiv.c, which runs 1024 random (scalar, point) vectors through both ladders and asserts every output matches.
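The same cross-checking pattern can be expressed from Python. The harness below is a sketch of the shape of that test, not a binding to it; its two arguments are hypothetical stand-ins for the fe64 and fe51 ladders (any pair of bytes-to-bytes functions fits):

```python
import os

def equivalence_check(scalarmult_a, scalarmult_b, vectors=1024):
    """Run identical random (scalar, point) inputs through both ladder
    implementations and require byte-identical outputs (the same shape as
    tests/c/test_x25519_field_equiv.c)."""
    for _ in range(vectors):
        scalar, point = os.urandom(32), os.urandom(32)
        if scalarmult_a(scalar, point) != scalarmult_b(scalar, point):
            return False
    return True
```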

On a Sapphire Rapids canonical-host run with benchmark_c_raw, the previous pure-C fe64 path measured ~11,500 X25519 DH ops/sec vs ~21,800 for fe51 on the same hardware (Linux, GCC 12, -O3 -march=native). The radix-2^64 schoolbook trails the radix-2^51 carry-pipelined layout in pure C because GCC does not yet generate MULX+ADCX (BMI2+ADX) for the 4×4 schoolbook pattern.
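The two layouts differ only in how a field element is split into limbs; a toy pack/round-trip in Python (no carries or modular reduction, purely illustrative of the radix choice):

```python
MASK51 = (1 << 51) - 1
MASK64 = (1 << 64) - 1

def to_fe51(x):
    """Radix-2^51: five 51-bit limbs (5 * 51 = 255 bits)."""
    return [(x >> (51 * i)) & MASK51 for i in range(5)]

def to_fe64(x):
    """Radix-2^64: four full 64-bit limbs (4 * 64 = 256 bits)."""
    return [(x >> (64 * i)) & MASK64 for i in range(4)]

def from_limbs(limbs, radix_bits):
    return sum(limb << (radix_bits * i) for i, limb in enumerate(limbs))

x = 2**255 - 19 - 1  # largest element of GF(2^255 - 19)
assert from_limbs(to_fe51(x), 51) == x
assert from_limbs(to_fe64(x), 64) == x
```

The fe51 layout leaves 13 spare bits per 64-bit word, so several additions can be deferred before carrying; fe64 packs tightly and must propagate carries through explicit carry-chain instructions, which is exactly where MULX/ADCX help.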

X25519 fe64 MULX+ADX kernel (3.0.0, PR D)

When CPUID reports both BMI2 (CPUID.(EAX=7,ECX=0):EBX[8]) and ADX (EBX[19]), the dispatcher promotes the inner ladder's multiply / square to the in-house MULX+ADCX/ADOX kernel in src/c/internal/ama_x25519_fe64_mulx.c, compiled with per-file -mbmi2 -madx -O3 flags. Bundle gate: ama_cpuid_has_x25519_mulx() (defensive: gates each bit explicitly even though every shipped Intel Broadwell+ / AMD Zen+ part has both).

On the same canonical-host class with the kernel active, this build's benchmark sandbox measures ~13,168 X25519 DH ops/sec via the Python C-FFI runner (or ~13,988 ops/sec when the C-raw harness amortises the FFI layer away); a real ~21 % improvement over the pure-C fe64 baseline. Byte-equivalence to pure-C fe64 is asserted across 4096 / 4096 random vectors by tests/c/test_x25519_fe64_mulx_equiv.c (skips with code 77 on hosts whose CPUID lacks the bundle).

The kernel is implemented as GCC/Clang inline assembly with explicit mulx (BMI2) plus adcx / adox (ADX) instructions, not via _mulx_u64 + _addcarry_u64 intrinsics. The inline-asm path exists specifically because GCC's _addcarry_u64 did not lower to ADCX/ADOX even under -madx; without the explicit mnemonics the kernel's lo-column and hi-column carry chains would serialise through a single adc chain instead of running in parallel.

The kernel also ships a dedicated squaring path that exploits the off-diagonal symmetry of (sum f_i)^2: 10 multiplications (6 cross products doubled + 4 diagonal squares) versus 16 for the full schoolbook.

Since the active kernel is already a hand-written inline-asm path using BMI2 MULX plus explicit ADX carry-chain interleave behind the CPUID gate, the remaining gap to the ~25K ops/sec reported by OpenSSL's hand-tuned crypto/ec/asm/x25519-x86_64.pl on the same microarchitecture class reflects broader implementation differences (instruction scheduling, register allocation, reduction shape, and surrounding glue), not reliance on compiler-lowered intrinsics or the absence of a hand-written asm kernel.

fe51 remains available as a fallback by building with -DAMA_X25519_FORCE_FE51; the pure-C fe64 schoolbook still runs on hosts whose CPUID lacks BMI2 + ADX (e.g. a KVM guest with the bits masked, a pre-Broadwell host, or any MSVC build; the kernel TU is GCC/Clang only).
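The 10-vs-16 multiplication count from the squaring symmetry can be checked with a toy 4-limb model on Python integers (no carries, no reduction; limb products simply accumulate into double-width columns):

```python
def square_schoolbook(f):
    """Full 4x4 schoolbook: 16 single-limb multiplications."""
    r = [0] * 7
    for i in range(4):
        for j in range(4):
            r[i + j] += f[i] * f[j]
    return r

def square_symmetric(f):
    """Exploit f_i*f_j == f_j*f_i: 6 doubled cross products + 4 squares = 10."""
    r = [0] * 7
    for i in range(4):
        for j in range(i + 1, 4):
            r[i + j] += 2 * f[i] * f[j]  # each cross term appears twice
    for i in range(4):
        r[2 * i] += f[i] * f[i]
    return r

limbs = [3, 1, 4, 1]
assert square_schoolbook(limbs) == square_symmetric(limbs)
```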

X25519 4-way batch API (3.0.0, currently opt-in)

ama_x25519_scalarmult_batch(out[], scalars[], points[], count) is a new additive API that exposes a 4-way AVX2 Montgomery-ladder kernel for batched Diffie-Hellman. ama_print_dispatch_info() reports its capability row as X25519 4-way: AVX2 (opt-in, off) whenever the host has AVX2 but the kernel pointer is unwired — the kernel does not light up automatically. Set AMA_DISPATCH_USE_X25519_AVX2=1 in the environment to opt in.

Why opt-in: on x86-64 hosts where the scalar X25519 path is fe64 + MULX/ADX (Broadwell+ Intel, Zen+ AMD), four sequential scalar ladders are faster than four lanes of the AVX2 donna-32bit ladder. The kernel uses 32-bit limbs because AVX2 lacks a 64×64→128 lane-wise multiply (that arrived with AVX-512 IFMA's VPMADD52LUQ / VPMADD52HUQ); donna-32bit's larger cross-product schedule outpaces the 4× SIMD width on Skylake-class cores. See the CHANGELOG [3.0.0] Performance entry for the per-op measurement and the full retention rationale (constant-time test lane, CI matrix coverage, fe51/gf16 fallback hosts, and the planned AVX-512 IFMA port that closes the gap).

SIMD Acceleration Paths (3.0.0)

Every SIMD path below is gated on a runtime CPUID check (with the appropriate XCR0 state-save check where the path uses ZMM or YMM registers), built with per-file ISA flags via set_source_files_properties so the rest of the library stays at the lowest-common-denominator ISA, and verified against a scalar reference for byte-identity by the matching tests/c/test_* and tests/c/test_*_equiv.c lanes. Opt-out env vars are honoured at runtime so an operator can pin to scalar without rebuilding.

| Primitive | Engineered path | Build / runtime gate | Speedup vs scalar | Reference |
|---|---|---|---|---|
| Keccak-f[1600] (SHA3, SHAKE) | AVX-512 4-way (vprolq + vpternlogq, EVEX-encoded YMM, no ZMM in the hot path) | Build: -DAMA_ENABLE_AVX512=ON (default OFF). Runtime: ama_cpuid_has_avx512_keccak() (AVX-512F + VL + BW + DQ + XCR0 5+6+7) | ~1.6× over AVX2 4-way on Sapphire Rapids; falls back cleanly to AVX2 4-way otherwise | docs/AVX512_KECCAK_ADR.md, src/c/avx512/ama_sha3_x4_avx512.c, tests/c/test_sha3_avx512_kat.c |
| Keccak-f[1600] (SHA3, SHAKE) | AVX2 4-way (Keccak-f[1600] across 4 SIMD lanes) | Build: default ON. Runtime: ama_cpuid_has_avx2() | ~3-4× over scalar Keccak | src/c/avx2/ama_sha3_x4_avx2.c |
| AES-256-GCM | VAES + VPCLMULQDQ on YMM (4 blocks per AES round, 4-way GHASH reduction) | Build: default ON. Runtime: ama_cpuid_has_vaes_aesgcm() (VAES + VPCLMULQDQ + AVX2 + XCR0). Opt-out: AMA_DISPATCH_NO_VAES=1 | ~1.5-2× at ≥4 KB messages on Ice Lake+ / Zen 4 | src/c/avx2/ama_aes_gcm_vaes.c, tests/c/test_aes_gcm_vaes_equiv.c |
| AES-256-GCM (S-box) | Bitsliced (tower field GF((2^4)^2)), constant-time default | Build: -DAMA_AES_CONSTTIME=ON (default ON). Hardware fallback also available where AES-NI is present | n/a (correctness: eliminates cache-timing channel) | src/c/ama_aes_bitsliced.c |
| ChaCha20-Poly1305 | 8-way AVX2 ChaCha20 block function (512 B keystream per kernel invocation) | Runtime: ama_cpuid_has_avx2(). Opt-out: AMA_DISPATCH_NO_CHACHA_AVX2=1 | 2.11× at 1 KB, 2.24× at 4 KB, 2.29× at 64 KB; messages < 512 B stay on scalar | src/c/avx2/ama_chacha20_x8_avx2.c, tests/c/test_chacha20poly1305.c |
| Argon2id | 4-way BlaMka G AVX2 (_mm256_mul_epu32 for the multiplication-hardened add; _mm256_permute4x64_epi64 for the diagonal pass) | Runtime: ama_cpuid_has_avx2(). Opt-out: AMA_DISPATCH_NO_ARGON2_AVX2=1 | 1.31× at m=64 KiB, 1.34× at m=1 MiB | src/c/avx2/ama_argon2_g_avx2.c, tests/c/test_argon2id.c |
| Ed25519 sign | Base-point comb table (radix-2^51 fe51 field arithmetic) | Default ON for x86-64 GCC/Clang (fe51.h) | Sign ~5× faster vs the previous scalar path on this host class | src/c/ama_ed25519.c (PR #261) |
| Ed25519 verify | Width-5 wNAF + Shamir's trick (double-scalar-mult, variable-time on public-only inputs) | Default ON for x86-64 GCC/Clang | Verify ~2× faster on this host class | src/c/ama_ed25519.c (PR #265) |
| X25519 scalar-mult | fe64 schoolbook + MULX/ADCX/ADOX in-house inline assembly (4-limb radix-2^64 with dual-carry-chain interleave) | Build: per-file -mbmi2 -madx. Runtime: ama_cpuid_has_x25519_mulx() (BMI2 + ADX). Pure-C fe64.h is the fallback | ~21% over pure-C fe64 on the local sandbox; literature 1.8-2.2× on uncontended Skylake+ / Zen+ | src/c/internal/ama_x25519_fe64_mulx.c, tests/c/test_x25519_fe64_mulx_equiv.c |
| X25519 batch-4 | AVX2 4-way Montgomery ladder (donna-32bit field, OPT-IN) | Runtime: only when AMA_DISPATCH_USE_X25519_AVX2=1 (default OFF; scalar fe64 is faster on MULX/ADX hosts) | Off by design on MULX/ADX hosts; reserved for fe51/gf16 fallback hosts and the future AVX-512 IFMA port | src/c/avx2/ama_x25519_avx2.c, tests/test_x25519_dispatch_policy.py |
| ML-DSA-65 / ML-KEM-1024 sampling | 4-way SHAKE128 / SHAKE256 across 4 SIMD lanes; CBD2 noise sampling AVX2-vectorised | Runtime: ama_cpuid_has_avx2() | Throughput-bound by the SHAKE rounds; sign / encaps ~3× faster than the scalar reference on this host | src/c/avx2/ama_*_avx2.c (PR #260) |
| Dispatch auto-tune | Best-of-5 hysteresis (10% reversion threshold) for SHA-3 SIMD vs scalar selection | Opt-out: AMA_DISPATCH_NO_AUTOTUNE=1 | Eliminates AVX2/NEON Keccak revert-to-scalar flakes on shared CI runners | src/c/dispatch/ama_dispatch.c |

Verbose dispatch table at startup: AMA_DISPATCH_VERBOSE=1 prints every selected backend (and (opt-in, off) annotations on opt-in paths that the runtime advertised but did not select) on first crypto call to stderr.
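Schematically, every backend selection above reduces to the same rule: CPUID gate first, then the operator opt-out, else scalar fallback. A Python sketch of that logic for the ChaCha20 row (the function name and the cpu_has_avx2 flag are illustrative; only the env var is the real documented knob):

```python
import os

def select_chacha20_backend(cpu_has_avx2):
    """Mirror of the dispatch rule: hardware gate, then env opt-out,
    else fall back to the scalar reference implementation."""
    opted_out = os.environ.get("AMA_DISPATCH_NO_CHACHA_AVX2") == "1"
    if cpu_has_avx2 and not opted_out:
        return "chacha20_x8_avx2"
    return "chacha20_scalar"
```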

3R Monitoring Overhead

  • Monitoring overhead: < 2% on typical workloads
  • Anomaly detection runs asynchronously in the background
  • FFT computations use NumPy for batch processing when available

Reproducing Benchmarks

```shell
# Install dependencies
pip install -e ".[dev,monitoring]"

# Build native library
cmake -B build -DAMA_USE_NATIVE_PQC=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build

# Run benchmark suite
python3 benchmark_suite.py

# Or run the regression runner
python3 benchmarks/benchmark_runner.py -v
```

Results are saved to benchmark_results.json, BENCHMARKS.md, and benchmarks/regression_results.json.


* HMAC-SHA3-256 uses the Cython binding when built (python setup.py build_ext --inplace), which calls the native C ama_hmac_sha3_256 with zero marshaling overhead; it falls back to ctypes when the extension is absent.

Why HMAC numbers look different across paths. Three measurement paths produce three different figures for the same primitive:

  • Cython microbenchmark on a 32 B message: ~250k ops/sec on this host (benchmark_suite.py "hmac_auth" column above).
  • Pure ctypes on a 1 KB message: ~130k ops/sec (benchmarks/benchmark_runner.py, recorded in benchmark-results.json; baseline 76,215).
  • Shared GitHub Actions runner under CI: ~12k ops/sec (much slower, noisier hardware). The benchmarks/baseline.json value is set for the CI host and is not a statement about the primitive's performance in general.

All three are measurements of ama_hmac_sha3_256. The right number to quote depends on which environment the reader cares about; cite the measurement command alongside the number.


Performance — canonical-host throughput vs. regression floor

The headline ops/sec figures below are the canonical-host measurements written by benchmarks/benchmark_runner.py --output benchmark-results.json (the same command CI runs in the "Benchmark Regression Detection" job) and read from benchmark-results.json by tools/update_docs.py. The Regression floor column is the value enforced by benchmarks/baseline.json; CI fails the run when measured throughput drops more than tolerance_percent below the floor. Both columns are shown so reviewers see the headline and the safety net side-by-side.
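The gate itself reduces to a one-line predicate. The sketch below is the rule as described, not the runner's actual code (passes_gate is a hypothetical name), exercised with the Hmac Sha3 256 row's numbers:

```python
def passes_gate(measured_ops, floor_ops, tolerance_percent):
    """Pass while measured throughput stays within tolerance of the floor;
    fail only when it drops more than tolerance_percent below the floor."""
    return measured_ops >= floor_ops * (1 - tolerance_percent / 100.0)

# Hmac Sha3 256: floor 19,500 ops/sec, tolerance 40% -> hard fail line at 11,700.
assert passes_gate(148_565, 19_500, 40)      # canonical-host headroom is large
assert not passes_gate(11_000, 19_500, 40)   # a regression below the fail line
```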

To refresh after a benchmark run on the canonical host:

```shell
LD_LIBRARY_PATH=build/lib python3 benchmarks/benchmark_runner.py \
    --output benchmark-results.json \
    --markdown benchmark-report.md
python3 tools/update_docs.py        # regenerates the table below
```

Headline source: benchmark-results.json (run 2026-04-27). Regression floor: benchmarks/baseline.json. CI fails when measured throughput drops more than the tolerance below the floor; both columns are shown so reviewers can sanity-check the headroom.

| Benchmark | Throughput (ops/sec) | Regression floor (ops/sec) | Tolerance | Tier |
|---|---:|---:|---:|---|
| Ama Sha3 256 Hash | 230,244 | 31,000 | ±35% | microbenchmark |
| Hmac Sha3 256 | 148,565 | 19,500 | ±40% | microbenchmark |
| Ed25519 Keygen | 48,134 | 10,560 | ±35% | microbenchmark |
| Ed25519 Sign | 51,046 | 10,430 | ±35% | microbenchmark |
| Ed25519 Verify | 21,097 | 5,113 | ±35% | microbenchmark |
| Hkdf Derive | 95,433 | 12,500 | ±35% | microbenchmark |
| Full Package Create | 3,813.1 | 200 | ±70% | complex_operation |
| Full Package Verify | 4,055.4 | 700 | ±50% | complex_operation |
| Dilithium Keygen (optional) | 3,331.0 | 1,943 | ±40% | microbenchmark |
| Dilithium Sign (optional) | 1,103.7 | 130 | ±50% | microbenchmark |
| Dilithium Verify (optional) | 7,215.7 | 900 | ±40% | microbenchmark |
| Kyber Keygen (optional) | 5,346.1 | 2,200 | ±40% | microbenchmark |
| Kyber Encapsulate (optional) | 11,688 | 2,400 | ±40% | microbenchmark |
| Aes 256 Gcm Encrypt (optional) | 276,778 | 150,000 | ±40% | microbenchmark |
| Chacha20Poly1305 Encrypt (optional) | 215,256 | 32,000 | ±40% | microbenchmark |
| X25519 Scalarmult (optional) | 17,560 | 13,000 | ±40% | microbenchmark |
| X25519 Scalarmult Batch4 (optional) | 4,112.2 | 2,600 | ±40% | microbenchmark |

See Cryptography Algorithms for algorithm key sizes, or Architecture for the multi-language performance architecture.


Standards Compliance Note

This library implements algorithms specified in FIPS 203 (ML-KEM), FIPS 204 (ML-DSA), FIPS 205 (SLH-DSA), and FIPS 202 (SHA-3). This implementation has NOT been submitted for CMVP validation and is NOT FIPS 140-3 certified. See CSRC_STANDARDS.md for detailed compliance status.