Performance Benchmarks - Steel-SecAdv-LLC/AMA-Cryptography GitHub Wiki

Authoritative sources: BENCHMARKS.md (generated locally by running python benchmark_suite.py; not checked into version control) is the canonical Python-API benchmark document; benchmark-results.json / benchmark-report.md (generated by python benchmarks/benchmark_runner.py) anchor the CI regression gate; and build/bin/benchmark_c_raw --json reports raw C throughput without any ctypes overhead. The tables on this wiki page are a snapshot refreshed alongside the repository; expect them to shift by roughly ±20% on a different host.

Benchmark results for AMA Cryptography on Linux x86-64. All measurements use the native C library via Python/ctypes unless noted.

Platform: Linux x86-64 | CPU: 16 logical cores (AVX-512F/BW/DQ/VL/VBMI + VAES + VPCLMULQDQ) | Python: 3.11.15 | Date: 2026-04-25 | ML-DSA-65 Backend: native C (no OpenSSL, no liboqs)


Summary Dashboard

| Operation | Mean (ms) | Ops/sec |
|---|---:|---:|
| SHA3-256 (32 B) | 0.001 | 1,002,079 |
| HMAC-SHA3-256 auth (32 B) | 0.004 | 231,090 |
| HMAC-SHA3-256 verify (32 B) | 0.005 | 183,619 |
| HKDF-SHA3-256 (96 B output) | 0.059 | 16,898 |
| Ed25519 keygen | 0.030 | 33,073 |
| Ed25519 sign (240 B) | 0.020 | 50,805 |
| Ed25519 verify (240 B) | 0.049 | 20,559 |
| ML-DSA-65 keygen | 0.280 | 3,574 |
| ML-DSA-65 sign | 0.338 | 2,958 |
| ML-DSA-65 verify | 0.137 | 7,309 |
| KMS generation | 0.425 | 2,353 |
| Package creation (multi-layer) | 0.277 | 3,612 |
| Package verification | 0.230 | 4,348 |

(Output of python benchmark_suite.py — all numbers Python/ctypes path on the measurement host. See the notes below for the difference between these figures and the CI-regression-suite baseline.)
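For orientation, each row's statistics reduce to a small timing loop of roughly this shape (an illustrative sketch, not the actual benchmark_suite.py code; the lambda stands in for whichever ctypes-bound operation is under test):

```python
import statistics
import time

def bench(fn, iterations):
    """Time fn() `iterations` times and report the stats shown in the tables."""
    samples_ms = []
    for _ in range(iterations):
        t0 = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - t0) * 1000.0)
    mean = statistics.mean(samples_ms)
    return {
        "mean_ms": mean,
        "median_ms": statistics.median(samples_ms),
        "stdev_ms": statistics.stdev(samples_ms),
        "ops_per_sec": 1000.0 / mean,
    }

# Replace the lambda with the operation under test, e.g. a SHA3-256 call.
stats = bench(lambda: sum(range(100)), 1000)
```

Ops/sec here is derived from the mean, so a handful of slow outliers pulls the headline figure down even when the median stays flat.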


Key Generation

| Operation | Mean (ms) | Median (ms) | Std Dev (ms) | Ops/sec | Iterations |
|---|---:|---:|---:|---:|---:|
| master_secret | 0.0049 | 0.0043 | 0.0028 | 202,241 | 10,000 |
| hkdf_derivation | 0.0592 | 0.0535 | 0.0156 | 16,898 | 1,000 |
| ed25519_keygen | 0.0302 | 0.0285 | 0.0123 | 33,073 | 1,000 |
| dilithium_keygen | 0.2798 | 0.2755 | 0.0255 | 3,574 | 100 |
| kms_generation | 0.4250 | 0.4011 | 0.0743 | 2,353 | 100 |

Cryptographic Operations

| Operation | Mean (ms) | Median (ms) | Std Dev (ms) | Ops/sec | Iterations |
|---|---:|---:|---:|---:|---:|
| sha3_256 | 0.0010 | 0.0009 | 0.0004 | 1,002,079 | 10,000 |
| hmac_auth | 0.0043 | 0.0040 | 0.0015 | 231,090 | 10,000 |
| hmac_verify | 0.0054 | 0.0049 | 0.0015 | 183,619 | 10,000 |
| ed25519_sign | 0.0197 | 0.0172 | 0.0044 | 50,805 | 1,000 |
| ed25519_verify | 0.0486 | 0.0446 | 0.0086 | 20,559 | 1,000 |
| dilithium_sign | 0.3381 | 0.3345 | 0.0178 | 2,958 | 100 |
| dilithium_verify | 0.1368 | 0.1341 | 0.0146 | 7,309 | 100 |

Package Operations (Multi-Layer)

| Operation | Mean (ms) | Median (ms) | Std Dev (ms) | Ops/sec | Iterations |
|---|---:|---:|---:|---:|---:|
| canonical_encoding | 0.0015 | 0.0014 | 0.0006 | 657,855 | 10,000 |
| code_hash | 0.0154 | 0.0140 | 0.0038 | 65,012 | 10,000 |
| package_creation | 0.2769 | 0.2659 | 0.0975 | 3,612 | 100 |
| package_verification | 0.2300 | 0.2247 | 0.0279 | 4,348 | 100 |

Ethical Integration Overhead

| Operation | Mean (ms) | Ops/sec |
|---|---:|---:|
| ethical_context | 0.0046 | 218,867 |
| hkdf_standard | 0.0079 | 126,715 |
| hkdf_with_ethical | 0.0218 | 45,951 |

Ethical context overhead: 0.0139 ms ≈ 13.9 µs wall time per derivation (≈ 2.8× standard-HKDF latency), i.e. well under a millisecond; a negligible per-operation cost at the throughput listed above (45,951 ops/sec).
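The quoted overhead and ratio follow directly from the table's mean latencies (a sanity-check sketch; the two constants are the measured means from the table):

```python
# Mean latencies from the ethical-integration table above, in milliseconds.
hkdf_standard_ms = 0.0079       # HKDF without ethical context
hkdf_with_ethical_ms = 0.0218   # HKDF with ethical context bound in

overhead_ms = hkdf_with_ethical_ms - hkdf_standard_ms  # absolute cost per call
ratio = hkdf_with_ethical_ms / hkdf_standard_ms        # relative slowdown

print(f"{overhead_ms * 1000:.1f} us overhead, {ratio:.1f}x standard HKDF")
# prints: 13.9 us overhead, 2.8x standard HKDF
```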


Scalability (Package Creation by Input Size)

| Input Scale | Mean (ms) | Ops/sec | Iterations |
|---|---:|---:|---:|
| 1x (baseline) | 0.3865 | 2,587 | 50 |
| 10x | 0.5393 | 1,854 | 50 |
| 100x | 2.9212 | 342 | 50 |
| 1000x | 94.7441 | 11 | 50 |

Performance Notes

Cython Acceleration

When built with Cython (python setup.py build_ext --inplace), mathematical operations in the 3R monitoring engine (Lyapunov stability, helical computations, NTT polynomial operations) show:

  • 18–37x speedup over the pure Python mathematical baseline
  • NumPy-integrated batch operations

Cython acceleration does not affect C-implemented cryptographic primitives (they are already native). The speedup comparison baseline is pure Python loops — not the native C library.

Algorithm Comparison

| Algorithm | Sign (ms) | Verify (ms) | Sig Size |
|---|---:|---:|---:|
| Ed25519 | 0.09 | 0.14 | 64 bytes |
| ML-DSA-65 | 0.53 | 0.15 | 3,309 bytes |
| Hybrid (Ed25519 + ML-DSA-65) | ~0.62 | ~0.29 | 3,373 bytes |
| SPHINCS+-SHA2-256f | ~230 | ~5.90 | 49,856 bytes |

ML-DSA-65 is ~6× slower to sign than Ed25519 on this host (pre-SIMD scalar NTT path) but provides NIST category III quantum security. Sign/verify latency shifts substantially with CPU microarchitecture — re-run benchmark_suite.py on your deployment host before quoting numbers externally.
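A hybrid signature runs both component algorithms and concatenates the results, so its table row is, to within measurement rounding, the sum of the two component rows; a quick check on the figures above:

```python
# Component figures from the comparison table above (this host).
ed25519 = {"sign_ms": 0.09, "verify_ms": 0.14, "sig_bytes": 64}
mldsa65 = {"sign_ms": 0.53, "verify_ms": 0.15, "sig_bytes": 3309}

# Hybrid = do both: latencies and signature sizes simply add.
hybrid = {k: ed25519[k] + mldsa65[k] for k in ed25519}
# sign ~0.62 ms, verify ~0.29 ms, signature 3,373 bytes
```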

X25519 Field-Path Selection (3.0.0)

The X25519 Montgomery ladder now picks its field-arithmetic representation deterministically at compile time:

| Toolchain / target | Path | Layout |
|---|---|---|
| x86-64 GCC/Clang + __int128 | fe64 | radix 2^64, 4 limbs of uint64_t |
| Other 64-bit GCC/Clang + __int128 (aarch64, ppc64le, …) | fe51 | radix 2^51, 5 limbs |
| MSVC, clang-cl, 32-bit, no __int128 | gf16 | radix 2^16, 16 limbs of int64_t |

Verify which path the local build picked:

```shell
./build/bin/test_x25519_path
```

The two __int128 paths are byte-for-byte arithmetic equivalent — see tests/c/test_x25519_field_equiv.c, which runs 1024 random (scalar, point) vectors through both ladders and asserts every output matches.
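The same cross-checking pattern can be expressed from Python. The harness below is a sketch of the shape of that test, not a binding to it; its two arguments are hypothetical stand-ins for the fe64 and fe51 ladders (any pair of bytes-to-bytes functions fits):

```python
import os

def equivalence_check(scalarmult_a, scalarmult_b, vectors=1024):
    """Run identical random (scalar, point) inputs through both ladder
    implementations and require byte-identical outputs (the same shape as
    tests/c/test_x25519_field_equiv.c)."""
    for _ in range(vectors):
        scalar, point = os.urandom(32), os.urandom(32)
        if scalarmult_a(scalar, point) != scalarmult_b(scalar, point):
            return False
    return True
```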

On a Sapphire Rapids canonical-host run with benchmark_c_raw, the previous pure-C fe64 path measured ~11,500 X25519 DH ops/sec vs ~21,800 for fe51 on the same hardware (Linux, GCC 12, -O3 -march=native). The radix-2^64 schoolbook trails the radix-2^51 carry-pipelined layout in pure C because GCC does not yet generate MULX+ADCX (BMI2+ADX) for the 4×4 schoolbook pattern.
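The two layouts differ only in how a field element is split into limbs; a toy pack/round-trip in Python (no carries or modular reduction, purely illustrative of the radix choice):

```python
MASK51 = (1 << 51) - 1
MASK64 = (1 << 64) - 1

def to_fe51(x):
    """Radix-2^51: five 51-bit limbs (5 * 51 = 255 bits)."""
    return [(x >> (51 * i)) & MASK51 for i in range(5)]

def to_fe64(x):
    """Radix-2^64: four full 64-bit limbs (4 * 64 = 256 bits)."""
    return [(x >> (64 * i)) & MASK64 for i in range(4)]

def from_limbs(limbs, radix_bits):
    return sum(limb << (radix_bits * i) for i, limb in enumerate(limbs))

x = 2**255 - 19 - 1  # largest element of GF(2^255 - 19)
assert from_limbs(to_fe51(x), 51) == x
assert from_limbs(to_fe64(x), 64) == x
```

The fe51 layout leaves 13 spare bits per 64-bit word, so several additions can be deferred before carrying; fe64 packs tightly and must propagate carries through explicit carry-chain instructions, which is exactly where MULX/ADCX help.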

X25519 fe64 MULX+ADX kernel (3.0.0, PR D)

When CPUID reports both BMI2 (CPUID.(EAX=7,ECX=0):EBX[8]) and ADX (EBX[19]), the dispatcher promotes the inner ladder's multiply / square to the in-house MULX+ADCX/ADOX kernel in src/c/internal/ama_x25519_fe64_mulx.c, compiled with per-file -mbmi2 -madx -O3 flags. Bundle gate: ama_cpuid_has_x25519_mulx() (defensive: gates each bit explicitly even though every shipped Intel Broadwell+ / AMD Zen+ part has both).

On the same canonical-host class with the kernel active, this build's benchmark sandbox measures ~13,168 X25519 DH ops/sec via the Python C-FFI runner (or ~13,988 ops/sec when the C-raw harness amortises the FFI layer away); a real ~21 % improvement over the pure-C fe64 baseline. Byte-equivalence to pure-C fe64 is asserted across 4096 / 4096 random vectors by tests/c/test_x25519_fe64_mulx_equiv.c (skips with code 77 on hosts whose CPUID lacks the bundle).

The kernel is implemented as GCC/Clang inline assembly with explicit mulx (BMI2) plus adcx / adox (ADX) instructions, not via _mulx_u64 + _addcarry_u64 intrinsics. The inline-asm path exists specifically because GCC's _addcarry_u64 did not lower to ADCX/ADOX even under -madx; without the explicit mnemonics the kernel's lo-column and hi-column carry chains would serialise through a single adc chain instead of running in parallel.

The kernel also ships a dedicated squaring path that exploits the off-diagonal symmetry of (sum f_i)^2: 10 multiplications (6 cross products doubled + 4 diagonal squares) versus 16 for the full schoolbook.

Since the active kernel is already a hand-written inline-asm path using BMI2 MULX plus explicit ADX carry-chain interleave behind the CPUID gate, the remaining gap to the ~25K ops/sec reported by OpenSSL's hand-tuned crypto/ec/asm/x25519-x86_64.pl on the same microarchitecture class reflects broader implementation differences (instruction scheduling, register allocation, reduction shape, and surrounding glue), not reliance on compiler-lowered intrinsics or the absence of a hand-written asm kernel.

fe51 remains available as a fallback by building with -DAMA_X25519_FORCE_FE51; the pure-C fe64 schoolbook still runs on hosts whose CPUID lacks BMI2 + ADX (e.g. a KVM guest with the bits masked, a pre-Broadwell host, or any MSVC build; the kernel TU is GCC/Clang only).
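The 10-vs-16 multiplication count from the squaring symmetry can be checked with a toy 4-limb model on Python integers (no carries, no reduction; limb products simply accumulate into double-width columns):

```python
def square_schoolbook(f):
    """Full 4x4 schoolbook: 16 single-limb multiplications."""
    r = [0] * 7
    for i in range(4):
        for j in range(4):
            r[i + j] += f[i] * f[j]
    return r

def square_symmetric(f):
    """Exploit f_i*f_j == f_j*f_i: 6 doubled cross products + 4 squares = 10."""
    r = [0] * 7
    for i in range(4):
        for j in range(i + 1, 4):
            r[i + j] += 2 * f[i] * f[j]  # each cross term appears twice
    for i in range(4):
        r[2 * i] += f[i] * f[i]
    return r

limbs = [3, 1, 4, 1]
assert square_schoolbook(limbs) == square_symmetric(limbs)
```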

X25519 4-way batch API (3.0.0, currently opt-in)

ama_x25519_scalarmult_batch(out[], scalars[], points[], count) is a new additive API that exposes a 4-way AVX2 Montgomery-ladder kernel for batched Diffie-Hellman. ama_print_dispatch_info() reports its capability row as X25519 4-way: AVX2 (opt-in, off) whenever the host has AVX2 but the kernel pointer is unwired — the kernel does not light up automatically. Set AMA_DISPATCH_USE_X25519_AVX2=1 in the environment to opt in.

Why opt-in: on x86-64 hosts where the scalar X25519 path is fe64 + MULX/ADX (Broadwell+ Intel, Zen+ AMD), four sequential scalar ladders are faster than four lanes of the AVX2 donna-32bit ladder. The kernel uses 32-bit limbs because AVX2 lacks a 64×64→128 lane-wise multiply (that arrived with AVX-512 IFMA's VPMADD52LUQ / VPMADD52HUQ); donna-32bit's larger cross-product schedule outpaces the 4× SIMD width on Skylake-class cores. See the CHANGELOG [3.0.0] Performance entry for the per-op measurement and the full retention rationale (constant-time test lane, CI matrix coverage, fe51/gf16 fallback hosts, and the planned AVX-512 IFMA port that closes the gap).

SIMD Acceleration Paths (3.0.0)

Every SIMD path below is gated on a runtime CPUID check (with the appropriate XCR0 state-save check where the path uses ZMM or YMM registers), built with per-file ISA flags via set_source_files_properties so the rest of the library stays at the lowest-common-denominator ISA, and verified against a scalar reference for byte-identity by the matching tests/c/test_* and tests/c/test_*_equiv.c lanes. Opt-out env vars are honoured at runtime so an operator can pin to scalar without rebuilding.

| Primitive | Engineered path | Build / runtime gate | Speedup vs scalar | Reference |
|---|---|---|---|---|
| Keccak-f[1600] (SHA3, SHAKE) | AVX-512 4-way (vprolq + vpternlogq, EVEX-encoded YMM, no ZMM in the hot path) | Build: -DAMA_ENABLE_AVX512=ON (default OFF). Runtime: ama_cpuid_has_avx512_keccak() (AVX-512F + VL + BW + DQ + XCR0 5+6+7) | ~1.6× over AVX2 4-way on Sapphire Rapids; falls back cleanly to AVX2 4-way otherwise | docs/AVX512_KECCAK_ADR.md, src/c/avx512/ama_sha3_x4_avx512.c, tests/c/test_sha3_avx512_kat.c |
| Keccak-f[1600] (SHA3, SHAKE) | AVX2 4-way (Keccak-f[1600] across 4 SIMD lanes) | Build: default ON. Runtime: ama_cpuid_has_avx2() | ~3-4× over scalar Keccak | src/c/avx2/ama_sha3_x4_avx2.c |
| AES-256-GCM | VAES + VPCLMULQDQ on YMM (4 blocks per AES round, 4-way GHASH reduction) | Build: default ON. Runtime: ama_cpuid_has_vaes_aesgcm() (VAES + VPCLMULQDQ + AVX2 + XCR0). Opt-out: AMA_DISPATCH_NO_VAES=1 | ~1.5-2× at ≥4 KB messages on Ice Lake+ / Zen 4 | src/c/avx2/ama_aes_gcm_vaes.c, tests/c/test_aes_gcm_vaes_equiv.c |
| AES-256-GCM (S-box) | Bitsliced (tower field GF((2^4)^2)), constant-time default | Build: -DAMA_AES_CONSTTIME=ON (default ON). Hardware fallback also available where AES-NI is present | n/a (correctness: eliminates cache-timing channel) | src/c/ama_aes_bitsliced.c |
| ChaCha20-Poly1305 | 8-way AVX2 ChaCha20 block function (512 B keystream per kernel invocation) | Runtime: ama_cpuid_has_avx2(). Opt-out: AMA_DISPATCH_NO_CHACHA_AVX2=1 | 2.11× at 1 KB, 2.24× at 4 KB, 2.29× at 64 KB; messages < 512 B stay on scalar | src/c/avx2/ama_chacha20_x8_avx2.c, tests/c/test_chacha20poly1305.c |
| Argon2id | 4-way BlaMka G AVX2 (_mm256_mul_epu32 for the multiplication-hardened add; _mm256_permute4x64_epi64 for the diagonal pass) | Runtime: ama_cpuid_has_avx2(). Opt-out: AMA_DISPATCH_NO_ARGON2_AVX2=1 | 1.31× at m=64 KiB, 1.34× at m=1 MiB | src/c/avx2/ama_argon2_g_avx2.c, tests/c/test_argon2id.c |
| Ed25519 sign | Base-point comb table (radix-2^51 fe51 field arithmetic) | Default ON for x86-64 GCC/Clang (fe51.h) | Sign ~5× faster vs the previous scalar path on this host class | src/c/ama_ed25519.c (PR #261) |
| Ed25519 verify | Width-5 wNAF + Shamir's trick (double-scalar-mult, variable-time on public-only inputs) | Default ON for x86-64 GCC/Clang | Verify ~2× faster on this host class | src/c/ama_ed25519.c (PR #265) |
| X25519 scalar-mult | fe64 schoolbook + MULX/ADCX/ADOX in-house inline assembly (4-limb radix-2^64 with dual-carry-chain interleave) | Build: per-file -mbmi2 -madx. Runtime: ama_cpuid_has_x25519_mulx() (BMI2 + ADX). Pure-C fe64.h is the fallback | ~21% over pure-C fe64 on the local sandbox; literature 1.8-2.2× on uncontended Skylake+ / Zen+ | src/c/internal/ama_x25519_fe64_mulx.c, tests/c/test_x25519_fe64_mulx_equiv.c |
| X25519 batch-4 | AVX2 4-way Montgomery ladder (donna-32bit field, OPT-IN) | Runtime: only when AMA_DISPATCH_USE_X25519_AVX2=1 (default OFF; scalar fe64 is faster on MULX/ADX hosts) | Off by design on MULX/ADX hosts; reserved for fe51/gf16 fallback hosts and the future AVX-512 IFMA port | src/c/avx2/ama_x25519_avx2.c, tests/test_x25519_dispatch_policy.py |
| ML-DSA-65 / ML-KEM-1024 sampling | 4-way SHAKE128 / SHAKE256 across 4 SIMD lanes; CBD2 noise sampling AVX2-vectorised | Runtime: ama_cpuid_has_avx2() | Throughput-bound by the SHAKE rounds; sign / encaps ~3× faster than the scalar reference on this host | src/c/avx2/ama_*_avx2.c (PR #260) |
| Dispatch auto-tune | Best-of-5 hysteresis (10% reversion threshold) for SHA-3 SIMD vs scalar selection | Opt-out: AMA_DISPATCH_NO_AUTOTUNE=1 | Eliminates AVX2/NEON Keccak revert-to-scalar flakes on shared CI runners | src/c/dispatch/ama_dispatch.c |

Verbose dispatch table at startup: AMA_DISPATCH_VERBOSE=1 prints every selected backend (and (opt-in, off) annotations on opt-in paths that the runtime advertised but did not select) on first crypto call to stderr.
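Schematically, every backend selection above reduces to the same rule: CPUID gate first, then the operator opt-out, else scalar fallback. A Python sketch of that logic for the ChaCha20 row (the function name and the cpu_has_avx2 flag are illustrative; only the env var is the real documented knob):

```python
import os

def select_chacha20_backend(cpu_has_avx2):
    """Mirror of the dispatch rule: hardware gate, then env opt-out,
    else fall back to the scalar reference implementation."""
    opted_out = os.environ.get("AMA_DISPATCH_NO_CHACHA_AVX2") == "1"
    if cpu_has_avx2 and not opted_out:
        return "chacha20_x8_avx2"
    return "chacha20_scalar"
```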

3R Monitoring Overhead

  • Monitoring overhead: < 2% on typical workloads
  • Anomaly detection runs asynchronously in the background
  • FFT computations use NumPy for batch processing when available

Reproducing Benchmarks

```shell
# Install dependencies
pip install -e ".[dev,monitoring]"

# Build native library
cmake -B build -DAMA_USE_NATIVE_PQC=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build

# Run benchmark suite
python3 benchmark_suite.py

# Or run the regression runner
python3 benchmarks/benchmark_runner.py -v
```

Results are saved to benchmark_results.json, BENCHMARKS.md, and benchmarks/regression_results.json.


* HMAC-SHA3-256 uses the Cython binding when built (python setup.py build_ext --inplace), which calls the native C ama_hmac_sha3_256 with zero marshaling overhead; it falls back to ctypes when the extension is absent.

Why HMAC numbers look different across paths. Three measurement paths produce three different figures for the same primitive:

  • Cython microbenchmark on a 32 B message: ~250k ops/sec on this host (benchmark_suite.py "hmac_auth" column above).
  • Pure ctypes on a 1 KB message: ~130k ops/sec (benchmarks/benchmark_runner.py, recorded in benchmark-results.json; baseline 76,215).
  • Shared GitHub Actions runner under CI: ~12k ops/sec (much slower, noisier hardware). The benchmarks/baseline.json value is set for the CI host and is not a statement about the primitive's performance in general.

All three are measurements of ama_hmac_sha3_256. The right number to quote depends on which environment the reader cares about; cite the measurement command alongside the number.


Performance — canonical-host throughput vs. regression floor

The headline ops/sec figures below are the canonical-host measurements written by benchmarks/benchmark_runner.py --output benchmark-results.json (the same command CI runs in the "Benchmark Regression Detection" job) and read from benchmark-results.json by tools/update_docs.py. The Regression floor column is the value enforced by benchmarks/baseline.json; CI fails the run when measured throughput drops more than tolerance_percent below the floor. Both columns are shown so reviewers see the headline and the safety net side-by-side.
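The gate itself reduces to a one-line predicate. The sketch below is the rule as described, not the runner's actual code (passes_gate is a hypothetical name), exercised with the Hmac Sha3 256 row's numbers:

```python
def passes_gate(measured_ops, floor_ops, tolerance_percent):
    """Pass while measured throughput stays within tolerance of the floor;
    fail only when it drops more than tolerance_percent below the floor."""
    return measured_ops >= floor_ops * (1 - tolerance_percent / 100.0)

# Hmac Sha3 256: floor 19,500 ops/sec, tolerance 40% -> hard fail line at 11,700.
assert passes_gate(148_565, 19_500, 40)      # canonical-host headroom is large
assert not passes_gate(11_000, 19_500, 40)   # a regression below the fail line
```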

To refresh after a benchmark run on the canonical host:

```shell
LD_LIBRARY_PATH=build/lib python3 benchmarks/benchmark_runner.py \
    --output benchmark-results.json \
    --markdown benchmark-report.md
python3 tools/update_docs.py        # regenerates the table below
```

Headline source: benchmark-results.json (run 2026-04-27). Regression floor: benchmarks/baseline.json. CI fails when measured throughput drops more than the tolerance below the floor; both columns are shown so reviewers can sanity-check the headroom.

| Benchmark | Throughput (ops/sec) | Regression floor (ops/sec) | Tolerance | Tier |
|---|---:|---:|---:|---|
| Ama Sha3 256 Hash | 230,244 | 31,000 | ±35% | microbenchmark |
| Hmac Sha3 256 | 148,565 | 19,500 | ±40% | microbenchmark |
| Ed25519 Keygen | 48,134 | 10,560 | ±35% | microbenchmark |
| Ed25519 Sign | 51,046 | 10,430 | ±35% | microbenchmark |
| Ed25519 Verify | 21,097 | 5,113 | ±35% | microbenchmark |
| Hkdf Derive | 95,433 | 12,500 | ±35% | microbenchmark |
| Full Package Create | 3,813.1 | 200 | ±70% | complex_operation |
| Full Package Verify | 4,055.4 | 700 | ±50% | complex_operation |
| Dilithium Keygen (optional) | 3,331.0 | 1,943 | ±40% | microbenchmark |
| Dilithium Sign (optional) | 1,103.7 | 130 | ±50% | microbenchmark |
| Dilithium Verify (optional) | 7,215.7 | 900 | ±40% | microbenchmark |
| Kyber Keygen (optional) | 5,346.1 | 2,200 | ±40% | microbenchmark |
| Kyber Encapsulate (optional) | 11,688 | 2,400 | ±40% | microbenchmark |
| Aes 256 Gcm Encrypt (optional) | 276,778 | 150,000 | ±40% | microbenchmark |
| Chacha20Poly1305 Encrypt (optional) | 215,256 | 32,000 | ±40% | microbenchmark |
| X25519 Scalarmult (optional) | 17,560 | 13,000 | ±40% | microbenchmark |
| X25519 Scalarmult Batch4 (optional) | 4,112.2 | 2,600 | ±40% | microbenchmark |

See Cryptography Algorithms for algorithm key sizes, or Architecture for the multi-language performance architecture.


Standards Compliance Note

This library implements algorithms specified in FIPS 203 (ML-KEM), FIPS 204 (ML-DSA), FIPS 205 (SLH-DSA), and FIPS 202 (SHA-3). This implementation has NOT been submitted for CMVP validation and is NOT FIPS 140-3 certified. See CSRC_STANDARDS.md for detailed compliance status.