asm_chacha - xero/leviathan-crypto GitHub Wiki

ChaCha20/Poly1305 WASM Reference

This low-level reference details the ChaCha20 AssemblyScript source and WASM exports, intended for those auditing, contributing to, or building against the raw module. Most consumers should instead use the TypeScript wrapper or the higher-level AEAD classes.

Overview

This module implements the full ChaCha20-Poly1305 AEAD family in one WASM binary with shared linear memory.

ChaCha20 (RFC 8439 §2.3-2.4) is a stream cipher. 256-bit key, 96-bit nonce, 32-bit block counter, 20 rounds (10 double rounds alternating column and diagonal quarter-rounds).

Poly1305 (RFC 8439 §2.5) is a one-time MAC. It authenticates messages of arbitrary length using a 256-bit one-time key (r || s). Internally it uses radix-2^26 representation with u64 limbs to avoid overflow during multiplication.

ChaCha20-Poly1305 AEAD (RFC 8439 §2.8) combines the two. The TypeScript layer orchestrates the composition by calling chachaGenPolyKey, chachaEncryptChunk, and polyInit/polyUpdate/polyFinal in sequence to produce authenticated ciphertext. The WASM module exports the primitives; TypeScript drives the construction.
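As a rough illustration of what the TypeScript layer has to authenticate, the sketch below builds the Poly1305 input defined by RFC 8439 §2.8 (AAD, zero padding to 16 bytes, ciphertext, zero padding, then both lengths as 64-bit little-endian integers). The helper names `pad16Length` and `buildMacInput` are hypothetical, and a real wrapper would more likely stream these pieces through polyUpdate than allocate one buffer.

```typescript
// Hypothetical helpers; they only illustrate the RFC 8439 §2.8 MAC input layout:
//   AAD || pad16 || ciphertext || pad16 || len(AAD) u64 LE || len(CT) u64 LE
function pad16Length(n: number): number {
  return (16 - (n % 16)) % 16;
}

function buildMacInput(aad: Uint8Array, ciphertext: Uint8Array): Uint8Array {
  const ctStart = aad.length + pad16Length(aad.length);
  const lenStart = ctStart + ciphertext.length + pad16Length(ciphertext.length);
  const out = new Uint8Array(lenStart + 16); // zero-filled, so padding is already present
  const view = new DataView(out.buffer);

  out.set(aad, 0);
  out.set(ciphertext, ctStart);

  // Lengths are 64-bit little-endian. Only the low 32 bits are written here,
  // which is enough for inputs smaller than 4 GiB.
  view.setUint32(lenStart, aad.length, true);
  view.setUint32(lenStart + 8, ciphertext.length, true);
  return out;
}
```

With a 12-byte AAD and a 114-byte ciphertext (the sizes used in the RFC's AEAD example), the result is 160 bytes long.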

HChaCha20 (draft-irtf-cfrg-xchacha §2.1) derives a 256-bit subkey from a key and 128-bit nonce prefix. XChaCha20 uses this to extend the nonce space to 192 bits, making random nonce generation practical for large message volumes.

All cryptography runs in WASM. The TypeScript layer writes inputs to linear memory, calls WASM exports, and reads outputs. It implements no algorithm logic.


Security Notes

Constant-time by construction. ChaCha20 uses only ARX operations (add, rotate, XOR). No lookups, no secret-dependent branches, no variable-time arithmetic. This design is why TLS 1.3 adopted ChaCha20 as its non-AES cipher—it's inherently resistant to cache-timing side channels.

Poly1305 accumulator arithmetic. Poly1305 uses radix-2^26 limbs stored in u64 words. Schoolbook multiplication over five limbs with reduction modulo p = 2^130 - 5. The u64 intermediate products avoid overflow without needing multi-precision carries. The final reduction in polyFinal uses constant-time conditional select (mask-and-OR) to choose between h and h - p, avoiding branches on secret values.

Nonce reuse is catastrophic. Reusing a (key, nonce) pair with standard ChaCha20 (96-bit nonce) leaks the XOR of two plaintexts and completely breaks Poly1305 authentication. With random 96-bit nonces, the birthday bound makes a collision likely after approximately 2^48 messages under the same key, and the risk is already non-negligible well before that point. If you require random nonces, use XChaCha20-Poly1305 instead.
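To put numbers on the birthday bound, here is a small sketch using the standard approximation p ≈ 1 - exp(-n^2 / 2^(b+1)) for n random b-bit nonces. The function name is hypothetical and the result is an estimate, not a guarantee.

```typescript
// Approximate probability that at least two of 2^log2Messages random
// nonces of `nonceBits` bits collide (birthday approximation).
function nonceCollisionProbability(log2Messages: number, nonceBits: number): number {
  const x = Math.pow(2, 2 * log2Messages - (nonceBits + 1)); // n^2 / 2^(bits+1)
  // For tiny x, 1 - exp(-x) loses precision in floating point; x itself is
  // the first-order approximation.
  return x < 1e-8 ? x : 1 - Math.exp(-x);
}

const p96AtBound = nonceCollisionProbability(48, 96);   // roughly 0.39
const p96Modest = nonceCollisionProbability(32, 96);    // roughly 2^-33
const p192Same = nonceCollisionProbability(48, 192);    // roughly 2^-97
```

The comparison between the 96-bit and 192-bit cases at the same message count is the practical argument for XChaCha20 when nonces are random.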

XChaCha20 extends the nonce to 192 bits. HChaCha20 derives a per-message subkey from the first 128 bits of a 192-bit nonce, then ChaCha20 encrypts with the remaining 64 bits (zero-padded to 96 bits). The birthday bound for random 192-bit nonces is about 2^96 messages, so collisions are not a practical concern at any realistic message volume.

wipeBuffers() zeroes all buffer regions. Every buffer in the module (keys, nonces, counters, keystream blocks, the ChaCha20 state which contains a key copy in words 4-11, Poly1305 internal state h, r, 5*r, s, chunk buffers, and XChaCha20 subkey material) gets overwritten with zeros. The TypeScript dispose() method must call this unconditionally. Key material and intermediate state must not persist in WASM memory after an operation completes.

Bare ChaCha20 is unauthenticated. The functions chachaEncryptChunk and chachaDecryptChunk provide confidentiality only. Without Poly1305 authentication, ciphertext is malleable—an attacker can flip plaintext bits by flipping ciphertext bits. Always use ChaCha20-Poly1305 AEAD or pair bare ChaCha20 with HMAC in an Encrypt-then-MAC construction.
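The malleability problem can be shown without any real cipher, because it follows from the XOR structure alone. In this sketch the keystream bytes are an arbitrary stand-in (not ChaCha20 output) and the helper names are hypothetical.

```typescript
function asciiBytes(s: string): Uint8Array {
  const out = new Uint8Array(s.length);
  for (let i = 0; i < s.length; i++) out[i] = s.charCodeAt(i) & 0xff;
  return out;
}

function asciiString(b: Uint8Array): string {
  let s = "";
  for (let i = 0; i < b.length; i++) s += String.fromCharCode(b[i]);
  return s;
}

function xorBytes(a: Uint8Array, b: Uint8Array): Uint8Array {
  const out = new Uint8Array(a.length);
  for (let i = 0; i < a.length; i++) out[i] = a[i] ^ b[i];
  return out;
}

// Stand-in keystream; in the real module it would come from the ChaCha20 block function.
const keystream = Uint8Array.from([0x3a, 0x91, 0x5c, 0x07, 0xe2, 0x48]);
const ciphertext = xorBytes(asciiBytes("PAY 10"), keystream);

// The attacker does not know the keystream, only which bit to flip:
// '1' is 0x31 and '9' is 0x39, so flipping bit 3 of byte 4 changes the amount.
const tampered = ciphertext.slice();
tampered[4] ^= 0x08;

const recovered = asciiString(xorBytes(tampered, keystream)); // "PAY 90"
```

A Poly1305 tag over the ciphertext would cause the tampered message to be rejected before decryption.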

See ChaCha20-Poly1305 implementation audit for algorithm correctness verifications.


API Reference

Buffer Offset Getters

These functions return fixed i32 offsets into linear memory. The TypeScript layer uses them to determine where to write inputs and read outputs.

| Function | Returns | Description |
| --- | --- | --- |
| getModuleId(): i32 | 1 | Unique module identifier |
| getKeyOffset(): i32 | 0 | 256-bit ChaCha20 key (32 bytes) |
| getChachaNonceOffset(): i32 | 32 | 96-bit nonce (12 bytes, 3 x u32 LE) |
| getChachaCtrOffset(): i32 | 44 | Block counter (u32) |
| getChachaBlockOffset(): i32 | 48 | Keystream block output (64 bytes) |
| getChachaStateOffset(): i32 | 112 | 16 x u32 initial state (64 bytes) |
| getChunkPtOffset(): i32 | 176 | Plaintext chunk buffer (64 KB) |
| getChunkCtOffset(): i32 | 65712 | Ciphertext chunk buffer (64 KB) |
| getChunkSize(): i32 | 65536 | Max chunk size in bytes |
| getPolyKeyOffset(): i32 | 131248 | Poly1305 one-time key r\|\|s (32 bytes) |
| getPolyMsgOffset(): i32 | 131280 | Message staging buffer (64 bytes) |
| getPolyBufOffset(): i32 | 131344 | Partial-block accumulator (16 bytes) |
| getPolyBufLenOffset(): i32 | 131360 | Bytes in partial block (u32) |
| getPolyTagOffset(): i32 | 131364 | Output MAC tag (16 bytes) |
| getPolyHOffset(): i32 | 131380 | Accumulator h (5 x u64, 40 bytes) |
| getPolyROffset(): i32 | 131420 | Clamped r (5 x u64, 40 bytes) |
| getPolyRsOffset(): i32 | 131460 | Precomputed 5*r[1..4] (4 x u64, 32 bytes) |
| getPolySOffset(): i32 | 131492 | s pad (4 x u32, 16 bytes) |
| getXChaChaNonceOffset(): i32 | 131508 | Full 24-byte XChaCha20 nonce |
| getXChaChaSubkeyOffset(): i32 | 131532 | HChaCha20 output subkey (32 bytes) |
| getChachaSimdWorkOffset(): i32 | 131568 | 4-wide inter-block SIMD work buffer (256 bytes) |
| getMemoryPages(): i32 | (runtime) | Current WASM linear memory size in pages |

ChaCha20 Functions

chachaLoadKey(): void

Builds the 16-word ChaCha20 state matrix from the current contents of the key, nonce, and counter buffers (RFC 8439 §2.3):

State layout (16 x u32):
	words  0-3:   constants ("expand 32-byte k")
	words  4-11:  key (from KEY_OFFSET, 8 x u32 LE)
	word   12:    counter (from CHACHA_CTR_OFFSET)
	words  13-15: nonce (from CHACHA_NONCE_OFFSET, 3 x u32 LE)

Precondition: Write the 32-byte key to KEY_OFFSET, the 12-byte nonce to CHACHA_NONCE_OFFSET, and the 4-byte counter to CHACHA_CTR_OFFSET before calling.
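For readers who want to cross-check the state layout outside WASM, the following sketch assembles the same 16 words in plain TypeScript. `buildChaChaState` is a hypothetical reference helper, not a function of this module; the constants are the little-endian words of "expand 32-byte k" from RFC 8439.

```typescript
// Reference-only model of the state that chachaLoadKey() is described as building.
function buildChaChaState(key: Uint8Array, nonce: Uint8Array, counter: number): Uint32Array {
  if (key.length !== 32) throw new RangeError("key must be 32 bytes");
  if (nonce.length !== 12) throw new RangeError("nonce must be 12 bytes");

  const state = new Uint32Array(16);
  // words 0-3: "expand 32-byte k" as four little-endian u32 values
  state[0] = 0x61707865;
  state[1] = 0x3320646e;
  state[2] = 0x79622d32;
  state[3] = 0x6b206574;

  // words 4-11: key, 8 x u32 little-endian
  const k = new DataView(key.buffer, key.byteOffset, key.byteLength);
  for (let i = 0; i < 8; i++) state[4 + i] = k.getUint32(i * 4, true);

  // word 12: block counter
  state[12] = counter >>> 0;

  // words 13-15: nonce, 3 x u32 little-endian
  const n = new DataView(nonce.buffer, nonce.byteOffset, nonce.byteLength);
  for (let i = 0; i < 3; i++) state[13 + i] = n.getUint32(i * 4, true);

  return state;
}
```

Feeding it the key 00..1f, the nonce 00 00 00 09 00 00 00 4a 00 00 00 00, and counter 1 reproduces the initial state printed in RFC 8439 §2.3.2.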


chachaSetCounter(ctr: u32): void

Sets both the counter buffer (CHACHA_CTR_OFFSET) and word 12 of the state matrix to ctr. Use this to seek to an arbitrary block position within a stream.


chachaResetCounter(): void

Resets the counter to 1 (the standard initial counter value for encryption per RFC 8439 §2.4). Calls chachaSetCounter(1) internally.


chachaEncryptChunk(len: i32): i32

Encrypts len bytes of plaintext from CHUNK_PT_OFFSET into ciphertext at CHUNK_CT_OFFSET. Processes data in 64-byte keystream blocks. The block counter auto-increments after each block.

  • Input: len bytes at CHUNK_PT_OFFSET (1 <= len <= 65536)
  • Output: len bytes at CHUNK_CT_OFFSET
  • Returns: len on success, -1 if len is out of range
  • Side effect: Block counter advances by ceil(len / 64) blocks. Both the state matrix (word 12) and CHACHA_CTR_OFFSET are updated.

Precondition: Call chachaLoadKey() first to initialize the state matrix.
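A caller-side loop over the chunk buffers might look like the sketch below. The `ChaChaExports` interface is a hypothetical typing of a subset of the exports documented here, `mem` is assumed to be a Uint8Array view over the module's linear memory, and the sketch assumes chachaLoadKey() has already been called.

```typescript
// Hypothetical typing for the exports used in this sketch.
interface ChaChaExports {
  getChunkPtOffset(): number;
  getChunkCtOffset(): number;
  getChunkSize(): number;
  chachaEncryptChunk(len: number): number;
}

// Encrypts `input` by staging it through the fixed chunk buffers.
// Every chunk except possibly the last is a full 65536 bytes (a multiple of
// 64), so the block counter stays aligned with the byte position.
function encryptStream(x: ChaChaExports, mem: Uint8Array, input: Uint8Array): Uint8Array {
  const pt = x.getChunkPtOffset();
  const ct = x.getChunkCtOffset();
  const max = x.getChunkSize();
  const out = new Uint8Array(input.length);

  for (let pos = 0; pos < input.length; pos += max) {
    const n = Math.min(max, input.length - pos);
    mem.set(input.subarray(pos, pos + n), pt);
    if (x.chachaEncryptChunk(n) !== n) {
      throw new Error("chachaEncryptChunk rejected length " + n);
    }
    out.set(mem.subarray(ct, ct + n), pos);
  }
  return out;
}
```

Checking the return value against the requested length catches the -1 error case without a separate range check in the caller.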


chachaDecryptChunk(len: i32): i32

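Alias for chachaEncryptChunk. ChaCha20 is a stream cipher; encryption and decryption are identical (XOR with keystream). The buffer roles do not swap: the function reads its input from CHUNK_PT_OFFSET and writes its output to CHUNK_CT_OFFSET, so when decrypting, the caller places ciphertext at CHUNK_PT_OFFSET and reads the recovered plaintext from CHUNK_CT_OFFSET.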


chachaEncryptChunk_simd(len: i32): i32

4-wide inter-block SIMD variant of chachaEncryptChunk. Processes four independent 64-byte keystream blocks simultaneously using WebAssembly v128 operations. Each v128 register lane holds word w from a different block (counter values ctr, ctr+1, ctr+2, ctr+3). For inputs >= 256 bytes, this produces 2-3× higher throughput than the scalar path on JIT-warmed V8 and SpiderMonkey runtimes.

  • Input: len bytes at CHUNK_PT_OFFSET (1 <= len <= 65536)
  • Output: len bytes at CHUNK_CT_OFFSET
  • Returns: len on success, -1 if len is out of range
  • Side effect: Block counter advances by ceil(len / 64) blocks. Both the state matrix (word 12) and CHACHA_CTR_OFFSET are updated.

SIMD inner loop (entered when processed + 256 <= len):

  1. Loads words 0-11, 13-15 from CHACHA_STATE_OFFSET as i32x4.splat vectors (all four lanes identical).
  2. Constructs the counter vector r12 = [ctr, ctr+1, ctr+2, ctr+3] using i32x4.replace_lane.
  3. Applies 10 double rounds (column + diagonal quarter-rounds) entirely in v128 registers using v128.add<i32>, v128.xor, i32x4.shl, i32x4.shr_u.
  4. Adds back the initial state words. Word 12 is reconstructed from the ctr parameter, which avoids storing an extra v128 for the initial counter vector.
  5. Deinterleaves via i32x4.extract_lane (64 stores) to write all four blocks sequentially into CHACHA_SIMD_WORK_OFFSET.
  6. XORs 256 bytes of plaintext using 16 v128.xor + v128.load/v128.store instructions. All three buffer addresses (PT, CT, work) are 16-byte aligned.
  7. Advances counter by 4, writes to both CHACHA_STATE_OFFSET + 48 and CHACHA_CTR_OFFSET, increments processed by 256.
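The lane layout and the deinterleave step can be modelled in scalar TypeScript. This is only a data-movement sketch under the assumption that each "register" is a 4-element Uint32Array; it performs no SIMD and the function names are hypothetical.

```typescript
// Models the counter vector r12 = [ctr, ctr+1, ctr+2, ctr+3].
function counterLanes(ctr: number): Uint32Array {
  return Uint32Array.from([ctr, ctr + 1, ctr + 2, ctr + 3]); // wraps mod 2^32 on store
}

// Emulates a 32-bit rotate built from shl, shr_u, and or, as required when
// the instruction set has no native lane rotate.
function rotlViaShifts(x: number, n: number): number {
  return ((x << n) | (x >>> (32 - n))) >>> 0;
}

// lanes[w][b] holds word w of block b. The output places the four blocks
// back to back: block b occupies words b*16 .. b*16+15 (256 bytes in total).
function deinterleave(lanes: Uint32Array[]): Uint32Array {
  const out = new Uint32Array(64);
  for (let w = 0; w < 16; w++) {
    for (let b = 0; b < 4; b++) {
      out[b * 16 + w] = lanes[w][b];
    }
  }
  return out;
}
```

The 16 x 4 nested loop corresponds to the 64 extract-and-store operations mentioned in step 5.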

Scalar tail (entered when fewer than 256 bytes remain):

Falls back to the scalar block function (computeBlock_scalar) for each remaining 64-byte block. The counter continues from where the SIMD loop left off. Partial final blocks are handled byte-by-byte.

Precondition: Call chachaLoadKey() first to initialize the state matrix. This function requires the WASM binary to be compiled with --enable simd.


chachaDecryptChunk_simd(len: i32): i32

Alias for chachaEncryptChunk_simd. ChaCha20 is a stream cipher; encryption and decryption are identical. Reads from CHUNK_PT_OFFSET, writes to CHUNK_CT_OFFSET. Provided for API symmetry; the call is forwarded directly.


chachaGenPolyKey(): void

Generates the one-time Poly1305 key by running ChaCha20 with counter = 0 (RFC 8439 §2.6). Sets state word 12 to 0, generates one keystream block, and copies the first 32 bytes to POLY_KEY_OFFSET.

Precondition: The state matrix must already contain the correct key and nonce (call chachaLoadKey() first). The counter value in the state is overwritten to 0 for this operation.

Important

This consumes block 0. After calling chachaGenPolyKey(), set the counter to 1 before encrypting plaintext to avoid keystream reuse.


hchacha20(): void

HChaCha20 subkey derivation (draft-irtf-cfrg-xchacha §2.1). Computes a 256-bit subkey from the key at KEY_OFFSET and the first 16 bytes of the nonce at XCHACHA_NONCE_OFFSET.

The state is initialized as:

	words  0-3:   constants ("expand 32-byte k")
	words  4-11:  key (from KEY_OFFSET)
	words  12-15: first 16 bytes of XChaCha20 nonce (from XCHACHA_NONCE_OFFSET)

After 10 double rounds, the output is words 0-3 and 12-15 of the working state (NOT added back to the initial state; this is the key difference from the standard ChaCha20 block function). The 32-byte result is written to XCHACHA_SUBKEY_OFFSET.

Usage in XChaCha20: The TypeScript wrapper calls hchacha20() to derive the subkey, copies it to KEY_OFFSET, constructs the inner 96-bit nonce from bytes 16-23 of the original 24-byte nonce (zero-padded to 12 bytes), then proceeds with standard ChaCha20-Poly1305.
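The nonce split can be sketched as follows. The XChaCha20 draft places the four zero bytes in front of the remaining eight nonce bytes; that the wrapper pads on the same side is an assumption here, and `splitXChaChaNonce` is a hypothetical helper name.

```typescript
// Splits a 24-byte XChaCha20 nonce into the 16-byte HChaCha20 input and the
// 12-byte inner ChaCha20 nonce (4 zero bytes followed by nonce bytes 16-23).
function splitXChaChaNonce(nonce: Uint8Array): { hchachaInput: Uint8Array; innerNonce: Uint8Array } {
  if (nonce.length !== 24) throw new RangeError("XChaCha20 nonce must be 24 bytes");
  const hchachaInput = nonce.slice(0, 16);
  const innerNonce = new Uint8Array(12); // bytes 0-3 remain zero
  innerNonce.set(nonce.subarray(16, 24), 4);
  return { hchachaInput, innerNonce };
}
```

Only `hchachaInput` is consumed by hchacha20(); `innerNonce` would then be written to the regular 96-bit nonce buffer alongside the derived subkey.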


Poly1305 Functions

polyInit(): void

Initializes Poly1305 state from the 32-byte one-time key at POLY_KEY_OFFSET (RFC 8439 §2.5).

  1. Clamps r (first 16 bytes): clears bits 4,5,6,7 of bytes 3,7,11,15 and bits 0,1 of bytes 4,8,12. This restricts r to the required form for Poly1305 security.
  2. Decomposes r into 5 radix-2^26 limbs stored at POLY_R_OFFSET.
  3. Precomputes 5*r[1..4] at POLY_RS_OFFSET (used in the multiplication step for modular reduction).
  4. Copies s (bytes 16-31 of the key) to POLY_S_OFFSET.
  5. Zeroes the accumulator h, partial-block buffer, and partial-block length.

Precondition: Write the 32-byte one-time key to POLY_KEY_OFFSET. For AEAD, this is produced by chachaGenPolyKey().

Warning

polyInit() clamps r in-place at POLY_KEY_OFFSET. The first 16 bytes of the key buffer are modified.
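The clamping rule from step 1 is small enough to restate in TypeScript. Unlike polyInit(), this hypothetical `clampR` helper returns a copy rather than modifying its input, which makes it convenient for cross-checking against test vectors.

```typescript
// Clears the top four bits of bytes 3, 7, 11, 15 and the bottom two bits of
// bytes 4, 8, 12, as required for the r half of a Poly1305 key.
function clampR(r: Uint8Array): Uint8Array {
  if (r.length !== 16) throw new RangeError("r must be 16 bytes");
  const out = r.slice();
  for (const i of [3, 7, 11, 15]) out[i] &= 0x0f;
  for (const i of [4, 8, 12]) out[i] &= 0xfc;
  return out;
}
```

Applied to the r bytes 85 d6 be 78 57 55 6d 33 7f 44 52 fe 42 d5 06 a8 from the RFC 8439 §2.5.2 example, it yields 85 d6 be 08 54 55 6d 03 7c 44 52 0e 40 d5 06 08.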


polyUpdate(len: i32): void

Feeds len bytes from POLY_MSG_OFFSET into the Poly1305 accumulator.

  • Handles partial blocks: data shorter than 16 bytes is buffered at POLY_BUF_OFFSET. When the buffer reaches 16 bytes, it is absorbed.
  • Full 16-byte blocks are absorbed directly from POLY_MSG_OFFSET.
  • Full blocks set the high bit (2^128) before absorption. A partial block is never absorbed by polyUpdate; it stays buffered until polyFinal pads it with a 0x01 byte, which takes the place of the high bit.
  • Input: len bytes at POLY_MSG_OFFSET (max 64 bytes per call, matching the staging buffer size)
  • If len <= 0, returns immediately (no-op).

Can be called multiple times to process a message incrementally.
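Because the staging buffer holds at most 64 bytes, a caller has to slice longer messages. The sketch below shows one way to do that; `PolyExports` is a hypothetical typing of the two exports involved, and `mem` is assumed to be a Uint8Array view over linear memory.

```typescript
// Hypothetical typing for the exports used in this sketch.
interface PolyExports {
  getPolyMsgOffset(): number;
  polyUpdate(len: number): void;
}

// Feeds `data` to Poly1305 in pieces of at most 64 bytes.
function polyFeed(x: PolyExports, mem: Uint8Array, data: Uint8Array): void {
  const msg = x.getPolyMsgOffset();
  for (let pos = 0; pos < data.length; pos += 64) {
    const n = Math.min(64, data.length - pos);
    mem.set(data.subarray(pos, pos + n), msg);
    x.polyUpdate(n);
  }
}
```

A 150-byte message results in three calls with lengths 64, 64, and 22; the trailing 6 bytes of the last call end up in the partial-block buffer until more data or polyFinal() arrives.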


polyFinal(): void

Finalizes the Poly1305 tag and writes it to POLY_TAG_OFFSET (16 bytes).

  1. If there is a partial block in the buffer, pads it with a 0x01 byte followed by zeros and absorbs it with hibit = 0 (RFC 8439 §2.5.1).
  2. Performs a full carry chain on h to normalize all limbs.
  3. Computes the conditional subtraction: if h >= p, reduces to h - p. Uses a constant-time mask-and-select (no branching on secret values).
  4. Recombines the 5 limbs into two u64 halves (lo, hi).
  5. Adds the s pad: tag = (h + s) mod 2^128.
  6. Stores the 16-byte tag at POLY_TAG_OFFSET in little-endian.
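The mask-and-select idea in step 3 can be illustrated on 32-bit values. This is a sketch of the technique only: the module is described as working on u64 limbs inside WASM, the function names here are hypothetical, and JavaScript engines make no constant-time guarantees for code like this.

```typescript
// Expands a single bit into a full-width mask: 1 -> 0xffffffff, 0 -> 0.
function maskFromBit(bit: number): number {
  return (-(bit & 1)) >>> 0;
}

// Returns `a` when mask is 0xffffffff and `b` when mask is 0, without
// branching on the mask value.
function ctSelect32(mask: number, a: number, b: number): number {
  return ((a & mask) | (b & ~mask)) >>> 0;
}

// Example: pick (h - p) when the subtraction did not borrow, else keep h.
const hLimb = 0x12345678;
const hMinusPLimb = 0xdeadbeef;
const noBorrow = 1; // would be derived from the carry chain in a real reduction
const chosen = ctSelect32(maskFromBit(noBorrow), hMinusPLimb, hLimb);
```

Both candidate values are always computed; only the mask decides which one survives, so the instruction sequence is the same for either outcome.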

Wipe Function

wipeBuffers(): void

Zeroes every buffer region in the module via memory.fill(). Covers:

  • ChaCha20: key (32B), nonce (12B), counter (4B), keystream block (64B), state matrix (64B)
  • Chunk buffers: plaintext (64KB), ciphertext (64KB)
  • Poly1305: one-time key (32B), message staging (64B), partial block (16B), partial block length (4B), tag (16B), accumulator h (40B), clamped r (40B), precomputed 5*r (32B), s pad (16B)
  • XChaCha20: nonce (24B), subkey (32B)
  • SIMD work buffer (256B)

Must be called by the TypeScript dispose() method to prevent key material from persisting in WASM linear memory.


Buffer Layout

All offsets are byte offsets from the start of linear memory (offset 0). The module's total memory footprint is 131,824 bytes, which exceeds two 64 KB pages (131,072 bytes) and fits within three (196,608 bytes = 192 KB).

| Offset | Size (bytes) | Name | Description |
| --- | --- | --- | --- |
| 0 | 32 | KEY_BUFFER | ChaCha20 256-bit key |
| 32 | 12 | CHACHA_NONCE_BUFFER | 96-bit nonce (3 x u32, LE) |
| 44 | 4 | CHACHA_CTR_BUFFER | u32 block counter |
| 48 | 64 | CHACHA_BLOCK_BUFFER | 64-byte keystream block output |
| 112 | 64 | CHACHA_STATE_BUFFER | 16 x u32 initial state matrix |
| 176 | 65,536 | CHUNK_PT_BUFFER | Streaming plaintext input |
| 65,712 | 65,536 | CHUNK_CT_BUFFER | Streaming ciphertext output |
| 131,248 | 32 | POLY_KEY_BUFFER | One-time Poly1305 key (r \|\| s) |
| 131,280 | 64 | POLY_MSG_BUFFER | Message staging (<= 64 bytes per polyUpdate) |
| 131,344 | 16 | POLY_BUF_BUFFER | Partial-block accumulator |
| 131,360 | 4 | POLY_BUF_LEN_BUFFER | Bytes in partial block (u32) |
| 131,364 | 16 | POLY_TAG_BUFFER | 16-byte output MAC tag |
| 131,380 | 40 | POLY_H_BUFFER | Accumulator h (5 x u64 limbs) |
| 131,420 | 40 | POLY_R_BUFFER | Clamped r (5 x u64 limbs) |
| 131,460 | 32 | POLY_RS_BUFFER | Precomputed 5*r[1..4] (4 x u64) |
| 131,492 | 16 | POLY_S_BUFFER | s pad (4 x u32) |
| 131,508 | 24 | XCHACHA_NONCE_BUFFER | Full 24-byte XChaCha20 nonce |
| 131,532 | 32 | XCHACHA_SUBKEY_BUFFER | HChaCha20 output subkey |
| 131,564 | 4 | (alignment padding) | Pad to 16-byte boundary (131564 % 16 = 12) |
| 131,568 | 256 | CHACHA_SIMD_WORK_BUFFER | 4-wide inter-block SIMD work buffer |
| 131,824 | | END | Total < 196,608 (3 pages) |
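The layout is a plain running sum of region sizes, which makes it easy to re-derive. The sketch below recomputes every offset from the sizes listed in this section; the names `REGION_SIZES` and `computeLayout` are hypothetical and exist only for this check.

```typescript
// Region sizes in declaration order, taken from the buffer layout in this section.
const REGION_SIZES: Array<[string, number]> = [
  ["KEY_BUFFER", 32],
  ["CHACHA_NONCE_BUFFER", 12],
  ["CHACHA_CTR_BUFFER", 4],
  ["CHACHA_BLOCK_BUFFER", 64],
  ["CHACHA_STATE_BUFFER", 64],
  ["CHUNK_PT_BUFFER", 65536],
  ["CHUNK_CT_BUFFER", 65536],
  ["POLY_KEY_BUFFER", 32],
  ["POLY_MSG_BUFFER", 64],
  ["POLY_BUF_BUFFER", 16],
  ["POLY_BUF_LEN_BUFFER", 4],
  ["POLY_TAG_BUFFER", 16],
  ["POLY_H_BUFFER", 40],
  ["POLY_R_BUFFER", 40],
  ["POLY_RS_BUFFER", 32],
  ["POLY_S_BUFFER", 16],
  ["XCHACHA_NONCE_BUFFER", 24],
  ["XCHACHA_SUBKEY_BUFFER", 32],
  ["ALIGNMENT_PADDING", 4],
  ["CHACHA_SIMD_WORK_BUFFER", 256],
];

// Assigns each region the running total of the sizes before it.
function computeLayout(regions: Array<[string, number]>): { offsets: Record<string, number>; end: number } {
  const offsets: Record<string, number> = {};
  let cursor = 0;
  for (const [name, size] of regions) {
    offsets[name] = cursor;
    cursor += size;
  }
  return { offsets, end: cursor };
}

const layout = computeLayout(REGION_SIZES);
const pagesNeeded = Math.ceil(layout.end / 65536); // 64 KB WASM pages
```

A check like this could also be compared against the values returned by the offset getters at runtime to detect a stale hardcoded table.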

Internal Architecture

The module is composed of five source files compiled into a single chacha20.wasm binary (built with --enable simd):

buffers.ts: Static Memory Layout

Defines all buffer offsets as i32 constants starting at offset 0. Exports getter functions for each offset so the TypeScript layer can query them at runtime without hardcoding addresses. Also exports getModuleId() (returns 1) and getMemoryPages() (returns memory.size()).

No dynamic allocation. No memory.grow(). The layout is fixed at compile time.

chacha20.ts: ChaCha20 Stream Cipher + HChaCha20

Implements from RFC 8439 directly:

rotl32. Left rotation (inlined). ChaCha20 uses left rotation exclusively, unlike some other ARX constructions.

qr (quarter-round, RFC 8439 §2.1). The fundamental ChaCha20 operation. Four ARX steps with rotations of 16, 12, 8, 7 bits. Operates on four u32 words at computed offsets in the state buffer.

doubleRound. One column round (indices 0,4,8,12 / 1,5,9,13 / 2,6,10,14 / 3,7,11,15) followed by one diagonal round (0,5,10,15 / 1,6,11,12 / 2,7,8,13 / 3,4,9,14). Applied 10 times for 20 total rounds.

block. The ChaCha20 block function (RFC 8439 §2.3). Copies state to the block buffer, applies 10 double rounds, then adds the original state back word-by-word. Produces one 64-byte keystream block.
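The pieces described so far (rotation, quarter-round, double round, and the block function) fit together as in the following reference sketch. It is written in plain TypeScript over a Uint32Array rather than the module's fixed memory offsets, so the function signatures are assumptions and not the AssemblyScript source.

```typescript
function rotl32(x: number, n: number): number {
  return ((x << n) | (x >>> (32 - n))) >>> 0;
}

// One quarter-round on state words a, b, c, d (additions wrap mod 2^32
// because the values are stored back into a Uint32Array).
function quarterRound(s: Uint32Array, a: number, b: number, c: number, d: number): void {
  s[a] += s[b]; s[d] = rotl32(s[d] ^ s[a], 16);
  s[c] += s[d]; s[b] = rotl32(s[b] ^ s[c], 12);
  s[a] += s[b]; s[d] = rotl32(s[d] ^ s[a], 8);
  s[c] += s[d]; s[b] = rotl32(s[b] ^ s[c], 7);
}

// Column round followed by diagonal round.
function doubleRound(s: Uint32Array): void {
  quarterRound(s, 0, 4, 8, 12);
  quarterRound(s, 1, 5, 9, 13);
  quarterRound(s, 2, 6, 10, 14);
  quarterRound(s, 3, 7, 11, 15);
  quarterRound(s, 0, 5, 10, 15);
  quarterRound(s, 1, 6, 11, 12);
  quarterRound(s, 2, 7, 8, 13);
  quarterRound(s, 3, 4, 9, 14);
}

// Copies the state, applies 10 double rounds, then adds the original state back.
function chachaBlock(initial: Uint32Array): Uint32Array {
  const working = initial.slice();
  for (let i = 0; i < 10; i++) doubleRound(working);
  for (let i = 0; i < 16; i++) working[i] += initial[i];
  return working;
}
```

Dropping the final add-back loop and keeping only words 0-3 and 12-15 gives the HChaCha20 variant described for hchacha20.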

chachaEncryptChunk. Streaming encryption. Iterates over the plaintext in 64-byte blocks, XORing each byte with the corresponding keystream byte. Auto-increments the counter after each block.

chachaGenPolyKey. Generates the one-time Poly1305 key by running ChaCha20 with counter = 0 (RFC 8439 §2.6).

hchacha20. HChaCha20 (draft-irtf-cfrg-xchacha §2.1). Same as the block function but (a) uses the first 16 bytes of the XChaCha20 nonce as words 12-15 instead of counter + nonce, and (b) outputs words 0-3 and 12-15 of the post-round state WITHOUT adding back the initial state.

poly1305.ts: Poly1305 MAC

Implements from RFC 8439 §2.5:

Radix-2^26 representation. Both r and h are stored as 5 limbs of up to 26 bits each, packed into u64 words. This allows the schoolbook multiplication to use u64 arithmetic without overflow. The maximum intermediate product is ~2^52 * 5, well within u64 range.
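A 128-bit value splits into four 26-bit limbs plus one 24-bit top limb. The sketch below performs that split with 32-bit operations on little-endian input, in the style of common radix-2^26 Poly1305 implementations; `toLimbs26` is a hypothetical name and the module's own decomposition may be structured differently.

```typescript
// Splits 16 little-endian bytes into five radix-2^26 limbs
// (limb i holds bits 26*i .. 26*i+25 of the 128-bit value).
function toLimbs26(bytes: Uint8Array): number[] {
  if (bytes.length !== 16) throw new RangeError("expected 16 bytes");
  const v = new DataView(bytes.buffer, bytes.byteOffset, 16);
  const t0 = v.getUint32(0, true);
  const t1 = v.getUint32(4, true);
  const t2 = v.getUint32(8, true);
  const t3 = v.getUint32(12, true);
  return [
    t0 & 0x3ffffff,
    ((t0 >>> 26) | (t1 << 6)) & 0x3ffffff,
    ((t1 >>> 20) | (t2 << 12)) & 0x3ffffff,
    ((t2 >>> 14) | (t3 << 18)) & 0x3ffffff,
    t3 >>> 8, // only 24 bits remain for the top limb
  ];
}
```

Keeping each limb at 26 bits is what bounds every pairwise product below 2^52, leaving headroom in a 64-bit word for the five-term sums and the factor of 5.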

absorbBlock (internal). Absorbs one 16-byte block into the accumulator. Decomposes the block into 5 radix-2^26 limbs, adds them to h, multiplies by r, and performs a carry chain to normalize. Because p = 2^130 - 5, 2^130 is congruent to 5 mod p, so any partial product h[i] * r[j] whose limb index i + j wraps past 4 is folded back as h[i] * (5 * r[j]); this is why 5*r[1..4] is precomputed.

polyInit. Clamps r per RFC 8439 §2.5 (certain bits must be zero for security), decomposes r into limb form, precomputes 5*r[1..4], copies s, and zeroes the accumulator.

polyUpdate. Feeds message bytes through the accumulator. Handles partial blocks by buffering at POLY_BUF_OFFSET. Full 16-byte blocks set the 2^128 high bit before absorption.

polyFinal. Pads and absorbs any remaining partial block (with 0x01 byte and hibit = 0), normalizes h via a full carry chain, performs the constant-time conditional reduction mod p, and adds the s pad to produce the final 16-byte tag.

chacha20_simd_4x.ts: 4-Wide Inter-Block SIMD

Implements chachaEncryptChunk_simd and chachaDecryptChunk_simd using WebAssembly v128 SIMD operations. Processes four independent ChaCha20 blocks simultaneously: each v128 register lane holds word w from a different block, giving 4× useful work per instruction.

rotl32_4x / qr_4x. Scalar helpers for the tail path, renamed to avoid namespace conflict with the same-named private functions in chacha20.ts (same AS compilation unit).

computeBlock_scalar. Scalar block function duplicated here to service the tail path (inputs not a multiple of 256 bytes). Avoids adding an exportable symbol to chacha20.ts.

block4x(ctr: u32). Core SIMD routine. Operates entirely in 16 v128 locals (enabling JIT register allocation). Reconstructs initial-state for the add-back step from CHACHA_STATE_OFFSET memory + the ctr parameter, avoiding the need to save 16 additional v128 locals.

Note on negative result: A prior attempt at intra-block SIMD (processing one block via v128 with shuffles) measured 0.60–0.72× scalar across 4 attempts. Root causes: (a) no native i32x4.rotl — each rotation costs 3 v128 ops vs 1 scalar (640 rotations per block = 3× instruction count), (b) 6 shuffles per double round for the diagonal, (c) V8/JSC already register-promotes fixed-address loads for the scalar path. See docs/chacha_simd_bench.md for full analysis. Inter-block parallelism avoids all three issues.

wipe.ts: Buffer Zeroing

Single exported function wipeBuffers() that calls memory.fill(offset, 0, size) for every buffer region. Covers key material, nonces, state, intermediate computations, chunk buffers, XChaCha20 subkey, and the SIMD work buffer. Called by the TypeScript dispose() method.

Dependency Graph

buffers.ts
	^           ^           ^           ^
	|           |           |           |
chacha20.ts   poly1305.ts   wipe.ts   chacha20_simd_4x.ts
	|           |           |           |
	+-----------+-----------+-----------+
	                        |
	                     index.ts (re-exports all)

chacha20.ts, poly1305.ts, and chacha20_simd_4x.ts are independent of each other. They all import only from buffers.ts. The AEAD composition (calling chachaGenPolyKey then feeding ciphertext through polyUpdate) happens in the TypeScript layer, not in the WASM module. wipe.ts imports buffer offsets from buffers.ts and has no dependency on any algorithm implementation.


Error Conditions

| Function | Condition | Behavior |
| --- | --- | --- |
| chachaEncryptChunk(len) | len <= 0 or len > 65536 | Returns -1 |
| chachaDecryptChunk(len) | Same as above (alias) | Returns -1 |
| chachaEncryptChunk_simd(len) | len <= 0 or len > 65536 | Returns -1 |
| chachaDecryptChunk_simd(len) | Same as above (alias) | Returns -1 |
| polyUpdate(len) | len <= 0 | No-op (returns immediately) |
| All other functions | | No error returns. Preconditions are the caller's responsibility. |

Implicit constraints enforced by the TypeScript layer:

  • Keys must be exactly 32 bytes (256-bit).
  • ChaCha20 nonces must be exactly 12 bytes (96-bit).
  • XChaCha20 nonces must be exactly 24 bytes (192-bit).
  • polyUpdate staging buffer is 64 bytes; the TS wrapper must not write more than 64 bytes per call.
  • The block counter is u32; a single (key, nonce) pair supports at most 2^32 blocks = 256 GB of keystream. Exceeding this wraps the counter to 0, which produces keystream reuse.
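A wrapper can guard against counter wrap before calling into WASM. The following sketch shows one possible check; the function names are hypothetical and the source does not say whether the actual wrapper performs this test.

```typescript
const BLOCK_BYTES = 64;
const COUNTER_SPACE = 0x100000000; // 2^32 distinct block counter values

// Number of keystream blocks consumed by `len` bytes.
function blocksFor(len: number): number {
  return Math.ceil(len / BLOCK_BYTES);
}

// True when encrypting `len` bytes starting at block `counter` would need a
// counter value beyond 2^32 - 1, which would wrap and reuse keystream.
function wouldWrapCounter(counter: number, len: number): boolean {
  return counter + blocksFor(len) > COUNTER_SPACE;
}

// Total keystream available to one (key, nonce) pair: 2^32 * 64 = 2^38 bytes.
const maxKeystreamBytes = COUNTER_SPACE * BLOCK_BYTES;
```

Both quantities stay far below 2^53, so ordinary JavaScript numbers represent them exactly.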

Cross-References

| Document | Description |
| --- | --- |
| index | Project documentation index |
| architecture | Architecture overview, module relationships, buffer layouts, and build pipeline |
| chacha20 | TypeScript wrapper classes (ChaCha20, Poly1305, ChaCha20Poly1305, XChaCha20Poly1305, XChaCha20Cipher) |
| asm_serpent | Alternative symmetric cipher (Serpent WASM module) |
| chacha_audit.md | XChaCha20-Poly1305 implementation audit |