asm_chacha - xero/leviathan-crypto GitHub Wiki
This low-level reference details the ChaCha20 AssemblyScript source and WASM exports, intended for those auditing, contributing to, or building against the raw module. Most consumers should instead use the TypeScript wrapper or the higher-level AEAD classes.
This module provides the primitives for the full ChaCha20-Poly1305 AEAD family in one WASM binary with shared linear memory.
ChaCha20 (RFC 8439 §2.3-2.4) is a stream cipher. 256-bit key, 96-bit nonce, 32-bit block counter, 20 rounds (10 double rounds alternating column and diagonal quarter-rounds).
Poly1305 (RFC 8439 §2.5) is a one-time MAC. It authenticates messages of arbitrary length using a 256-bit one-time key (r || s). Internally it uses radix-2^26 representation with u64 limbs to avoid overflow during multiplication.
ChaCha20-Poly1305 AEAD (RFC 8439 §2.8) combines the two. The TypeScript layer orchestrates the composition by calling chachaGenPolyKey, chachaEncryptChunk, and polyInit/polyUpdate/polyFinal in sequence to produce authenticated ciphertext. The WASM module exports the primitives; TypeScript drives the construction.
HChaCha20 (draft-irtf-cfrg-xchacha §2.1) derives a 256-bit subkey from a key and 128-bit nonce prefix. XChaCha20 uses this to extend the nonce space to 192 bits, making random nonce generation practical for large message volumes.
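As a rough illustration of what the AEAD composition has to authenticate, the sketch below lays out the Poly1305 input as RFC 8439 §2.8 defines it: AAD, zero padding to a 16-byte boundary, ciphertext, zero padding, then both lengths as 64-bit little-endian integers. The names pad16 and buildMacData are hypothetical and not part of this module; the actual wrapper may well stream these pieces through polyUpdate instead of building one buffer.

```typescript
// Hypothetical helpers (not part of the module): build the byte string that
// RFC 8439 §2.8 feeds to Poly1305 for ChaCha20-Poly1305.
function pad16(len: number): number {
  // Number of zero bytes needed to reach the next 16-byte boundary.
  return (16 - (len % 16)) % 16;
}

function buildMacData(aad: Uint8Array, ciphertext: Uint8Array): Uint8Array {
  const aadPad = pad16(aad.length);
  const ctPad = pad16(ciphertext.length);
  const out = new Uint8Array(aad.length + aadPad + ciphertext.length + ctPad + 16);
  let pos = 0;
  out.set(aad, pos);
  pos += aad.length + aadPad; // padding bytes are already zero
  out.set(ciphertext, pos);
  pos += ciphertext.length + ctPad;
  // Lengths are 64-bit little-endian. This sketch assumes lengths below 2^32,
  // so only the low 32 bits of each field are written.
  const view = new DataView(out.buffer);
  view.setUint32(pos, aad.length, true);
  view.setUint32(pos + 8, ciphertext.length, true);
  return out;
}
```

The resulting buffer would then be fed to Poly1305 in pieces no larger than the staging buffer.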
All cryptography runs in WASM. The TypeScript layer writes inputs to linear memory, calls WASM exports, and reads outputs. It implements no algorithm logic.
Constant-time by construction. ChaCha20 uses only ARX operations (add, rotate, XOR). No lookups, no secret-dependent branches, no variable-time arithmetic. This makes it inherently resistant to cache-timing side channels, a property that contributed to its adoption in TLS 1.3 as the alternative to AES.
Poly1305 accumulator arithmetic. Poly1305 uses radix-2^26 limbs stored in u64 words. Schoolbook multiplication over five limbs with reduction modulo p = 2^130 - 5. The u64 intermediate products avoid overflow without needing multi-precision carries. The final reduction in polyFinal uses constant-time conditional select (mask-and-OR) to choose between h and h - p, avoiding branches on secret values.
Nonce reuse is catastrophic. Reusing a (key, nonce) pair with standard ChaCha20 (96-bit nonce) leaks the XOR of two plaintexts and completely breaks Poly1305 authentication. With random 96-bit nonces, the birthday bound makes a collision likely after approximately 2^48 messages under the same key, and the risk is already non-negligible well before that point. If you require random nonces, use XChaCha20-Poly1305 instead.
XChaCha20 extends the nonce to 192 bits. HChaCha20 derives a per-message subkey from the first 128 bits of a 192-bit nonce, then ChaCha20 encrypts with the remaining 64 bits (zero-padded to 96 bits). With a 192-bit nonce space, the birthday bound moves out to roughly 2^96 messages, so random nonce generation is safe for any practical message volume.
wipeBuffers() zeroes all buffer regions. Every buffer in the module (keys, nonces, counters, keystream blocks, the ChaCha20 state which contains a key copy in words 4-11, Poly1305 internal state h, r, 5*r, s, chunk buffers, and XChaCha20 subkey material) gets overwritten with zeros. The TypeScript dispose() method must call this unconditionally. Key material and intermediate state must not persist in WASM memory after an operation completes.
Bare ChaCha20 is unauthenticated. The functions chachaEncryptChunk and chachaDecryptChunk provide confidentiality only. Without Poly1305 authentication, ciphertext is malleable—an attacker can flip plaintext bits by flipping ciphertext bits. Always use ChaCha20-Poly1305 AEAD or pair bare ChaCha20 with HMAC in an Encrypt-then-MAC construction.
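The nonce-collision figures in the notes above follow from the birthday approximation p ≈ 1 - exp(-n^2 / 2N), where n is the number of messages and N the size of the nonce space. The sketch below evaluates that approximation; nonceCollisionProbability is a hypothetical helper for illustration, not something the module or wrapper exposes.

```typescript
// Approximate probability of at least one collision among 2^log2Messages
// uniformly random nonces of nonceBits bits (birthday approximation).
function nonceCollisionProbability(log2Messages: number, nonceBits: number): number {
  // n^2 / (2 * 2^nonceBits), evaluated in the exponent to avoid overflow.
  const x = Math.pow(2, 2 * log2Messages - nonceBits - 1);
  return 1 - Math.exp(-x);
}

const p96 = nonceCollisionProbability(48, 96);   // 2^48 messages, 96-bit nonce
const p192 = nonceCollisionProbability(96, 192); // 2^96 messages, 192-bit nonce
```

Under this approximation both calls give roughly 39%, which is why the practical message limit for random nonces sits far below the birthday bound itself.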
See ChaCha20-Poly1305 implementation audit for algorithm correctness verifications.
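Most of these functions return fixed i32 offsets into linear memory; getModuleId(), getChunkSize(), and getMemoryPages() return an identifier, a size, and a page count instead. The TypeScript layer uses the offsets to determine where to write inputs and read outputs.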
| Function | Returns | Description |
|---|---|---|
| getModuleId(): i32 | 1 | Unique module identifier |
| getKeyOffset(): i32 | 0 | 256-bit ChaCha20 key (32 bytes) |
| getChachaNonceOffset(): i32 | 32 | 96-bit nonce (12 bytes, 3 x u32 LE) |
| getChachaCtrOffset(): i32 | 44 | Block counter (u32) |
| getChachaBlockOffset(): i32 | 48 | Keystream block output (64 bytes) |
| getChachaStateOffset(): i32 | 112 | 16 x u32 initial state (64 bytes) |
| getChunkPtOffset(): i32 | 176 | Plaintext chunk buffer (64 KB) |
| getChunkCtOffset(): i32 | 65712 | Ciphertext chunk buffer (64 KB) |
| getChunkSize(): i32 | 65536 | Max chunk size in bytes |
| getPolyKeyOffset(): i32 | 131248 | Poly1305 one-time key r\|\|s (32 bytes) |
| getPolyMsgOffset(): i32 | 131280 | Message staging buffer (64 bytes) |
| getPolyBufOffset(): i32 | 131344 | Partial-block accumulator (16 bytes) |
| getPolyBufLenOffset(): i32 | 131360 | Bytes in partial block (u32) |
| getPolyTagOffset(): i32 | 131364 | Output MAC tag (16 bytes) |
| getPolyHOffset(): i32 | 131380 | Accumulator h (5 x u64, 40 bytes) |
| getPolyROffset(): i32 | 131420 | Clamped r (5 x u64, 40 bytes) |
| getPolyRsOffset(): i32 | 131460 | Precomputed 5*r[1..4] (4 x u64, 32 bytes) |
| getPolySOffset(): i32 | 131492 | s pad (4 x u32, 16 bytes) |
| getXChaChaNonceOffset(): i32 | 131508 | Full 24-byte XChaCha20 nonce |
| getXChaChaSubkeyOffset(): i32 | 131532 | HChaCha20 output subkey (32 bytes) |
| getChachaSimdWorkOffset(): i32 | 131568 | 4-wide inter-block SIMD work buffer (256 bytes) |
| getMemoryPages(): i32 | (runtime) | Current WASM linear memory size in pages |
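A caller could use the getters roughly as sketched below to place inputs in linear memory. The getter names come from the table above; the export name memory, the number return types, and the helper writeChaChaInputs itself are assumptions made for illustration.

```typescript
// Assumed export shape: the "memory" export name and the return types are
// assumptions; only the getter names are taken from the table above.
interface ChaChaInputExports {
  memory: { buffer: ArrayBuffer };
  getKeyOffset(): number;
  getChachaNonceOffset(): number;
  getChachaCtrOffset(): number;
}

// Hypothetical helper: writes key, nonce, and counter at the queried offsets.
function writeChaChaInputs(
  wasm: ChaChaInputExports,
  key: Uint8Array,
  nonce: Uint8Array,
  counter: number
): void {
  if (key.length !== 32 || nonce.length !== 12) {
    throw new Error("key must be 32 bytes and nonce 12 bytes");
  }
  const bytes = new Uint8Array(wasm.memory.buffer);
  bytes.set(key, wasm.getKeyOffset());
  bytes.set(nonce, wasm.getChachaNonceOffset());
  // The counter is a u32; WASM linear memory is little-endian.
  new DataView(wasm.memory.buffer).setUint32(wasm.getChachaCtrOffset(), counter, true);
}
```

Querying the offsets at runtime keeps the caller independent of the concrete layout.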
chachaLoadKey() builds the 16-word ChaCha20 state matrix from the current contents of the key, nonce, and counter buffers (RFC 8439 §2.3):
State layout (16 x u32):
words 0-3: constants ("expand 32-byte k")
words 4-11: key (from KEY_OFFSET, 8 x u32 LE)
word 12: counter (from CHACHA_CTR_OFFSET)
words 13-15: nonce (from CHACHA_NONCE_OFFSET, 3 x u32 LE)
Precondition: Write the 32-byte key to KEY_OFFSET, the 12-byte nonce to
CHACHA_NONCE_OFFSET, and the 4-byte counter to CHACHA_CTR_OFFSET before
calling.
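The state layout above can be modelled in plain TypeScript. The sketch below is not the module's code; readU32LE and buildState are hypothetical names, and the example input is the key, nonce, and counter from the RFC 8439 §2.3.2 test vector.

```typescript
// Reads a little-endian u32 from a byte array.
function readU32LE(bytes: Uint8Array, offset: number): number {
  return (bytes[offset] | (bytes[offset + 1] << 8) |
          (bytes[offset + 2] << 16) | (bytes[offset + 3] << 24)) >>> 0;
}

// Hypothetical pure-TypeScript model of the 16-word state layout.
function buildState(key: Uint8Array, nonce: Uint8Array, counter: number): Uint32Array {
  if (key.length !== 32 || nonce.length !== 12) {
    throw new Error("key must be 32 bytes and nonce 12 bytes");
  }
  const state = new Uint32Array(16);
  // "expand 32-byte k" as four little-endian words.
  state[0] = 0x61707865; state[1] = 0x3320646e;
  state[2] = 0x79622d32; state[3] = 0x6b206574;
  for (let i = 0; i < 8; i++) state[4 + i] = readU32LE(key, 4 * i);
  state[12] = counter >>> 0;
  for (let i = 0; i < 3; i++) state[13 + i] = readU32LE(nonce, 4 * i);
  return state;
}

// Inputs from the RFC 8439 §2.3.2 test vector.
const exampleKey = new Uint8Array(32);
for (let i = 0; i < 32; i++) exampleKey[i] = i;
const exampleNonce = new Uint8Array([0, 0, 0, 0x09, 0, 0, 0, 0x4a, 0, 0, 0, 0]);
const exampleState = buildState(exampleKey, exampleNonce, 1);
```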
chachaSetCounter(ctr) sets both the counter buffer (CHACHA_CTR_OFFSET) and word 12 of the state
matrix to ctr. Use this to seek to an arbitrary block position within a stream.
A separate reset export resets the counter to 1 (the standard initial counter value for encryption per
RFC 8439 §2.4). It calls chachaSetCounter(1) internally.
chachaEncryptChunk(len) encrypts len bytes of plaintext from CHUNK_PT_OFFSET into ciphertext at
CHUNK_CT_OFFSET. It processes data in 64-byte keystream blocks, and the block counter
auto-increments after each block.
- Input: len bytes at CHUNK_PT_OFFSET (1 <= len <= 65536)
- Output: len bytes at CHUNK_CT_OFFSET
- Returns: len on success, -1 if len is out of range
- Side effect: Block counter advances by ceil(len / 64) blocks. Both the state matrix (word 12) and CHACHA_CTR_OFFSET are updated.
Precondition: Call chachaLoadKey() first to initialize the state matrix.
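chachaDecryptChunk(len) is an alias for chachaEncryptChunk. ChaCha20 is a stream cipher; encryption and decryption are identical (XOR with keystream). It reads from CHUNK_PT_OFFSET and writes to CHUNK_CT_OFFSET, so when decrypting, the ciphertext goes in at CHUNK_PT_OFFSET and the recovered plaintext comes out at CHUNK_CT_OFFSET.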
chachaEncryptChunk_simd(len) is the 4-wide inter-block SIMD variant of chachaEncryptChunk. It processes four
independent 64-byte keystream blocks simultaneously using WebAssembly v128
operations. Each v128 register lane holds word w from a different block
(counter values ctr, ctr+1, ctr+2, ctr+3). For inputs >= 256 bytes,
this produces 2-3× higher throughput than the scalar path on JIT-warmed V8 and
SpiderMonkey runtimes.
- Input: len bytes at CHUNK_PT_OFFSET (1 <= len <= 65536)
- Output: len bytes at CHUNK_CT_OFFSET
- Returns: len on success, -1 if len is out of range
- Side effect: Block counter advances by ceil(len / 64) blocks. Both the state matrix (word 12) and CHACHA_CTR_OFFSET are updated.
SIMD inner loop (entered when processed + 256 <= len):
- Loads words 0-11, 13-15 from CHACHA_STATE_OFFSET as i32x4.splat vectors (all four lanes identical).
- Constructs the counter vector r12 = [ctr, ctr+1, ctr+2, ctr+3] using i32x4.replace_lane.
- Applies 10 double rounds (column + diagonal quarter-rounds) entirely in v128 registers using v128.add<i32>, v128.xor, i32x4.shl, i32x4.shr_u.
- Adds back the initial state words. Word 12 reconstructed from the ctr parameter to avoid storing an extra v128 for the initial counter vector.
- Deinterleaves via i32x4.extract_lane (64 stores) to write all four blocks sequentially into CHACHA_SIMD_WORK_OFFSET.
- XORs 256 bytes of plaintext using 16 v128.xor + v128.load/v128.store instructions. All three buffer addresses (PT, CT, work) are 16-byte aligned.
- Advances counter by 4, writes to both CHACHA_STATE_OFFSET + 48 and CHACHA_CTR_OFFSET, increments processed by 256.
Scalar tail (entered when fewer than 256 bytes remain):
Falls back to the scalar block function (computeBlock_scalar) for each
remaining 64-byte block. The counter continues from where the SIMD loop left off.
Partial final blocks are handled byte-by-byte.
Precondition: Call chachaLoadKey() first to initialize the state matrix.
This function requires the WASM binary to be compiled with --enable simd.
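The division of work between the SIMD loop and the scalar tail can be modelled as below. This is a planning sketch based only on the loop conditions described above; ChunkPlan and planSimdChunk are hypothetical names and do not exist in the module.

```typescript
// Hypothetical model of how a chunk of len bytes is divided between the
// 256-byte SIMD loop and the 64-byte scalar tail.
interface ChunkPlan {
  simdIterations: number; // passes through the 4-block SIMD loop
  scalarBlocks: number;   // 64-byte tail blocks (the last may be partial)
  counterAdvance: number; // total blocks consumed, i.e. ceil(len / 64)
}

function planSimdChunk(len: number): ChunkPlan | null {
  if (len <= 0 || len > 65536) return null; // the export returns -1 here
  const simdIterations = Math.floor(len / 256);
  const tailBytes = len - simdIterations * 256;
  const scalarBlocks = Math.ceil(tailBytes / 64);
  return {
    simdIterations,
    scalarBlocks,
    counterAdvance: simdIterations * 4 + scalarBlocks,
  };
}

// A 1000-byte chunk: 3 SIMD passes (768 bytes), then 4 scalar blocks (232 bytes).
const plan1000 = planSimdChunk(1000);
```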
chachaDecryptChunk_simd(len) is an alias for chachaEncryptChunk_simd. ChaCha20 is a stream cipher; encryption and decryption are identical. It reads from CHUNK_PT_OFFSET and writes to CHUNK_CT_OFFSET. Provided for API symmetry; the call is forwarded directly.
chachaGenPolyKey() generates the one-time Poly1305 key by running ChaCha20 with counter = 0
(RFC 8439 §2.6). It sets state word 12 to 0, generates one keystream block, and
copies the first 32 bytes to POLY_KEY_OFFSET.
Precondition: The state matrix must already contain the correct key and nonce
(call chachaLoadKey() first). The counter value in the state is overwritten
to 0 for this operation.
Important: This consumes block 0. After calling chachaGenPolyKey(), set the
counter to 1 before encrypting plaintext to avoid keystream reuse.
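The ordering constraint can be sketched as a call sequence. The export names are taken from this page, but the interface ChaChaExports, its return types, and the helper genPolyKeyThenEncrypt are assumptions for illustration; the real wrapper may structure this differently.

```typescript
// Assumed shape of the relevant exports; return types are guesses based on
// the descriptions in this document.
interface ChaChaExports {
  chachaLoadKey(): void;
  chachaGenPolyKey(): void;
  chachaSetCounter(ctr: number): void;
  chachaEncryptChunk(len: number): number;
}

// Hypothetical helper: derives the Poly1305 key, then encrypts one chunk that
// the caller has already written to CHUNK_PT_OFFSET.
function genPolyKeyThenEncrypt(wasm: ChaChaExports, len: number): number {
  wasm.chachaLoadKey();     // key, nonce, and counter buffers must be filled first
  wasm.chachaGenPolyKey();  // consumes keystream block 0
  wasm.chachaSetCounter(1); // plaintext encryption starts at block 1
  const written = wasm.chachaEncryptChunk(len);
  if (written !== len) throw new Error("chunk length out of range");
  return written;
}
```

Setting the counter explicitly after key generation avoids relying on whatever value the counter buffer held before.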
hchacha20() performs HChaCha20 subkey derivation (draft-irtf-cfrg-xchacha §2.1). It computes a 256-bit
subkey from the key at KEY_OFFSET and the first 16 bytes of the nonce at
XCHACHA_NONCE_OFFSET.
The state is initialized as:
words 0-3: constants ("expand 32-byte k")
words 4-11: key (from KEY_OFFSET)
words 12-15: first 16 bytes of XChaCha20 nonce (from XCHACHA_NONCE_OFFSET)
After 10 double rounds, the output is words 0-3 and 12-15 of the working state
(NOT added back to the initial state; this is the key difference from the standard ChaCha20 block function). The 32-byte result is written to
XCHACHA_SUBKEY_OFFSET.
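The difference from the standard block function is easiest to see side by side. The following is a minimal pure-TypeScript model, not the module's AssemblyScript; the names rotl, quarterRound, twentyRounds, blockWords, and hchacha20Words are hypothetical. Both functions take an already-assembled 16-word initial state.

```typescript
function rotl(x: number, n: number): number {
  return ((x << n) | (x >>> (32 - n))) >>> 0;
}

// ChaCha quarter-round on four words of s (RFC 8439 §2.1). Uint32Array
// stores wrap modulo 2^32, which gives the required 32-bit addition.
function quarterRound(s: Uint32Array, a: number, b: number, c: number, d: number): void {
  s[a] += s[b]; s[d] = rotl(s[d] ^ s[a], 16);
  s[c] += s[d]; s[b] = rotl(s[b] ^ s[c], 12);
  s[a] += s[b]; s[d] = rotl(s[d] ^ s[a], 8);
  s[c] += s[d]; s[b] = rotl(s[b] ^ s[c], 7);
}

// Ten double rounds on a working copy of the initial state.
function twentyRounds(init: Uint32Array): Uint32Array {
  const w = new Uint32Array(init);
  for (let i = 0; i < 10; i++) {
    quarterRound(w, 0, 4, 8, 12); quarterRound(w, 1, 5, 9, 13);
    quarterRound(w, 2, 6, 10, 14); quarterRound(w, 3, 7, 11, 15);
    quarterRound(w, 0, 5, 10, 15); quarterRound(w, 1, 6, 11, 12);
    quarterRound(w, 2, 7, 8, 13); quarterRound(w, 3, 4, 9, 14);
  }
  return w;
}

// Standard block function: working state plus initial state.
function blockWords(init: Uint32Array): Uint32Array {
  const w = twentyRounds(init);
  for (let i = 0; i < 16; i++) w[i] += init[i];
  return w;
}

// HChaCha20: words 0-3 and 12-15 of the working state, with no add-back.
function hchacha20Words(init: Uint32Array): Uint32Array {
  const w = twentyRounds(init);
  const out = new Uint32Array(8);
  for (let i = 0; i < 4; i++) {
    out[i] = w[i];
    out[4 + i] = w[12 + i];
  }
  return out;
}
```

For the same 16-word input, the HChaCha20 output equals the corresponding block-function words with the initial state subtracted back out.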
Usage in XChaCha20: The TypeScript wrapper calls hchacha20() to derive the
subkey, copies it to KEY_OFFSET, constructs the inner 96-bit nonce from bytes
16-23 of the original 24-byte nonce (preceded by four zero bytes to reach 12 bytes, as the draft specifies), then proceeds
with standard ChaCha20-Poly1305.
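The nonce handling can be sketched as follows. splitXChaChaNonce is a hypothetical helper, and the placement of the four zero bytes at the front of the inner nonce follows draft-irtf-cfrg-xchacha; how the wrapper actually assembles the buffer is an assumption.

```typescript
// Hypothetical helper: splits a 24-byte XChaCha20 nonce into the 16-byte
// HChaCha20 input and the 12-byte inner ChaCha20 nonce.
function splitXChaChaNonce(nonce24: Uint8Array): { hInput: Uint8Array; inner: Uint8Array } {
  if (nonce24.length !== 24) throw new Error("XChaCha20 nonce must be 24 bytes");
  const hInput = new Uint8Array(nonce24.subarray(0, 16)); // copy of bytes 0-15
  const inner = new Uint8Array(12);                        // bytes 0-3 stay zero
  inner.set(nonce24.subarray(16, 24), 4);                  // bytes 16-23 fill positions 4-11
  return { hInput, inner };
}
```

The hInput part goes to XCHACHA_NONCE_OFFSET for HChaCha20, and inner becomes the nonce for the subsequent ChaCha20-Poly1305 pass.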
polyInit() initializes Poly1305 state from the 32-byte one-time key at POLY_KEY_OFFSET
(RFC 8439 §2.5).
- Clamps r (first 16 bytes): clears bits 4,5,6,7 of bytes 3,7,11,15 and bits 0,1 of bytes 4,8,12. This restricts r to the required form for Poly1305 security.
- Decomposes r into 5 radix-2^26 limbs stored at POLY_R_OFFSET.
- Precomputes 5*r[1..4] at POLY_RS_OFFSET (used in the multiplication step for modular reduction).
- Copies s (bytes 16-31 of the key) to POLY_S_OFFSET.
- Zeroes the accumulator h, partial-block buffer, and partial-block length.
Precondition: Write the 32-byte one-time key to POLY_KEY_OFFSET. For AEAD,
this is produced by chachaGenPolyKey().
Warning: polyInit() clamps r in-place at POLY_KEY_OFFSET. The first 16
bytes of the key buffer are modified.
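The r-side steps of polyInit (clamp, limb decomposition, 5*r precomputation) can be modelled as below. This is a sketch, not the module's code: clampR, toLimbs26, and initR are hypothetical names, JS numbers stand in for the u64 limbs, and clampR returns a copy where the module clamps in place.

```typescript
// Clears the bits that RFC 8439 §2.5 requires to be zero in r.
function clampR(r: Uint8Array): Uint8Array {
  if (r.length !== 16) throw new Error("r must be 16 bytes");
  const out = new Uint8Array(r); // copy; the module clamps in place instead
  out[3] &= 0x0f; out[7] &= 0x0f; out[11] &= 0x0f; out[15] &= 0x0f; // clear bits 4-7
  out[4] &= 0xfc; out[8] &= 0xfc; out[12] &= 0xfc;                  // clear bits 0-1
  return out;
}

// Splits a 16-byte little-endian value into five radix-2^26 limbs.
function toLimbs26(bytes: Uint8Array): number[] {
  const view = new DataView(bytes.buffer, bytes.byteOffset, 16);
  const t0 = view.getUint32(0, true), t1 = view.getUint32(4, true);
  const t2 = view.getUint32(8, true), t3 = view.getUint32(12, true);
  return [
    t0 & 0x3ffffff,                         // bits 0-25
    ((t0 >>> 26) | (t1 << 6)) & 0x3ffffff,  // bits 26-51
    ((t1 >>> 20) | (t2 << 12)) & 0x3ffffff, // bits 52-77
    ((t2 >>> 14) | (t3 << 18)) & 0x3ffffff, // bits 78-103
    t3 >>> 8,                               // bits 104-127
  ];
}

function initR(rBytes: Uint8Array): { r: number[]; r5: number[] } {
  const r = toLimbs26(clampR(rBytes));
  // 5*r[1..4]; r[0] never appears in a wrapped term, so it needs no multiple.
  const r5 = [r[1] * 5, r[2] * 5, r[3] * 5, r[4] * 5];
  return { r, r5 };
}
```

Because of the clamp, every limb of r has several bits forced to zero, which is part of what keeps the later products within range.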
polyUpdate(len) feeds len bytes from POLY_MSG_OFFSET into the Poly1305 accumulator.
- Handles partial blocks: data shorter than 16 bytes is buffered at POLY_BUF_OFFSET. When the buffer reaches 16 bytes, it is absorbed.
- Full 16-byte blocks are absorbed directly from POLY_MSG_OFFSET.
- Full blocks set the high bit (2^128) before absorption; a trailing partial block does not (polyFinal pads it with a 0x01 byte instead).
- Input: len bytes at POLY_MSG_OFFSET (max 64 bytes per call, matching the staging buffer size)
- If len <= 0, returns immediately (no-op).
Can be called multiple times to process a message incrementally.
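Feeding a longer message therefore means cutting it into pieces of at most 64 bytes. A minimal sketch of that slicing, with the hypothetical helper name stagingSlices:

```typescript
// Hypothetical helper: cuts a message into pieces that fit the 64-byte
// staging buffer, one piece per polyUpdate call.
function stagingSlices(message: Uint8Array, maxPerCall: number = 64): Uint8Array[] {
  const pieces: Uint8Array[] = [];
  for (let pos = 0; pos < message.length; pos += maxPerCall) {
    const end = Math.min(pos + maxPerCall, message.length);
    pieces.push(message.subarray(pos, end)); // view, no copy
  }
  return pieces;
}

// A 150-byte message yields pieces of 64, 64, and 22 bytes.
const pieces150 = stagingSlices(new Uint8Array(150));
```

Each piece would be written at POLY_MSG_OFFSET and followed by polyUpdate(piece.length).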
polyFinal() finalizes the Poly1305 tag and writes it to POLY_TAG_OFFSET (16 bytes).
- If there is a partial block in the buffer, pads it with a 0x01 byte followed by zeros and absorbs it with hibit = 0 (RFC 8439 §2.5.1).
- Performs a full carry chain on h to normalize all limbs.
- Computes the conditional subtraction: if h >= p, reduces to h - p. Uses a constant-time mask-and-select (no branching on secret values).
- Recombines the 5 limbs into two u64 halves (lo, hi).
- Adds the s pad: tag = (h + s) mod 2^128.
- Stores the 16-byte tag at POLY_TAG_OFFSET in little-endian.
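The constant-time conditional subtraction can be sketched on five normalized 26-bit limbs. This is a simplified model under stated assumptions: it uses JS numbers and 32-bit bitwise operations where the module uses u64 words, it expects the limbs to have been carried already, and reduceFinal is a hypothetical name.

```typescript
const LIMB_MASK = 0x3ffffff; // 26 bits

// Returns h mod p for p = 2^130 - 5, given normalized limbs with h < 2^130.
function reduceFinal(h: number[]): number[] {
  // g = h + 5 - 2^130, computed limb by limb; g >= 0 exactly when h >= p.
  const g: number[] = [0, 0, 0, 0, 0];
  let carry = 5;
  for (let i = 0; i < 4; i++) {
    const t = h[i] + carry;
    carry = t >>> 26;
    g[i] = t & LIMB_MASK;
  }
  g[4] = (h[4] + carry - (1 << 26)) | 0; // negative exactly when h < p

  // mask is all ones when h >= p (select g) and all zeros otherwise (select h).
  const mask = ((g[4] >>> 31) - 1) | 0;
  const out: number[] = [];
  for (let i = 0; i < 5; i++) {
    out.push(((h[i] & ~mask) | (g[i] & mask)) >>> 0);
  }
  return out;
}
```

Both candidates are always computed and the choice is made with bitwise masks, so no branch depends on the secret accumulator value.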
wipeBuffers() zeroes every buffer region in the module via memory.fill(). It covers:
- ChaCha20: key (32B), nonce (12B), counter (4B), keystream block (64B), state matrix (64B)
- Chunk buffers: plaintext (64KB), ciphertext (64KB)
- Poly1305: one-time key (32B), message staging (64B), partial block (16B), partial block length (4B), tag (16B), accumulator h (40B), clamped r (40B), precomputed 5*r (32B), s pad (16B)
- XChaCha20: nonce (24B), subkey (32B)
- SIMD work buffer (256B)
Must be called by the TypeScript dispose() method to prevent key material from
persisting in WASM linear memory.
All offsets are byte offsets from the start of linear memory (offset 0). The module's total memory footprint is 131,824 bytes, which fits within 3 x 64 KB pages (196,608 bytes).
| Offset | Size (bytes) | Name | Description |
|---|---|---|---|
| 0 | 32 | KEY_BUFFER | ChaCha20 256-bit key |
| 32 | 12 | CHACHA_NONCE_BUFFER | 96-bit nonce (3 x u32, LE) |
| 44 | 4 | CHACHA_CTR_BUFFER | u32 block counter |
| 48 | 64 | CHACHA_BLOCK_BUFFER | 64-byte keystream block output |
| 112 | 64 | CHACHA_STATE_BUFFER | 16 x u32 initial state matrix |
| 176 | 65,536 | CHUNK_PT_BUFFER | Streaming plaintext input |
| 65,712 | 65,536 | CHUNK_CT_BUFFER | Streaming ciphertext output |
| 131,248 | 32 | POLY_KEY_BUFFER | One-time Poly1305 key (r \|\| s) |
| 131,280 | 64 | POLY_MSG_BUFFER | Message staging (<= 64 bytes per polyUpdate) |
| 131,344 | 16 | POLY_BUF_BUFFER | Partial-block accumulator |
| 131,360 | 4 | POLY_BUF_LEN_BUFFER | Bytes in partial block (u32) |
| 131,364 | 16 | POLY_TAG_BUFFER | 16-byte output MAC tag |
| 131,380 | 40 | POLY_H_BUFFER | Accumulator h (5 x u64 limbs) |
| 131,420 | 40 | POLY_R_BUFFER | Clamped r (5 x u64 limbs) |
| 131,460 | 32 | POLY_RS_BUFFER | Precomputed 5*r[1..4] (4 x u64) |
| 131,492 | 16 | POLY_S_BUFFER | s pad (4 x u32) |
| 131,508 | 24 | XCHACHA_NONCE_BUFFER | Full 24-byte XChaCha20 nonce |
| 131,532 | 32 | XCHACHA_SUBKEY_BUFFER | HChaCha20 output subkey |
| 131,564 | 4 | (alignment padding) | Pad to 16-byte boundary (131564 % 16 = 12) |
| 131,568 | 256 | CHACHA_SIMD_WORK_BUFFER | 4-wide inter-block SIMD work buffer |
| 131,824 | | END | Total < 196,608 (3 pages) |
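The offsets in this table are simply the running sum of the region sizes, which can be checked mechanically. The sketch below copies the sizes from the table; REGIONS and computeOffsets are hypothetical names used only for this check.

```typescript
// Region sizes in layout order, copied from the table above.
const REGIONS: Array<[string, number]> = [
  ["KEY_BUFFER", 32], ["CHACHA_NONCE_BUFFER", 12], ["CHACHA_CTR_BUFFER", 4],
  ["CHACHA_BLOCK_BUFFER", 64], ["CHACHA_STATE_BUFFER", 64],
  ["CHUNK_PT_BUFFER", 65536], ["CHUNK_CT_BUFFER", 65536],
  ["POLY_KEY_BUFFER", 32], ["POLY_MSG_BUFFER", 64], ["POLY_BUF_BUFFER", 16],
  ["POLY_BUF_LEN_BUFFER", 4], ["POLY_TAG_BUFFER", 16], ["POLY_H_BUFFER", 40],
  ["POLY_R_BUFFER", 40], ["POLY_RS_BUFFER", 32], ["POLY_S_BUFFER", 16],
  ["XCHACHA_NONCE_BUFFER", 24], ["XCHACHA_SUBKEY_BUFFER", 32],
  ["(alignment padding)", 4], ["CHACHA_SIMD_WORK_BUFFER", 256],
];

// Recomputes each offset as the running sum of the preceding sizes.
function computeOffsets(
  regions: Array<[string, number]>
): { offsets: { [name: string]: number }; end: number } {
  const offsets: { [name: string]: number } = {};
  let cursor = 0;
  for (let i = 0; i < regions.length; i++) {
    offsets[regions[i][0]] = cursor;
    cursor += regions[i][1];
  }
  return { offsets, end: cursor };
}

const layout = computeOffsets(REGIONS);
const pagesNeeded = Math.ceil(layout.end / 65536); // 64 KB WASM pages
```

The three buffers touched by v128 loads and stores (plaintext, ciphertext, SIMD work) all land on 16-byte boundaries in this layout.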
The module is composed of five source files, re-exported through index.ts and compiled into a single chacha20.wasm
binary (built with --enable simd):
buffers.ts defines all buffer offsets as i32 constants starting at offset 0. It exports
getter functions for each offset so the TypeScript layer can query them at
runtime without hardcoding addresses. It also exports getModuleId() (returns 1)
and getMemoryPages() (returns memory.size()).
No dynamic allocation. No memory.grow(). The layout is fixed at compile time.
chacha20.ts implements the following directly from RFC 8439:
rotl32. Left rotation (inlined). ChaCha20 uses left rotation exclusively, unlike some other ARX constructions.
qr (quarter-round, RFC 8439 §2.1). The fundamental ChaCha20 operation. Four ARX steps with rotations of 16, 12, 8, 7 bits. Operates on four u32 words at computed offsets in the state buffer.
doubleRound. One column round (indices 0,4,8,12 / 1,5,9,13 / 2,6,10,14 / 3,7,11,15) followed by one diagonal round (0,5,10,15 / 1,6,11,12 / 2,7,8,13 / 3,4,9,14). Applied 10 times for 20 total rounds.
block. The ChaCha20 block function (RFC 8439 §2.3). Copies state to the block buffer, applies 10 double rounds, then adds the original state back word-by-word. Produces one 64-byte keystream block.
chachaEncryptChunk. Streaming encryption. Iterates over the plaintext in 64-byte blocks, XORing each byte with the corresponding keystream byte. Auto-increments the counter after each block.
chachaGenPolyKey. Generates the one-time Poly1305 key by running ChaCha20 with counter = 0 (RFC 8439 §2.6).
hchacha20. HChaCha20 (draft-irtf-cfrg-xchacha §2.1). Same as the block function but (a) uses the first 16 bytes of the XChaCha20 nonce as words 12-15 instead of counter + nonce, and (b) outputs words 0-3 and 12-15 of the post-round state WITHOUT adding back the initial state.
poly1305.ts implements the following from RFC 8439 §2.5:
Radix-2^26 representation. Both r and h are stored as 5 limbs of up to 26 bits each, one limb per u64 word. This allows the schoolbook multiplication to use u64 arithmetic without overflow: each intermediate product is at most about 5 * 2^52, and the sum of five such products stays well within u64 range.
absorbBlock (internal). Absorbs one 16-byte block into the accumulator. Decomposes the block into 5 radix-2^26 limbs, adds them to h, multiplies by r, and performs a carry chain to normalize. Because p = 2^130 - 5, 2^130 is congruent to 5 modulo p, so any product term h[i] * r[j] with i + j >= 5 wraps around and is computed as h[i] * (5 * r[j]).
polyInit. Clamps r per RFC 8439 §2.5 (certain bits must be zero for security), decomposes r into limb form, precomputes 5*r[1..4], copies s, and zeroes the accumulator.
polyUpdate. Feeds message bytes through the accumulator. Handles partial blocks by buffering at POLY_BUF_OFFSET. Full 16-byte blocks set the 2^128 high bit before absorption.
polyFinal. Pads and absorbs any remaining partial block (with 0x01 byte and hibit = 0), normalizes h via a full carry chain, performs the constant-time conditional reduction mod p, and adds the s pad to produce the final 16-byte tag.
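The wrap-around identity used by absorbBlock can be written out explicitly. With h and r expressed in radix 2^26, every product term whose limb indices sum to 5 or more carries a factor of 2^130, which is congruent to 5 modulo p. This is the standard radix-2^26 Poly1305 derivation, shown here for reference; it is not copied from the module's source.

```math
\begin{aligned}
2^{130} &\equiv 5 \pmod{p}, \qquad p = 2^{130} - 5 \\
h_i r_j \, 2^{26(i+j)} &\equiv h_i \,(5 r_j)\, 2^{26(i+j-5)} \pmod{p} \qquad (i + j \ge 5) \\
d_0 &= h_0 r_0 + h_1 (5 r_4) + h_2 (5 r_3) + h_3 (5 r_2) + h_4 (5 r_1) \\
d_1 &= h_0 r_1 + h_1 r_0 + h_2 (5 r_4) + h_3 (5 r_3) + h_4 (5 r_2) \\
d_2 &= h_0 r_2 + h_1 r_1 + h_2 r_0 + h_3 (5 r_4) + h_4 (5 r_3) \\
d_3 &= h_0 r_3 + h_1 r_2 + h_2 r_1 + h_3 r_0 + h_4 (5 r_4) \\
d_4 &= h_0 r_4 + h_1 r_3 + h_2 r_2 + h_3 r_1 + h_4 r_0
\end{aligned}
```

Only the multiples of r_1 through r_4 appear in wrapped terms, which matches the four precomputed values stored at POLY_RS_OFFSET.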
chacha20_simd_4x.ts implements chachaEncryptChunk_simd and chachaDecryptChunk_simd using
WebAssembly v128 SIMD operations. It processes four independent ChaCha20 blocks
simultaneously: each v128 register lane holds word w from a different block,
giving 4× useful work per instruction.
rotl32_4x / qr_4x. Scalar helpers for the tail path, renamed to avoid a namespace conflict with the private rotl32 and qr functions in chacha20.ts (same AS compilation unit).
computeBlock_scalar. Scalar block function duplicated here to service the tail path (inputs not a multiple of 256 bytes). Avoids adding an exportable symbol to chacha20.ts.
block4x(ctr: u32). Core SIMD routine. Operates entirely in 16 v128 locals (enabling JIT register allocation). Reconstructs initial-state for the add-back step from CHACHA_STATE_OFFSET memory + the ctr parameter, avoiding the need to save 16 additional v128 locals.
Note on negative result: A prior attempt at intra-block SIMD (processing one block via v128 with shuffles) measured 0.60–0.72× scalar across 4 attempts. Root causes: (a) no native i32x4.rotl: each rotation costs 3 v128 ops vs 1 scalar (640 rotations per block = 3× instruction count), (b) 6 shuffles per double round for the diagonal, (c) V8/JSC already register-promotes fixed-address loads for the scalar path. See docs/chacha_simd_bench.md for full analysis. Inter-block parallelism avoids all three issues.
wipe.ts contains a single exported function, wipeBuffers(), that calls memory.fill(offset, 0, size)
for every buffer region. It covers key material, nonces, state, intermediate
computations, chunk buffers, XChaCha20 subkey, and the SIMD work buffer. Called
by the TypeScript dispose() method.
buffers.ts
^ ^ ^ ^
| | | |
chacha20.ts poly1305.ts wipe.ts chacha20_simd_4x.ts
| | | |
+-----------+-----------+-----------+
|
index.ts (re-exports all)
chacha20.ts, poly1305.ts, and chacha20_simd_4x.ts are independent of each other. They all import only from buffers.ts. The AEAD composition (calling
chachaGenPolyKey then feeding ciphertext through polyUpdate) happens in the
TypeScript layer, not in the WASM module. wipe.ts imports buffer offsets from
buffers.ts and has no dependency on any algorithm implementation.
| Function | Condition | Behavior |
|---|---|---|
| chachaEncryptChunk(len) | len <= 0 or len > 65536 | Returns -1 |
| chachaDecryptChunk(len) | Same as above (alias) | Returns -1 |
| chachaEncryptChunk_simd(len) | len <= 0 or len > 65536 | Returns -1 |
| chachaDecryptChunk_simd(len) | Same as above (alias) | Returns -1 |
| polyUpdate(len) | len <= 0 | No-op (returns immediately) |
| All other functions | | No error returns. Preconditions are the caller's responsibility. |
Implicit constraints enforced by the TypeScript layer:
- Keys must be exactly 32 bytes (256-bit).
- ChaCha20 nonces must be exactly 12 bytes (96-bit).
- XChaCha20 nonces must be exactly 24 bytes (192-bit).
- polyUpdate staging buffer is 64 bytes; the TS wrapper must not write more than 64 bytes per call.
- The block counter is u32; a single (key, nonce) pair supports at most 2^32 blocks = 256 GiB of keystream. Exceeding this wraps the counter to 0, which produces keystream reuse.
| Document | Description |
|---|---|
| index | Project Documentation index |
| architecture | Architecture overview, module relationships, buffer layouts, and build pipeline |
| chacha20 | TypeScript wrapper classes (ChaCha20, Poly1305, ChaCha20Poly1305, XChaCha20Poly1305, XChaCha20Cipher) |
| asm_serpent | Alternative symmetric cipher (Serpent WASM module) |
| chacha_audit.md | XChaCha20-Poly1305 implementation audit |