chacha_reference - xero/leviathan-crypto GitHub Wiki
XChaCha20-Poly1305 Algorithm Reference
Algorithm specification and known-attack survey for XChaCha20-Poly1305 AEAD, covering the ChaCha20 block function, Poly1305 MAC, HChaCha20 subkey derivation, ARX design properties, and cryptanalytic results.
Sources:
- RFC 8439 (Nir & Langley, 2018),
- draft-irtf-cfrg-xchacha-03 (Arciszewski, 2020),
- Bernstein's original ChaCha and Poly1305 papers (2005, 2008).
Table of Contents
1. Algorithm Overview
Origins and Design Goals
ChaCha20 is a stream cipher designed by Daniel J. Bernstein (University of Illinois at Chicago) in 2008, derived from his earlier Salsa20 cipher (2005, eSTREAM finalist). Poly1305 is a one-time message authentication code also designed by Bernstein, introduced in his 2005 paper "The Poly1305-AES message-authentication code." The two were combined into the AEAD construction ChaCha20-Poly1305 by Adam Langley (Google) around 2013 and subsequently standardized as RFC 8439 (June 2018), obsoleting RFC 7539.
Bernstein's stated design priorities for ChaCha20:
Software speed on constrained hardware. AES requires either dedicated hardware (AES-NI) for competitive performance, or is slow and vulnerable on platforms without it. ChaCha20 uses only 32-bit addition, XOR, and rotation — operations that are uniformly fast across ARM, MIPS, embedded, and general-purpose CPUs.
Timing side-channel resistance by construction. The ARX (Add-Rotate-XOR) design contains no data-dependent memory accesses, no table lookups indexed by secret data, and no data-dependent branches. This is a structural property of the algorithm, not an implementation technique.
Simplicity and auditability. The full cipher fits in a handful of lines of readable code.
The X prefix in XChaCha20 denotes the extended nonce variant: nonce extended from 96 bits to 192 bits via the HChaCha20 subkey derivation function. This is the variant leviathan uses, as it enables safe random nonce generation per message.
Historical Context: The Standby Cipher Problem
RFC 8439's introduction documents the motivation explicitly. AES became the near-universal encryption standard, with no widely-deployed alternative. If future cryptanalysis weakens AES, or if implementation flaws are discovered in AES-NI paths, there is no practical fallback. ChaCha20-Poly1305 was adopted by major TLS stacks (OpenSSL, BoringSSL), operating systems, and protocols (TLS 1.3, WireGuard, SSH) specifically to address this single-point-of-failure concern.
Design Lineage: Salsa20 → ChaCha20
ChaCha20 is a variant of Salsa20 with improved diffusion per round. The key structural change is the quarter round: ChaCha's mixing differs from Salsa20's, achieving better avalanche in fewer operations. Both ciphers share the same ARX core, 20-round structure, and 512-bit block size. The ChaCha family is sometimes described as "Salsa20 with better mixing."
ChaCha20-Poly1305 vs AES-GCM
| Property | XChaCha20-Poly1305 | AES-256-GCM |
|---|---|---|
| Key size | 256 bits | 128 or 256 bits |
| Nonce size | 192 bits (XChaCha) | 96 bits |
| Tag size | 128 bits | 128 bits |
| Nonce collision risk (random) | ~2⁹⁶ msgs before 50% collision | ~2⁴⁸ msgs before 50% collision (birthday) |
| Hardware acceleration required for safety | No | Yes (timing side-channels without AES-NI) |
| Software performance (no HW accel) | Fast (ARX) | Slow |
| Software performance (with HW accel) | Fast | Faster |
| Nonce misuse consequence | Keystream reuse (plaintext recovery) | Keystream reuse + auth key leak |
The critical asymmetry is in the nonce misuse row. GCM's polynomial authenticator uses the nonce to derive the authentication key (H). Nonce reuse under GCM exposes H, allowing tag forgery for all past and future messages under that key — a catastrophic failure mode beyond mere confidentiality loss. ChaCha20-Poly1305 generates a fresh one-time Poly1305 key per message using the cipher itself; nonce reuse still recovers plaintexts (bad), but does not break authentication globally.
2. Algorithm Specification
2.1 Parameters
| Parameter | Value |
|---|---|
| Key | 256 bits |
| Nonce (ChaCha20-Poly1305) | 96 bits |
| Nonce (XChaCha20-Poly1305) | 192 bits |
| Block size | 512 bits (64 bytes) |
| Counter | 32 bits, starts at 1 for AEAD (0 reserved for Poly1305 key gen) |
| Tag | 128 bits |
| Max plaintext | (2³² − 1) × 64 bytes ≈ 256 GiB |
2.2 ChaCha State
The ChaCha20 state is a 4×4 matrix of 32-bit little-endian words, laid out as:
0 1 2 3 "expa" "nd 3" "2-by" "te k" ← constants (row 0)
4 5 6 7 key[0] key[1] key[2] key[3] ← key words 0–3 (row 1)
8 9 10 11 key[4] key[5] key[6] key[7] ← key words 4–7 (row 2)
12 13 14 15 counter nonce0 nonce1 nonce2 ← counter + nonce (row 3)
The four constants are ASCII bytes of "expand 32-byte k" — the "nothing up my sleeve" string from Bernstein's original spec. As 32-bit little-endian words:
0x61707865 0x3320646e 0x79622d32 0x6b206574
The 256-bit key fills words 4–11 as eight 32-bit little-endian words. The 32-bit counter fills word 12 (initialized to 1 in AEAD mode). The 96-bit nonce fills words 13–15 as three 32-bit little-endian words.
2.3 Quarter Round
The quarter round (QR) is the atomic operation of ChaCha20. It takes four 32-bit words (a, b, c, d) and produces four new words using only addition mod 2³², XOR, and left rotation:
a += b; d ^= a; d <<<= 16;
c += d; b ^= c; b <<<= 12;
a += b; d ^= a; d <<<= 8;
c += d; b ^= c; b <<<= 7;
Where + is addition mod 2³², ^ is XOR, and <<< is 32-bit left rotation.
All four rotation constants (16, 12, 8, 7) were chosen by Bernstein to maximize diffusion: each output bit depends on every input bit as quickly as possible. The 16 and 8 rotations are particularly efficient on architectures with BSWAP/REV instructions since they reduce to byte swaps.
2.4 Block Function
The ChaCha20 block function applies 20 rounds to the 4×4 state, alternating between column rounds and diagonal rounds. Each round consists of four quarter rounds applied in parallel to four (a,b,c,d) groups.
Column round — four vertical columns:
QR(0, 4, 8, 12)
QR(1, 5, 9, 13)
QR(2, 6, 10, 14)
QR(3, 7, 11, 15)
Diagonal round — four diagonals (wrapping):
QR(0, 5, 10, 15)
QR(1, 6, 11, 12)
QR(2, 7, 8, 13)
QR(3, 4, 9, 14)
20 rounds = 10 column rounds interleaved with 10 diagonal rounds. After 20 rounds, the output state is added word-by-word (mod 2³²) to the initial state, producing the 64-byte keystream block:
output[i] = working_state[i] + initial_state[i] for i = 0..15
The final addition is critical: it prevents the function from being invertible (running the rounds backwards with only the output would recover the input without it), and it provides the "add" in the ARX chain that links the permutation output back to the key material.
2.5 Stream Encryption
ChaCha20 encryption is straightforward counter-mode stream cipher operation:
for each 64-byte block i (starting at counter = 1 in AEAD mode):
keystream_block = chacha20_block(key, counter + i, nonce)
ciphertext[i*64 : (i+1)*64] = plaintext[i*64 : (i+1)*64] ^ keystream_block
For a final partial block, only the needed bytes of the keystream block are used; the remainder is discarded.
Decryption is identical — XOR with the same keystream regenerated from the same (key, nonce, counter).
2.6 Poly1305
Poly1305 is a one-time MAC operating over GF(2¹³⁰ − 5). It takes a 32-byte one-time key and a message of arbitrary length, and produces a 16-byte tag.
Key Structure
The 32-byte one-time key is split into two 128-bit halves:
- r (bytes 0–15): the polynomial key, clamped before use
- s (bytes 16–31): the final addition constant
Clamping r: certain bits of r are forced to zero before use, ensuring the key lies in a subgroup that prevents certain algebraic attacks. The clamping rule clears bits at specific positions via byte-level masks:
r[3] &= 0x0f
r[4] &= 0xfc
r[7] &= 0x0f
r[8] &= 0xfc
r[11] &= 0x0f
r[12] &= 0xfc
r[15] &= 0x0f
All remaining bytes are unmasked (effectively & 0xff), which is correct per RFC 8439 §2.5.
MAC Algorithm
The message is processed in 16-byte chunks. Each chunk is treated as a little-endian integer with a high bit appended beyond the top byte (making the value at most 17 bytes / 136 bits). The accumulator h is maintained over GF(2¹³⁰ − 5):
h = 0
for each 16-byte block m_i:
n = little_endian_to_int(m_i) | (1 << (8 * len(m_i)))
h = (h + n) * r mod (2¹³⁰ − 5)
tag = (h + s) mod 2¹²⁸
The final partial block (if the message length is not a multiple of 16) is padded on the right with zeros, and the 1 bit is set at position 8 * actual_len rather than 8 * 16. The h + s final step serializes h back to 128 bits by discarding the upper 2 bits.
The prime 2¹³⁰ − 5 was chosen because it allows efficient arithmetic: modular reduction can be done using the identity 2¹³⁰ ≡ 5 (mod 2¹³⁰ − 5), so carrying out of the 130-bit accumulator multiplies the overflow by 5 and adds back into the lower 130 bits — a cheap operation.
2.7 Poly1305 Key Generation
In the AEAD construction, the one-time Poly1305 key is derived from the encryption key and nonce using ChaCha20 itself, rather than being provided externally. This ensures a unique MAC key per (key, nonce) pair without requiring a separate key distribution mechanism.
The derivation uses the ChaCha20 block function with counter = 0:
poly1305_key_block = chacha20_block(key, counter=0, nonce)
r = poly1305_key_block[0:16] (first 128 bits, clamped)
s = poly1305_key_block[16:32] (next 128 bits)
The remaining 32 bytes of the 64-byte block output are discarded. Counter 0 is consumed by key generation; encryption begins at counter 1.
2.8 AEAD Construction (ChaCha20-Poly1305)
The full AEAD construction authenticates additional data (AAD) alongside the ciphertext, producing a ciphertext + tag that is unforgeable without the key.
Encryption:
Input: key (32B), nonce (12B), plaintext, aad
Output: ciphertext, tag
1. poly1305_key = chacha20_block(key, counter=0, nonce)[0:32]
2. ciphertext = chacha20_encrypt(key, counter=1, nonce, plaintext)
3. mac_data = pad16(aad)
‖ pad16(ciphertext)
‖ le64(len(aad))
‖ le64(len(ciphertext))
4. tag = poly1305(poly1305_key, mac_data)
Where pad16(x) pads x with zero bytes to the next 16-byte boundary (no-op if already aligned), and le64(n) is the 8-byte little-endian encoding of integer n.
Decryption:
Input: key (32B), nonce (12B), ciphertext, tag, aad
Output: plaintext, or authentication failure
1. poly1305_key = chacha20_block(key, counter=0, nonce)[0:32]
2. Reconstruct mac_data from aad and ciphertext (same as encryption step 3)
3. expected_tag = poly1305(poly1305_key, mac_data)
4. if constant_time_compare(tag, expected_tag) fails: return FAIL
5. plaintext = chacha20_encrypt(key, counter=1, nonce, ciphertext)
Tag verification must be performed before decryption begins in implementations that allow partial output. The tag is checked before plaintext is returned to the caller in any case.
[!IMPORTANT] The MAC data layout (
pad16(aad) ‖ pad16(ciphertext) ‖ le64(len(aad)) ‖ le64(len(ciphertext))) is a specific wire format. The length fields are necessary to prevent ambiguity: without them,aad="AA", ct="BBBB"andaad="AABB", ct="BB"would produce identical mac_data.
3. XChaCha20-Poly1305
3.1 The Nonce Problem
Standard ChaCha20-Poly1305 uses a 96-bit nonce. With random nonce generation and the birthday bound, a (key, nonce) collision becomes probable after approximately 2⁴⁸ messages — a real concern for high-volume or long-lived session keys. The typical mitigation is a monotonic counter nonce, but this requires stateful nonce management, which complicates distributed or stateless systems.
XChaCha20 solves this by extending the nonce to 192 bits. With a 192-bit random nonce, the birthday collision threshold rises to 2⁹⁶ messages (roughly 7.3 × 10²⁸) — effectively infinite for any practical purpose. Applications can generate nonces at random per message and discard all nonce state.
3.2 HChaCha20
HChaCha20 is the subkey derivation function at the heart of XChaCha20. It is derived from HSalsa20 (Bernstein, 2011), adapted for the ChaCha state layout.
HChaCha20 initializes a ChaCha state identically to ChaCha20 but with the 32-bit counter position replaced by the first 32 bits of the 128-bit input nonce (the block counter slot is consumed by nonce material, and there is no counter):
ChaCha20 state layout:
[constants | key (8 words) | counter | nonce (3 words)]
HChaCha20 state layout:
[constants | key (8 words) | nonce[0..3] (4 words)]
The full 20-round ChaCha permutation is then applied. Rather than adding back the initial state at the end (as ChaCha20's block function does), HChaCha20 extracts only the first and last rows of the post-round state as the output subkey:
output = state[0..3] ‖ state[12..15] (8 × 32-bit words = 256 bits)
The selection of these specific indices (0–3 and 12–15) is not arbitrary — it mirrors the approach from HSalsa20, where the corresponding indices were chosen to make a security proof work. The proof (from the XSalsa20 paper, Bernstein 2011) shows that extracting output from a public computation at positions corresponding to the constants and nonce slots produces a PRF-secure subkey, given that the underlying permutation is secure. The same argument carries over directly to HChaCha20:
HSalsa20 output indices: 0, 5, 10, 15, 6, 7, 8, 9
HChaCha20 output indices: 0, 1, 2, 3, 12, 13, 14, 15
The middle rows (state[4..11], the key material) are deliberately discarded from the output, ensuring that even with knowledge of the subkey, the original key cannot be recovered.
HChaCha20 state initialization:
[0..3] "expand 32-byte k" constants
[4..11] 256-bit key (8 little-endian u32 words)
[12..15] first 128 bits (16 bytes) of the 192-bit XChaCha nonce
No counter. The remaining 64 bits of the 192-bit nonce are not used here — they are passed directly to ChaCha20 as described in §3.3.
3.3 XChaCha20 Nonce Split
The 192-bit XChaCha nonce is split:
- bytes 0–15 (128 bits) → HChaCha20 input nonce, used to derive the subkey
- bytes 16–23 (64 bits) → ChaCha20 nonce suffix, placed at bytes 4–11 of a zeroed 12-byte nonce:
chacha20_nonce = 0x00000000 ‖ xchacha_nonce[16:24]
The four zero bytes in the nonce prefix ensure the counter and nonce fields don't alias; the actual counter starts at 1 for AEAD encryption.
3.4 AEAD_XChaCha20_Poly1305
The full XChaCha20-Poly1305 AEAD construction:
Input: key (32B), nonce (24B), plaintext, aad
Output: ciphertext, tag
1. subkey = HChaCha20(key, nonce[0:16])
2. chacha_nonce = 0x00000000 ‖ nonce[16:24] (12 bytes)
3. (ciphertext, tag) = AEAD_ChaCha20_Poly1305(subkey, chacha_nonce, plaintext, aad)
Decryption mirrors this exactly, substituting the AEAD decrypt in step 3.
The construction is composable: once the subkey is derived, the remainder is a standard ChaCha20-Poly1305 AEAD call. Implementations that already have ChaCha20-Poly1305 need only add HChaCha20 to gain the extended nonce variant.
4. ARX Design and Implementation Properties
4.1 The ARX Primitive
ChaCha20 uses exclusively three 32-bit operations:
- A: addition mod 2³²
- R: left rotation by a fixed constant
- X: XOR
There are no S-boxes, no lookup tables, no data-dependent memory accesses, and no branches. Every operation executes unconditionally on register operands. The cipher has no notion of "lookup" at any level.
This is a structural difference from block ciphers like AES or Serpent. AES's SubBytes step — even with AES-NI — requires a table operation in software fallback paths. Serpent's bitslice S-boxes eliminate this via boolean circuits, but at the cost of significant code complexity. ChaCha20 achieves timing-channel safety through the simplicity of the operations themselves.
4.2 Side-Channel Properties
Cache-timing: None possible by construction. There are no memory accesses indexed by secret data. The state lives entirely in registers.
Branch-timing: None possible by construction. There are no data-dependent branches.
Power analysis (DPA): The ARX structure mixes key material into every word in every round, providing natural avalanche that complicates DPA correlation. However, dedicated DPA mitigation (masking, blinding) is outside the algorithm's scope and depends on the deployment platform.
Timing of Poly1305 verification: The constant-time comparison of the authentication tag is an implementation requirement, not an algorithm property. A naive memcmp on the tag allows timing oracle attacks against the MAC. Implementations must use a constant-time comparison (e.g. XOR-accumulate over all bytes with no early return).
4.3 Performance
ChaCha20's 512-bit block size processes 64 bytes per block function call, versus AES's 128-bit (16-byte) block size. For equivalent plaintext volume, ChaCha20 makes fewer block function calls.
The quarter round's four operations are highly parallelizable: on modern superscalar CPUs, multiple QRs can execute in parallel since the four column-round QRs operate on independent word groups. SIMD implementations (leviathan's 4-wide inter-block parallelism using v128) vectorize across multiple blocks simultaneously, processing four blocks in parallel.
Without hardware acceleration, ChaCha20 typically outperforms AES. With AES-NI, AES-GCM is faster on hardware that has it. On ARM (common in mobile and embedded), ChaCha20 is frequently faster even against hardware AES, due to ARM's AES instructions having higher latency than the NEON-vectorized ChaCha20 path.
5. Known Attacks
5.1 Differential Cryptanalysis
Applicability: Reduced-round only.
Standard differential cryptanalysis does not extend to 20-round ChaCha20. The best differential distinguishers in the literature reach at most 7 rounds with impractical data and time requirements. The 13-round margin beyond the best distinguisher provides substantial security.
5.2 Linear Cryptanalysis
Applicability: Not directly applicable.
Stream ciphers do not have the block cipher structure that classical linear cryptanalysis targets. Correlation attacks on the keystream (the analogue for stream ciphers) have not produced results beyond heavily reduced-round variants.
5.3 Differential-Linear Attacks (Best Known)
Best published result: Shi et al. (2012) — 7-round distinguisher requiring 2²³ chosen plaintexts. Choudhuri and Maitra (2016) achieved marginal improvements on 7-round differential-linear attacks.
No distinguisher beyond 7 rounds of ChaCha20 is published as of this writing. Full 20-round ChaCha20 has a 13-round security margin against the best known attack.
| Attack | Rounds | Data | Practical? |
|---|---|---|---|
| Differential-linear distinguisher (Shi 2012) | 7 | 2²³ CP | No |
| Differential-linear key recovery (Choudhuri & Maitra 2016) | 7 | — | No |
All results remain far from the full 20 rounds.
5.4 Rotational Cryptanalysis
Applicability: Theoretical, reduced rounds only.
Rotational cryptanalysis (Khovratovich & Nikolić, 2010) attempts to exploit the rotation constants in ARX ciphers. Applied to ChaCha20, no results beyond heavily reduced rounds have been published.
5.5 Polynomial Attacks on Poly1305
Applicability: Implementation-specific; requires nonce reuse.
The one-time nature of the Poly1305 key (fresh derivation per message) is the primary security assumption. If the Poly1305 key is ever reused — due to nonce reuse in the ChaCha20 key derivation step — two messages share the same (r, s), allowing an algebraic attack to recover r from the two tag/ciphertext pairs and then forge tags.
The attack: given tag₁ = h(m₁) * r + s and tag₂ = h(m₂) * r + s, compute tag₁ − tag₂ = (h(m₁) − h(m₂)) * r, and solve for r. This directly recovers the polynomial key and enables universal forgery.
Mitigation: XChaCha20's 192-bit random nonce makes nonce reuse negligible in probability. The attack is only relevant if nonce generation is broken or deterministic nonce management fails.
5.6 Nonce Misuse
Consequence: Keystream reuse + Poly1305 key exposure.
Nonce reuse (encrypting two messages with the same key and nonce) under ChaCha20-Poly1305 produces two ciphertexts XOR'd with identical keystreams. Crib-dragging (known-plaintext XOR) recovers both plaintexts. Additionally, as described in §5.5, the shared Poly1305 key is recoverable, enabling authentication bypass.
This is a misuse attack, not a cryptanalytic one; the algorithm provides no nonce-misuse resistance by design. XChaCha20's extended nonce is the correct mitigation.
Birthday-bound analysis for nonce collision under XChaCha20:
| Nonce size | 50% collision after | At 10⁹ msgs/sec, time to 50% |
|---|---|---|
| 96-bit (ChaCha20) | ~2⁴⁸ msgs | ~9 days |
| 192-bit (XChaCha20) | ~2⁹⁶ msgs | ~2.5 × 10¹⁸ years |
5.7 Forgery Bound
The forgery probability for Poly1305 is bounded by:
P_forgery ≤ ⌈l/16⌉ / 2¹⁰⁶
where l is the message length in bytes.
| Message size | Blocks | Forgery probability |
|---|---|---|
| 64 bytes | 4 | 4 / 2¹⁰⁶ ≈ 2⁻¹⁰⁴ |
| 1 KB | 64 | 2⁶ / 2¹⁰⁶ = 2⁻¹⁰⁰ |
| 64 KB | 4,096 | 2¹² / 2¹⁰⁶ = 2⁻⁹⁴ |
| 1 MB | 65,536 | 2¹⁶ / 2¹⁰⁶ = 2⁻⁹⁰ |
5.8 Overall Security Assessment
XChaCha20-Poly1305 security posture as of 2026:
| Attack class | Best result | Threat level |
|---|---|---|
| Differential cryptanalysis | 7 rounds (infeasible data) | None |
| Linear / correlation | No meaningful results | None |
| Differential-linear (best) | 7 rounds, 13-round margin | None |
| Rotational | Reduced rounds only | None |
| Poly1305 algebraic | Requires nonce reuse | None (with XChaCha nonce) |
| Nonce misuse | Keystream reuse + MAC break | Mitigated by 192-bit nonce |
| Side-channel (cache/branch) | None by construction | None at algorithm level |
Security margin: The best published attacks reach 7 of 20 rounds — a 13-round margin. ChaCha20 has been deployed at enormous scale (TLS 1.3, WireGuard, Signal, Android full-disk encryption) with no known practical weaknesses in the full-round construction.
Cross-References
| Document | Description |
|---|---|
| index | Project Documentation index |
| chacha_audit | security audit results: algorithm correctness verification and side-channel analysis |
| chacha20 | TypeScript API for XChaCha20-Poly1305 (ChaCha, Seal, raw modes) |
| asm_chacha | WASM implementation: quarter round, block function, Poly1305, AEAD in AssemblyScript |