serpent_reference - xero/leviathan-crypto GitHub Wiki
Serpent-256 Algorithm Reference
Algorithm specification, bitslice implementation details, and known attack analysis for Serpent-256. Based on the Anderson/Biham/Knudsen AES submission papers and Ross Anderson's reference implementation.
Table of Contents
Algorithm Overview
Origins and Design Goals
Serpent is a symmetric block cipher designed by Ross Anderson (University of Cambridge), Eli Biham (Technion), and Lars Knudsen (University of Bergen). It was submitted to the NIST Advanced Encryption Standard (AES) competition in 1998 as a candidate to replace DES.
The designers' explicit priority was maximum security margin over performance. While Rijndael (the eventual AES winner) was designed to be fast on a wide range of hardware, Serpent prioritized security through a conservative design, using 32 rounds rather than Rijndael's 10/12/14. The designers believed Serpent was the more secure submission and that NIST selected Rijndael primarily for its superior performance.
AES Competition Placement
Serpent was selected as one of the five AES finalists, alongside Rijndael, Twofish, RC6, and MARS. In the final NIST vote it placed second, very close to Rijndael. Many cryptographers and security practitioners who prioritize security margin over performance prefer Serpent to AES for that reason.
Bitslice Design
Serpent was designed around the bitslice implementation technique, introduced by Eli Biham for DES. In a bitslice implementation, multiple blocks are encrypted in parallel by treating each bit position across all blocks as a single 32-bit word. The S-boxes are then implemented as Boolean circuits (AND, OR, XOR) operating on these wide words, rather than as table lookups.
This choice was deliberate and has two key consequences:
-
Software performance: A bitslice implementation processes 32 blocks simultaneously using integer registers, achieving throughput that exceeds a naive implementation despite the more complex S-box computation.
-
Timing side-channel resistance: The S-boxes use only data-independent Boolean operations with no memory accesses indexed by secret data, so there are no cache-timing side channels in the S-box layer. The original paper explicitly claims timing attacks are "not applicable" to Serpent, though this depends entirely on using the bitslice implementation. A table-lookup implementation would be vulnerable.
Serpent vs AES
| Property | Serpent-256 | AES-256 |
|---|---|---|
| Rounds | 32 | 14 |
| Best known cryptanalytic attack | 12 rounds (theoretical) | 9 rounds (theoretical) |
| Security margin (rounds) | 20 | 5 |
| Relative software speed | ~3-4x slower | Faster |
| S-box timing channel risk | None (bitslice) | Yes (if not using AES-NI) |
Serpent's 20-round security margin versus AES's 5-round margin is the central argument for choosing it in high-security applications where performance is less critical.
Algorithm Specification
Parameters
| Parameter | Value |
|---|---|
| Block size | 128 bits |
| Key sizes | 128, 192, or 256 bits |
| Number of rounds | 32 |
| Subkeys | 33 (K₀ through K₃₂), each 128 bits |
| S-boxes | 8 distinct 4-bit to 4-bit permutations |
Key Schedule
Key Padding
All key sizes are padded to 256 bits before processing. The padding rule appends a
single 1 bit immediately after the last key bit, then zeros to fill to 256 bits:
- 128-bit key: bit 128 set to 1, bits 129-255 set to 0
- 192-bit key: bit 192 set to 1, bits 193-255 set to 0
- 256-bit key: no padding required
The padded key is treated as eight 32-bit words: w₋₈, w₋₇, ..., w₋₁.
Prekey Expansion
132 prekey words w₀ through w₁₃₁ are generated by the following affine recurrence:
w_i = (w_{i-8} ⊕ w_{i-5} ⊕ w_{i-3} ⊕ w_{i-1} ⊕ φ ⊕ i) <<< 11
Where:
<<<denotes 32-bit left rotationφ = 0x9E3779B9(the fractional part of the golden ratio: (√5−1)/2 × 2³²)iis the word index (0 through 131)- The XOR with
iensures each word uses a unique constant, preventing slide attacks
The constant 0x9E3779B9 is the same constant used in TEA, XTEA, and many other ciphers for this purpose. The underlying recurrence polynomial is x⁸ + x⁷ + x⁵ + x³ + 1, which is primitive over GF(2), ensuring good mixing.
Round Key Derivation
The 132 prekey words are grouped into 33 groups of 4 and passed through the S-boxes (in bitslice mode) to produce the 33 subkeys K₀ through K₃₂:
{k_{4n}, k_{4n+1}, k_{4n+2}, k_{4n+3}} = S_{(3-n) mod 8}(w_{4n}, w_{4n+1}, w_{4n+2}, w_{4n+3})
For n = 0, 1, ..., 32. The S-box applied at group n is S_{(3−n) mod 8}:
| Group n | S-box | Subkey |
|---|---|---|
| 0 | S₃ | K₀ |
| 1 | S₂ | K₁ |
| 2 | S₁ | K₂ |
| 3 | S₀ | K₃ |
| 4 | S₇ | K₄ |
| 5 | S₆ | K₅ |
| 6 | S₅ | K₆ |
| 7 | S₄ | K₇ |
| 8 | S₃ | K₈ |
| ... | (repeats with period 8) | ... |
| 32 | S₃ | K₃₂ |
Each group of four k-words is concatenated to form a 128-bit subkey: K_n = k_{4n} ‖ k_{4n+1} ‖ k_{4n+2} ‖ k_{4n+3}.
S-Boxes
Serpent uses 8 distinct 4-bit to 4-bit substitution boxes (S₀ through S₇). Each S-box maps a 4-bit input to a 4-bit output. A given S-box is used in every 8th round; round i uses S_{i mod 8}.
The S-boxes were designed using DES S-box matrix rows as a starting point, shuffled until the following cryptographic criteria were satisfied:
- Differential characteristic probability ≤ 1/4 (for any single S-box)
- No one-bit input difference ever produces a one-bit output difference
- Linear characteristic probability in the range 1/2 ± 1/4
- Single-bit linear relation probability in the range 1/2 ± 1/8
- Nonlinear order of each output bit = 3 (maximum possible for a 4-bit function)
The complete S-box tables (source: serpent-tables.h from Ross Anderson's reference
implementation):
S₀: { 3, 8,15, 1,10, 6, 5,11,14,13, 4, 2, 7, 0, 9,12 }
S₁: {15,12, 2, 7, 9, 0, 5,10, 1,11,14, 8, 6,13, 3, 4 }
S₂: { 8, 6, 7, 9, 3,12,10,15,13, 1,14, 4, 0,11, 5, 2 }
S₃: { 0,15,11, 8,12, 9, 6, 3,13, 1, 2, 4,10, 7, 5,14 }
S₄: { 1,15, 8, 3,12, 0,11, 6, 2, 5, 4,10, 9,14, 7,13 }
S₅: {15, 5, 2,11, 4,10, 9,12, 0, 3,14, 8,13, 6, 7, 1 }
S₆: { 7, 2,12, 5, 8, 4, 6,11,14, 9, 1,15,13, 3,10, 0 }
S₇: { 1,13,15, 0,14, 8, 2,11, 7, 4,12,10, 9, 3, 5, 6 }
Read each table as: S_n[input] = output. For example, S₀[0] = 3, S₀[1] = 8, S₀[15] = 12.
The inverse S-boxes (used during decryption):
S₀⁻¹: {13, 3,11, 0,10, 6, 5,12, 1,14, 4, 7,15, 9, 8, 2 }
S₁⁻¹: { 5, 8, 2,14,15, 6,12, 3,11, 4, 7, 9, 1,13,10, 0 }
S₂⁻¹: {12, 9,15, 4,11,14, 1, 2, 0, 3, 6,13, 5, 8,10, 7 }
S₃⁻¹: { 0, 9,10, 7,11,14, 6,13, 3, 5,12, 2, 4, 8,15, 1 }
S₄⁻¹: { 5, 0, 8, 3,10, 9, 7,14, 2,12,11, 6, 4,15,13, 1 }
S₅⁻¹: { 8,15, 2, 9, 4, 1,13,14,11, 6, 5, 3, 7,12,10, 0 }
S₆⁻¹: {15,10, 1,13, 5, 3, 6, 0, 4, 9,14, 7, 2,12, 8,11 }
S₇⁻¹: { 3, 0, 6,13, 9,14,15, 8, 5,12,11, 7,10, 1, 4, 2 }
Linear Transform
The linear transform (LT) is applied after each S-box layer in rounds 0-30. It takes four 32-bit words (X₀, X₁, X₂, X₃) and produces four new words. The operation uses only 32-bit rotations and XORs, with no table lookups and no data-dependent branches.
Forward linear transform (encryption):
1. X₀ = X₀ <<< 13
2. X₂ = X₂ <<< 3
3. X₁ = X₁ ⊕ X₀ ⊕ X₂
4. X₃ = X₃ ⊕ X₂ ⊕ (X₀ << 3)
5. X₁ = X₁ <<< 1
6. X₃ = X₃ <<< 7
7. X₀ = X₀ ⊕ X₁ ⊕ X₃
8. X₂ = X₂ ⊕ X₃ ⊕ (X₁ << 7)
9. X₀ = X₀ <<< 5
10. X₂ = X₂ <<< 22
Where <<< is 32-bit left rotation and << is logical left shift (no rotation).
Inverse linear transform (decryption):
The inverse undoes each step in reverse order, replacing each rotation by r with a rotation by (32 − r):
1. X₂ = X₂ <<< 10 (undoes step 10: 32 − 22 = 10)
2. X₀ = X₀ <<< 27 (undoes step 9: 32 − 5 = 27)
3. X₂ = X₂ ⊕ X₃ ⊕ (X₁ << 7) (undoes step 8)
4. X₀ = X₀ ⊕ X₁ ⊕ X₃ (undoes step 7)
5. X₃ = X₃ <<< 25 (undoes step 6: 32 − 7 = 25)
6. X₁ = X₁ <<< 31 (undoes step 5: 32 − 1 = 31)
7. X₃ = X₃ ⊕ X₂ ⊕ (X₀ << 3) (undoes step 4)
8. X₁ = X₁ ⊕ X₀ ⊕ X₂ (undoes step 3)
9. X₂ = X₂ <<< 29 (undoes step 2: 32 − 3 = 29)
10. X₀ = X₀ <<< 19 (undoes step 1: 32 − 13 = 19)
Round Function
Each of the 32 rounds (indexed 0 through 31) applies three operations in sequence:
Rounds 0-30:
- Key mixing: XOR the current 128-bit state with subkey Kᵢ
- S-box layer: Apply S_{i mod 8} to all 32 parallel 4-bit groups
- Linear transform: Apply the forward LT to the four 32-bit state words
Round 31 (final round):
- Key mixing: XOR the state with K₃₁
- S-box layer: Apply S_{31 mod 8} = S₇ to all groups
- (No linear transform) Final key mixing: XOR the state with K₃₂
The omission of the linear transform in the final round and its replacement with a second key addition is the standard SP-network construction that ensures decryption is a mirror image of encryption.
Encryption
Complete encryption sequence:
Input: 128-bit plaintext P, 256-bit key K
Output: 128-bit ciphertext C
1. Derive subkeys K₀..K₃₂ from K via the key schedule
2. B₀ = P
3. For i = 0 to 30:
Bᵢ₊₁ = LT(S_{i mod 8}(Bᵢ ⊕ Kᵢ))
4. C = S₇(B₃₁ ⊕ K₃₁) ⊕ K₃₂
In the bitslice description, steps 2-4 operate on the four 32-bit words representing the block. There are no IP or FP permutations in the bitslice form; they are absorbed into the key schedule.
Decryption
Decryption applies every operation in reverse order with inverse functions:
Input: 128-bit ciphertext C, 256-bit key K
Output: 128-bit plaintext P
1. Derive subkeys K₀..K₃₂ from K (same key schedule as encryption)
2. B₃₂ = C ⊕ K₃₂
3. B₃₁ = S₇⁻¹(B₃₂) ⊕ K₃₁
4. For i = 30 downto 0:
Bᵢ = S_{i mod 8}⁻¹(LT⁻¹(Bᵢ₊₁)) ⊕ Kᵢ
5. P = B₀
The key schedule is identical for encryption and decryption. Subkeys are derived once and used in reverse order during decryption.
Bitslice Implementation
The Bitslice Technique
A conventional block cipher implementation processes each block sequentially, passing one 128-bit value through the S-boxes one step at a time. The S-boxes are implemented as lookup tables indexed by small groups of bits.
A bitslice implementation processes n blocks in parallel by transposing the
data representation. Rather than storing one 128-bit block as a unit, each of the
128 bit positions across all n blocks is stored in a single n-bit register. For
a 32-bit architecture, n = 32, so 32 blocks are processed simultaneously.
Under this representation, the S-box is no longer a table lookup. Instead, it is implemented as a Boolean circuit of AND, OR, XOR, and NOT operations on the wide registers, computing all four output bits simultaneously for all 32 blocks.
How It Works for Serpent
A 128-bit Serpent block consists of four 32-bit words: (X₀, X₁, X₂, X₃). In the bitslice representation processing 32 blocks simultaneously, these four words become four 32-bit registers where each bit position corresponds to one of the 32 parallel blocks.
The S-box receives four one-bit inputs (one from each of X₀, X₁, X₂, X₃ at the same bit position) and produces four one-bit outputs. The Boolean circuit for each S-box is derived algebraically from the 4-bit to 4-bit lookup table; it computes the same mapping as the table but using only bitwise operations.
For example, S₀ maps a 4-bit input to a 4-bit output using the substitution
{3,8,15,1,10,6,5,11,14,13,4,2,7,0,9,12}. The Boolean circuit that computes
this mapping for all 32 blocks in parallel is a sequence of roughly 15-20 gate
operations (AND, OR, XOR, NOT) on the four input registers.
No Cache-Timing Channels
A table-lookup S-box implementation has a fundamental vulnerability: the memory address accessed depends on the secret data (plaintext XOR key). On a processor with a data cache, access time varies based on whether the cache line containing the table entry is warm or cold. An attacker who can measure timing or shares a cache with the victim can learn information about the key.
The Serpent authors note that the bitslice S-box "does not access different locations in memory for different plaintexts or keys." All 32 gate operations execute unconditionally and access no memory other than the four register operands. There is nothing for a cache-timing adversary to observe.
This property holds at the S-box level. The linear transform (rotations and XORs) is also data-independent by nature. The Serpent core produces no data-dependent memory accesses whatsoever.
Performance
The bitslice technique processes 32 blocks in the time it takes to encrypt one block conventionally. The throughput advantage is therefore up to 32x, modulo the overhead of transposing data in and out of bitslice format.
For bulk encryption (e.g. CTR or CBC mode over large data), the amortized throughput approaches this theoretical maximum. For single-block operations, the overhead of the format conversion dominates and throughput is lower.
The original paper quotes a bitslice implementation at ~11,000 clock cycles per block at 3.5 MHz (~40.7 Kbit/s), versus ~34,000 cycles for a compact table-lookup implementation (~12.8 Kbit/s). Counterintuitively, bitslice is ~3x faster in bulk throughput despite its more complex S-box logic, because the table-lookup version is not vectorized.
Bitslice vs Table-Lookup
| Property | Bitslice | Table-Lookup |
|---|---|---|
| S-box implementation | Boolean circuit (AND/OR/XOR) | Memory table indexed by secret |
| Cache-timing side channels | None | Present if cache is shared |
| Throughput (bulk) | High (32 blocks in parallel) | Low (1 block at a time) |
| Single-block latency | Higher (format conversion overhead) | Lower |
| Code size | Larger | Smaller |
| Data-independent execution | Yes | No |
Known Attacks
Differential Cryptanalysis
Source: Original AES submission paper (Anderson/Biham/Knudsen, 1998) Rounds reached: 24 (theoretical characteristic) Type: Classical differential cryptanalysis
The best differential characteristic found by the designers covers 24 rounds with probability ≤ 2⁻²³². Mounting an attack requires more plaintext pairs than exist in the 2¹²⁸ possible values, making the attack physically impossible. The designers applied the characteristic analysis at the single-S-box level (probability ≤ 1/4) and showed that the cascade through multiple rounds reduces the probability exponentially.
Practical threat to 32-round Serpent: None. The attack does not reach full 32 rounds and requires infeasible data quantities even for 24 rounds.
Linear Cryptanalysis
Source: Original AES submission paper (Anderson/Biham/Knudsen, 1998) Rounds reached: 28 (theoretical linear hull) Type: Classical linear cryptanalysis
The best linear hull found by the designers for 28 rounds has a bias of approximately 1/2 ± 2⁻¹²⁰. An attack exploiting this bias would require approximately 2²⁴⁰ known plaintexts, far more than can be encrypted with a single key in any realistic scenario.
Practical threat to 32-round Serpent: None. The attack does not reach full 32 rounds and the data requirement is astronomically large.
Multidimensional Linear Cryptanalysis
Source: "Improving the Algorithm 2 in Multidimensional Linear Cryptanalysis" (Nguyen, Wu, Wang, ACISP 2011) Rounds reached: 12 Type: Multidimensional linear cryptanalysis using FFT/Walsh-Hadamard Transform
This paper improved upon Matsui's Algorithm 2 by using the Fast Walsh-Hadamard Transform (FWHT) to evaluate all 2^k linear approximations simultaneously. Applied to Serpent, it represents the best published linear-family attack as of 2011.
| Configuration | Rounds | Data | Time | Memory |
|---|---|---|---|---|
| Method 1 | 11 | 2¹¹⁶ KP | 2¹⁰⁷·⁵ | 2¹⁰⁴ |
| Method 2 | 12 | 2¹¹⁸ KP | 2²²⁸·⁸ | 2²²⁸ |
| Method 1 | 12 | 2¹¹⁶ KP | 2²³⁷·⁵ | 2¹²¹ |
| Prior best (linear) | 11 | 2¹¹⁸ KP | 2¹¹⁴·³ | 2¹⁰⁸ |
| Prior best (diff-linear) | 12 | 2¹²³·⁵ CP | 2²⁴⁹·⁴ | 2¹²⁸·⁵ |
KP = Known Plaintexts; CP = Chosen Plaintexts.
The 12-round results, while faster than brute force for reduced-round Serpent, require time complexity of 2²²⁸-2²³⁷, which is effectively infeasible. More importantly, these attacks reach only 12 of the 32 rounds. Full Serpent has 20 more rounds of security margin beyond the reach of these attacks.
Practical threat to 32-round Serpent: None. These are mathematical results on reduced-round variants only.
Biclique Cryptanalysis
Source: "The First Biclique Cryptanalysis of Serpent-256" (de Carvalho & Kowada, unpublished competition entry) Rounds reached: 32 (full) Type: Biclique cryptanalysis (meet-in-the-middle variant)
This is the only published attack that applies to the full 32-round Serpent-256. Biclique cryptanalysis (the same technique applied to AES by Bogdanov et al. in 2011) partitions the cipher into two parts and uses a biclique structure to reduce the effective key search complexity.
Results for full 32-round Serpent-256:
| Attack | Time | Data (chosen CTs) | Memory |
|---|---|---|---|
| Dim-4 biclique | 2²⁵⁵·²¹ | 2⁸⁸ | 2¹⁴·⁸² bytes |
| Dim-8 biclique | 2²⁵⁵·⁴⁵ | 2⁶⁰ | 2¹⁸·⁸⁶ bytes |
Both attacks are marginally faster than exhaustive key search (2²⁵⁶), by approximately 0.55-0.79 bits. However:
- 2⁶⁰-2⁸⁸ chosen ciphertexts is physically impossible to obtain under a single key
- Time complexities of 2²⁵⁵ are not meaningfully less secure than 2²⁵⁶
- These attacks provide no practical cryptanalytic leverage
[!WARNING] The biclique paper contains a formula error in the key schedule description: it states
k_i = SBox_{(3-(i mod 33)) mod 32}(w_i), which is incorrect. The correct formula isS_{(3−n) mod 8}for the n-th group of four prekey words (verified against the original AES submission paper and the reference C implementation). The error does not affect the attack results; it is a documentation issue only.
Practical threat to 32-round Serpent-256: None. The attack is 0.55-0.79 bits better than brute force and requires infeasible data and computation.
Related-Key Attacks
Applicability: Not applicable to Serpent
The Serpent key schedule XORs the round index i into every prekey word and applies
a different S-box at each of the 33 key schedule groups. This makes the key schedule
highly non-linear and key-dependent in a way that defeats standard related-key
differential attacks. The original designers explicitly analyzed and dismissed this
attack class.
Power Analysis and DPA
Applicability: Implementation-dependent
Differential Power Analysis (DPA) attacks observe the power consumption or electromagnetic emissions of a device during encryption to recover key bits. For Serpent:
- Bitslice implementations use every bit of every operand in every operation, causing fast data-key avalanche and reducing DPA correlation
- The code execution path does not vary with the data values (no data-dependent branches)
- These properties provide DPA resistance, but genuine DPA mitigation requires hardware countermeasures (masking, shuffling) that are outside Serpent's scope
Overall Security Assessment
Serpent-256 with 32 rounds has the following security posture as of 2026:
| Attack class | Best result | Threat level |
|---|---|---|
| Differential | 24 rounds (infeasible data) | None |
| Linear | 28 rounds (infeasible data) | None |
| Multidim. linear | 12 rounds | None |
| Biclique (full) | 2²⁵⁵·²¹ (0.79 bits vs brute force) | None |
| Related-key | Not applicable | None |
| Side-channel (bitslice) | No cache-timing in S-boxes | None at algorithm level |
Security margin: The best mathematical attack reaches 12 of 32 rounds, leaving a margin of 20 rounds, wider than AES-256's 5-round margin. No practical attack on full Serpent-256 is known. A correct 32-round implementation is cryptographically secure for all foreseeable use cases.
Cross-References
| Document | Description |
|---|---|
| index | Project Documentation index |
| serpent_audit | security audit results: algorithm correctness verification and side-channel analysis |
| serpent | TypeScript API for Serpent-256 (SerpentCipher, Seal, raw modes) |
| asm_serpent | WASM implementation: bitslice S-boxes, key schedule, CTR/CBC in AssemblyScript |