serpent_reference - xero/leviathan-crypto GitHub Wiki

Serpent-256 Algorithm Reference

Algorithm specification, bitslice implementation details, and known attack analysis for Serpent-256. Based on the Anderson/Biham/Knudsen AES submission papers and Ross Anderson's reference implementation.

Table of Contents

Algorithm Overview

Origins and Design Goals

AES Competition Placement

Bitslice Design

Serpent vs AES

Algorithm Specification

Parameters

Key Schedule

S-Boxes

Linear Transform

Round Function

Encryption

Decryption

Bitslice Implementation

The Bitslice Technique

How It Works for Serpent

No Cache-Timing Channels

Performance

Bitslice vs Table-Lookup

Known Attacks

Differential Cryptanalysis

Linear Cryptanalysis

Multidimensional Linear Cryptanalysis

Biclique Cryptanalysis

Related-Key Attacks

Power Analysis and DPA

Overall Security Assessment

Algorithm Overview

Origins and Design Goals

Serpent is a symmetric block cipher designed by Ross Anderson (University of Cambridge), Eli Biham (Technion), and Lars Knudsen (University of Bergen). It was submitted to the NIST Advanced Encryption Standard (AES) competition in 1998 as a candidate to replace DES.

The designers' explicit priority was maximum security margin over performance. While Rijndael (the eventual AES winner) was designed to be fast on a wide range of hardware, Serpent prioritized security through a conservative design, using 32 rounds rather than Rijndael's 10/12/14. The designers believed Serpent was the more secure submission and that NIST selected Rijndael primarily for its superior performance.

AES Competition Placement

Serpent was selected as one of the five AES finalists, alongside Rijndael, Twofish, RC6, and MARS. In the final NIST vote it placed second, very close to Rijndael. Many cryptographers and security practitioners who prioritize security margin over performance prefer Serpent to AES for that reason.

Bitslice Design

Serpent was designed around the bitslice implementation technique, introduced by Eli Biham for DES. In a bitslice implementation, multiple blocks are encrypted in parallel by treating each bit position across all blocks as a single 32-bit word. The S-boxes are then implemented as Boolean circuits (AND, OR, XOR) operating on these wide words, rather than as table lookups.

This choice was deliberate and has two key consequences:

Software performance: A bitslice implementation processes 32 blocks simultaneously using integer registers, achieving throughput that exceeds a naive implementation despite the more complex S-box computation.
Timing side-channel resistance: The S-boxes use only data-independent Boolean operations with no memory accesses indexed by secret data, so there are no cache-timing side channels in the S-box layer. The original paper explicitly claims timing attacks are "not applicable" to Serpent, though this depends entirely on using the bitslice implementation. A table-lookup implementation would be vulnerable.

Serpent vs AES

Property	Serpent-256	AES-256
Rounds	32	14
Best known cryptanalytic attack	12 rounds (theoretical)	9 rounds (theoretical)
Security margin (rounds)	20	5
Relative software speed	~3-4x slower	Faster
S-box timing channel risk	None (bitslice)	Yes (if not using AES-NI)

Serpent's 20-round security margin versus AES's 5-round margin is the central argument for choosing it in high-security applications where performance is less critical.

Algorithm Specification

Parameters

Parameter	Value
Block size	128 bits
Key sizes	128, 192, or 256 bits
Number of rounds	32
Subkeys	33 (K₀ through K₃₂), each 128 bits
S-boxes	8 distinct 4-bit to 4-bit permutations

Key Schedule

Key Padding

All key sizes are padded to 256 bits before processing. The padding rule appends a single 1 bit immediately after the last key bit, then zeros to fill to 256 bits:

128-bit key: bit 128 set to 1, bits 129-255 set to 0
192-bit key: bit 192 set to 1, bits 193-255 set to 0
256-bit key: no padding required

The padded key is treated as eight 32-bit words: w₋₈, w₋₇, ..., w₋₁.

Prekey Expansion

132 prekey words w₀ through w₁₃₁ are generated by the following affine recurrence:

w_i = (w_{i-8} ⊕ w_{i-5} ⊕ w_{i-3} ⊕ w_{i-1} ⊕ φ ⊕ i) <<< 11

Where:

<<< denotes 32-bit left rotation
φ = 0x9E3779B9 (the fractional part of the golden ratio: (√5−1)/2 × 2³²)
i is the word index (0 through 131)
The XOR with i ensures each word uses a unique constant, preventing slide attacks

The constant 0x9E3779B9 is the same constant used in TEA, XTEA, and many other ciphers for this purpose. The underlying recurrence polynomial is x⁸ + x⁷ + x⁵ + x³ + 1, which is primitive over GF(2), ensuring good mixing.

Round Key Derivation

The 132 prekey words are grouped into 33 groups of 4 and passed through the S-boxes (in bitslice mode) to produce the 33 subkeys K₀ through K₃₂:

{k_{4n}, k_{4n+1}, k_{4n+2}, k_{4n+3}} = S_{(3-n) mod 8}(w_{4n}, w_{4n+1}, w_{4n+2}, w_{4n+3})

For n = 0, 1, ..., 32. The S-box applied at group n is S_{(3−n) mod 8}:

Group n	S-box	Subkey
0	S₃	K₀
1	S₂	K₁
2	S₁	K₂
3	S₀	K₃
4	S₇	K₄
5	S₆	K₅
6	S₅	K₆
7	S₄	K₇
8	S₃	K₈
...	(repeats with period 8)	...
32	S₃	K₃₂

Each group of four k-words is concatenated to form a 128-bit subkey: K_n = k_{4n} ‖ k_{4n+1} ‖ k_{4n+2} ‖ k_{4n+3}.

S-Boxes

Serpent uses 8 distinct 4-bit to 4-bit substitution boxes (S₀ through S₇). Each S-box maps a 4-bit input to a 4-bit output. A given S-box is used in every 8th round; round i uses S_{i mod 8}.

The S-boxes were designed using DES S-box matrix rows as a starting point, shuffled until the following cryptographic criteria were satisfied:

Differential characteristic probability ≤ 1/4 (for any single S-box)
No one-bit input difference ever produces a one-bit output difference
Linear characteristic probability in the range 1/2 ± 1/4
Single-bit linear relation probability in the range 1/2 ± 1/8
Nonlinear order of each output bit = 3 (maximum possible for a 4-bit function)

The complete S-box tables (source: serpent-tables.h from Ross Anderson's reference implementation):

S₀: { 3, 8,15, 1,10, 6, 5,11,14,13, 4, 2, 7, 0, 9,12 }
S₁: {15,12, 2, 7, 9, 0, 5,10, 1,11,14, 8, 6,13, 3, 4 }
S₂: { 8, 6, 7, 9, 3,12,10,15,13, 1,14, 4, 0,11, 5, 2 }
S₃: { 0,15,11, 8,12, 9, 6, 3,13, 1, 2, 4,10, 7, 5,14 }
S₄: { 1,15, 8, 3,12, 0,11, 6, 2, 5, 4,10, 9,14, 7,13 }
S₅: {15, 5, 2,11, 4,10, 9,12, 0, 3,14, 8,13, 6, 7, 1 }
S₆: { 7, 2,12, 5, 8, 4, 6,11,14, 9, 1,15,13, 3,10, 0 }
S₇: { 1,13,15, 0,14, 8, 2,11, 7, 4,12,10, 9, 3, 5, 6 }

Read each table as: S_n[input] = output. For example, S₀[0] = 3, S₀[1] = 8, S₀[15] = 12.

The inverse S-boxes (used during decryption):

S₀⁻¹: {13, 3,11, 0,10, 6, 5,12, 1,14, 4, 7,15, 9, 8, 2 }
S₁⁻¹: { 5, 8, 2,14,15, 6,12, 3,11, 4, 7, 9, 1,13,10, 0 }
S₂⁻¹: {12, 9,15, 4,11,14, 1, 2, 0, 3, 6,13, 5, 8,10, 7 }
S₃⁻¹: { 0, 9,10, 7,11,14, 6,13, 3, 5,12, 2, 4, 8,15, 1 }
S₄⁻¹: { 5, 0, 8, 3,10, 9, 7,14, 2,12,11, 6, 4,15,13, 1 }
S₅⁻¹: { 8,15, 2, 9, 4, 1,13,14,11, 6, 5, 3, 7,12,10, 0 }
S₆⁻¹: {15,10, 1,13, 5, 3, 6, 0, 4, 9,14, 7, 2,12, 8,11 }
S₇⁻¹: { 3, 0, 6,13, 9,14,15, 8, 5,12,11, 7,10, 1, 4, 2 }

Linear Transform

The linear transform (LT) is applied after each S-box layer in rounds 0-30. It takes four 32-bit words (X₀, X₁, X₂, X₃) and produces four new words. The operation uses only 32-bit rotations and XORs, with no table lookups and no data-dependent branches.

Forward linear transform (encryption):

 1.  X₀ = X₀ <<< 13
 2.  X₂ = X₂ <<< 3
 3.  X₁ = X₁ ⊕ X₀ ⊕ X₂
 4.  X₃ = X₃ ⊕ X₂ ⊕ (X₀ << 3)
 5.  X₁ = X₁ <<< 1
 6.  X₃ = X₃ <<< 7
 7.  X₀ = X₀ ⊕ X₁ ⊕ X₃
 8.  X₂ = X₂ ⊕ X₃ ⊕ (X₁ << 7)
 9.  X₀ = X₀ <<< 5
10.  X₂ = X₂ <<< 22

Where <<< is 32-bit left rotation and << is logical left shift (no rotation).

Inverse linear transform (decryption):

The inverse undoes each step in reverse order, replacing each rotation by r with a rotation by (32 − r):

 1.  X₂ = X₂ <<< 10         (undoes step 10: 32 − 22 = 10)
 2.  X₀ = X₀ <<< 27         (undoes step 9:  32 − 5  = 27)
 3.  X₂ = X₂ ⊕ X₃ ⊕ (X₁ << 7)   (undoes step 8)
 4.  X₀ = X₀ ⊕ X₁ ⊕ X₃          (undoes step 7)
 5.  X₃ = X₃ <<< 25         (undoes step 6:  32 − 7  = 25)
 6.  X₁ = X₁ <<< 31         (undoes step 5:  32 − 1  = 31)
 7.  X₃ = X₃ ⊕ X₂ ⊕ (X₀ << 3)   (undoes step 4)
 8.  X₁ = X₁ ⊕ X₀ ⊕ X₂          (undoes step 3)
 9.  X₂ = X₂ <<< 29         (undoes step 2:  32 − 3  = 29)
10.  X₀ = X₀ <<< 19         (undoes step 1:  32 − 13 = 19)

Round Function

Each of the 32 rounds (indexed 0 through 31) applies three operations in sequence:

Rounds 0-30:

Key mixing: XOR the current 128-bit state with subkey Kᵢ
S-box layer: Apply S_{i mod 8} to all 32 parallel 4-bit groups
Linear transform: Apply the forward LT to the four 32-bit state words

Round 31 (final round):

Key mixing: XOR the state with K₃₁
S-box layer: Apply S_{31 mod 8} = S₇ to all groups
(No linear transform) Final key mixing: XOR the state with K₃₂

The omission of the linear transform in the final round and its replacement with a second key addition is the standard SP-network construction that ensures decryption is a mirror image of encryption.

Encryption

Complete encryption sequence:

Input: 128-bit plaintext P, 256-bit key K
Output: 128-bit ciphertext C

1. Derive subkeys K₀..K₃₂ from K via the key schedule
2. B₀ = P
3. For i = 0 to 30:
     Bᵢ₊₁ = LT(S_{i mod 8}(Bᵢ ⊕ Kᵢ))
4. C = S₇(B₃₁ ⊕ K₃₁) ⊕ K₃₂

In the bitslice description, steps 2-4 operate on the four 32-bit words representing the block. There are no IP or FP permutations in the bitslice form; they are absorbed into the key schedule.

Decryption

Decryption applies every operation in reverse order with inverse functions:

Input: 128-bit ciphertext C, 256-bit key K
Output: 128-bit plaintext P

1. Derive subkeys K₀..K₃₂ from K (same key schedule as encryption)
2. B₃₂ = C ⊕ K₃₂
3. B₃₁ = S₇⁻¹(B₃₂) ⊕ K₃₁
4. For i = 30 downto 0:
     Bᵢ = S_{i mod 8}⁻¹(LT⁻¹(Bᵢ₊₁)) ⊕ Kᵢ
5. P = B₀

The key schedule is identical for encryption and decryption. Subkeys are derived once and used in reverse order during decryption.

Bitslice Implementation

The Bitslice Technique

A conventional block cipher implementation processes each block sequentially, passing one 128-bit value through the S-boxes one step at a time. The S-boxes are implemented as lookup tables indexed by small groups of bits.

A bitslice implementation processes n blocks in parallel by transposing the data representation. Rather than storing one 128-bit block as a unit, each of the 128 bit positions across all n blocks is stored in a single n-bit register. For a 32-bit architecture, n = 32, so 32 blocks are processed simultaneously.

Under this representation, the S-box is no longer a table lookup. Instead, it is implemented as a Boolean circuit of AND, OR, XOR, and NOT operations on the wide registers, computing all four output bits simultaneously for all 32 blocks.

How It Works for Serpent

A 128-bit Serpent block consists of four 32-bit words: (X₀, X₁, X₂, X₃). In the bitslice representation processing 32 blocks simultaneously, these four words become four 32-bit registers where each bit position corresponds to one of the 32 parallel blocks.

The S-box receives four one-bit inputs (one from each of X₀, X₁, X₂, X₃ at the same bit position) and produces four one-bit outputs. The Boolean circuit for each S-box is derived algebraically from the 4-bit to 4-bit lookup table; it computes the same mapping as the table but using only bitwise operations.

For example, S₀ maps a 4-bit input to a 4-bit output using the substitution {3,8,15,1,10,6,5,11,14,13,4,2,7,0,9,12}. The Boolean circuit that computes this mapping for all 32 blocks in parallel is a sequence of roughly 15-20 gate operations (AND, OR, XOR, NOT) on the four input registers.

No Cache-Timing Channels

A table-lookup S-box implementation has a fundamental vulnerability: the memory address accessed depends on the secret data (plaintext XOR key). On a processor with a data cache, access time varies based on whether the cache line containing the table entry is warm or cold. An attacker who can measure timing or shares a cache with the victim can learn information about the key.

The Serpent authors note that the bitslice S-box "does not access different locations in memory for different plaintexts or keys." All 32 gate operations execute unconditionally and access no memory other than the four register operands. There is nothing for a cache-timing adversary to observe.

This property holds at the S-box level. The linear transform (rotations and XORs) is also data-independent by nature. The Serpent core produces no data-dependent memory accesses whatsoever.

Performance

The bitslice technique processes 32 blocks in the time it takes to encrypt one block conventionally. The throughput advantage is therefore up to 32x, modulo the overhead of transposing data in and out of bitslice format.

For bulk encryption (e.g. CTR or CBC mode over large data), the amortized throughput approaches this theoretical maximum. For single-block operations, the overhead of the format conversion dominates and throughput is lower.

The original paper quotes a bitslice implementation at ~11,000 clock cycles per block at 3.5 MHz (~40.7 Kbit/s), versus ~34,000 cycles for a compact table-lookup implementation (~12.8 Kbit/s). Counterintuitively, bitslice is ~3x faster in bulk throughput despite its more complex S-box logic, because the table-lookup version is not vectorized.

Bitslice vs Table-Lookup

Property	Bitslice	Table-Lookup
S-box implementation	Boolean circuit (AND/OR/XOR)	Memory table indexed by secret
Cache-timing side channels	None	Present if cache is shared
Throughput (bulk)	High (32 blocks in parallel)	Low (1 block at a time)
Single-block latency	Higher (format conversion overhead)	Lower
Code size	Larger	Smaller
Data-independent execution	Yes	No

Known Attacks

Differential Cryptanalysis

Source: Original AES submission paper (Anderson/Biham/Knudsen, 1998) Rounds reached: 24 (theoretical characteristic) Type: Classical differential cryptanalysis

The best differential characteristic found by the designers covers 24 rounds with probability ≤ 2⁻²³². Mounting an attack requires more plaintext pairs than exist in the 2¹²⁸ possible values, making the attack physically impossible. The designers applied the characteristic analysis at the single-S-box level (probability ≤ 1/4) and showed that the cascade through multiple rounds reduces the probability exponentially.

Practical threat to 32-round Serpent: None. The attack does not reach full 32 rounds and requires infeasible data quantities even for 24 rounds.

Linear Cryptanalysis

Source: Original AES submission paper (Anderson/Biham/Knudsen, 1998) Rounds reached: 28 (theoretical linear hull) Type: Classical linear cryptanalysis

The best linear hull found by the designers for 28 rounds has a bias of approximately 1/2 ± 2⁻¹²⁰. An attack exploiting this bias would require approximately 2²⁴⁰ known plaintexts, far more than can be encrypted with a single key in any realistic scenario.

Practical threat to 32-round Serpent: None. The attack does not reach full 32 rounds and the data requirement is astronomically large.

Multidimensional Linear Cryptanalysis

Source: "Improving the Algorithm 2 in Multidimensional Linear Cryptanalysis" (Nguyen, Wu, Wang, ACISP 2011) Rounds reached: 12 Type: Multidimensional linear cryptanalysis using FFT/Walsh-Hadamard Transform

This paper improved upon Matsui's Algorithm 2 by using the Fast Walsh-Hadamard Transform (FWHT) to evaluate all 2^k linear approximations simultaneously. Applied to Serpent, it represents the best published linear-family attack as of 2011.

Configuration	Rounds	Data	Time	Memory
Method 1	11	2¹¹⁶ KP	2¹⁰⁷·⁵	2¹⁰⁴
Method 2	12	2¹¹⁸ KP	2²²⁸·⁸	2²²⁸
Method 1	12	2¹¹⁶ KP	2²³⁷·⁵	2¹²¹
Prior best (linear)	11	2¹¹⁸ KP	2¹¹⁴·³	2¹⁰⁸
Prior best (diff-linear)	12	2¹²³·⁵ CP	2²⁴⁹·⁴	2¹²⁸·⁵

KP = Known Plaintexts; CP = Chosen Plaintexts.

The 12-round results, while faster than brute force for reduced-round Serpent, require time complexity of 2²²⁸-2²³⁷, which is effectively infeasible. More importantly, these attacks reach only 12 of the 32 rounds. Full Serpent has 20 more rounds of security margin beyond the reach of these attacks.

Practical threat to 32-round Serpent: None. These are mathematical results on reduced-round variants only.

Biclique Cryptanalysis

Source: "The First Biclique Cryptanalysis of Serpent-256" (de Carvalho & Kowada, unpublished competition entry) Rounds reached: 32 (full) Type: Biclique cryptanalysis (meet-in-the-middle variant)

This is the only published attack that applies to the full 32-round Serpent-256. Biclique cryptanalysis (the same technique applied to AES by Bogdanov et al. in 2011) partitions the cipher into two parts and uses a biclique structure to reduce the effective key search complexity.

Results for full 32-round Serpent-256:

Attack	Time	Data (chosen CTs)	Memory
Dim-4 biclique	2²⁵⁵·²¹	2⁸⁸	2¹⁴·⁸² bytes
Dim-8 biclique	2²⁵⁵·⁴⁵	2⁶⁰	2¹⁸·⁸⁶ bytes

Both attacks are marginally faster than exhaustive key search (2²⁵⁶), by approximately 0.55-0.79 bits. However:

2⁶⁰-2⁸⁸ chosen ciphertexts is physically impossible to obtain under a single key
Time complexities of 2²⁵⁵ are not meaningfully less secure than 2²⁵⁶
These attacks provide no practical cryptanalytic leverage

[!WARNING] The biclique paper contains a formula error in the key schedule description: it states k_i = SBox_{(3-(i mod 33)) mod 32}(w_i), which is incorrect. The correct formula is S_{(3−n) mod 8} for the n-th group of four prekey words (verified against the original AES submission paper and the reference C implementation). The error does not affect the attack results; it is a documentation issue only.

Practical threat to 32-round Serpent-256: None. The attack is 0.55-0.79 bits better than brute force and requires infeasible data and computation.

Related-Key Attacks

Applicability: Not applicable to Serpent

The Serpent key schedule XORs the round index i into every prekey word and applies a different S-box at each of the 33 key schedule groups. This makes the key schedule highly non-linear and key-dependent in a way that defeats standard related-key differential attacks. The original designers explicitly analyzed and dismissed this attack class.

Power Analysis and DPA

Applicability: Implementation-dependent

Differential Power Analysis (DPA) attacks observe the power consumption or electromagnetic emissions of a device during encryption to recover key bits. For Serpent:

Bitslice implementations use every bit of every operand in every operation, causing fast data-key avalanche and reducing DPA correlation
The code execution path does not vary with the data values (no data-dependent branches)
These properties provide DPA resistance, but genuine DPA mitigation requires hardware countermeasures (masking, shuffling) that are outside Serpent's scope

Overall Security Assessment

Serpent-256 with 32 rounds has the following security posture as of 2026:

Attack class	Best result	Threat level
Differential	24 rounds (infeasible data)	None
Linear	28 rounds (infeasible data)	None
Multidim. linear	12 rounds	None
Biclique (full)	2²⁵⁵·²¹ (0.79 bits vs brute force)	None
Related-key	Not applicable	None
Side-channel (bitslice)	No cache-timing in S-boxes	None at algorithm level

Security margin: The best mathematical attack reaches 12 of 32 rounds, leaving a margin of 20 rounds, wider than AES-256's 5-round margin. No practical attack on full Serpent-256 is known. A correct 32-round implementation is cryptographically secure for all foreseeable use cases.

Cross-References

Document	Description
index	Project Documentation index
serpent_audit	security audit results: algorithm correctness verification and side-channel analysis
serpent	TypeScript API for Serpent-256 (SerpentCipher, Seal, raw modes)
asm_serpent	WASM implementation: bitslice S-boxes, key schedule, CTR/CBC in AssemblyScript