无进位乘法和GHASH - emmansun/gmsm GitHub Wiki

概述

参考Reference 1，Page 35 - 39

The PCLMULQDQ + AES-NI combination for AES-GCM
- AES-NI facilitate high performance AES encryption and decryption
- PCLMULQDQ $64 \times 64 \rightarrow 128$ (carry-less)
  - Binary polynomial multiplication; speed up computations in binary fields
- Using it for AES-GCM:
- To use it for GHASH computations: $GF(2^{128})$ multiplication:
  1. Compute $128 \times 128 \rightarrow 256$ via carry-less multiplication (of 64-bit operands)
  2. Reduction: ${256 \rightarrow 128} \ modulo \ {x^{128} + x^7 + x^2 + x + 1}$ (done efficiently via software)
128-bit Carry-less Multiplication using PCLMULQDQ
(Gueron Kounavis, 2009) Multiply $128 \times 128 \rightarrow 256 \ [A_1 : A_0]\cdot[B_1 : B_0]$
- Schoolbook (4 PCLMULQDQ invocations)
  $A_0 \cdot B_0 = [C_1 : C_0], \ A_1 \cdot B_1 = [D_1 : D_0]$
  $A_0 \cdot B_1 = [E_1 : E_0], \ A_1 \cdot B_0 = [F_1 : F_0]$
  $[A_1 : A_0] \cdot [B_1 : B_0] = [D_1:D_0 \oplus E_1 \oplus F_1:C_1 \oplus E_0 \oplus F_0 : C_0]$
- Carry-less Karatsuba (3 PCLMULQDQ invocations)
  $A_1 \cdot B_1 = [C_1 : C_0], \ A_0 \cdot B_0 = [D_1 : D_0]$
  $(A_1 \oplus A_0) \cdot (B_1 \oplus B_0) = [E_1 : E_0]$
  $[A_1 : A_0] \cdot [B_1 : B_0] = [C_1:C_0 \oplus C_1 \oplus D_1 \oplus E_1 : D_1 \oplus C_0 \oplus D_0 \oplus E_0 : D_0]$
A new interpretation to GHASH operations
- GHASH does not use $GF(2^{128})$ computations "as expected"
  - Not in the usual polynomial representation convention
  - The bits inside the 128-bit operands are reflected
  - Actually - it is an operation on a permutation of elements of $GF(2^{128})$
    - $T1 = reflect(A)$
    - $T2 = reflect(B)$
    - $T3 \times T2 \ modulo \ {x^{128} + x^7 + x^2 + x + 1}$ (a $GF(2^{128})$ multiplication)
    - $reflect(T3)$
- It can be proved that this operation is:
  - $A \times B \times x^{-127} \ mod \ x^{128} + x^{127} + x^{126} + x^{121} + 1$
    - A weird Montgomery Multiplication in $GF(2^{128})$ modulo a reversed poly
  - Better written as
    - $A \times (B \times x) \times x^{-128} \ mod \ x^{128} + x^{127} + x^{126} + x^{121} + 1$
Fast reduction modulo $x^{128} + x^{127} + x^{126} + x^{121} + 1$
- Input 256-bit operand $[X_3:X_2:X_1:X_0]$
- $[A_1:A_0] = X_0 \cdot 0xc200000000000000 $
- $[B_1:B_0] = [X_0 \oplus A_1 : X_1 \oplus A_0]$
- $[C_1:C_0] = B_0 \cdot 0xc200000000000000 $
- $[D_1:D_0] = [B_0 \oplus C_1 : B_1 \oplus C_0]$
- Output: $[D_1 \oplus X_3 : D_0 \oplus X_2]$

; Input is in T1:T0
vmodqa     T3, [W]               ; poly
vpclmulqda T2, T3, T0, 0x01
vpshufd    T4, T0, 78
vpxor      T4, T4, T2
vpclmulqda T2, T3, T4, 0x01
vpshufd    T4, T4, 78
vpxor      T4, T4, T2
vpxor      T1, T1, T4            ; result in T1

Aggregated Reduction
- In a Horner form (iterative computation)
  - $Y_i = MM[(X_i \oplus Y_{i-1}), Hx]$ ... everyting $mod \ x^{128} + x^{127} + x^{126} + x^{121} + 1$
- 4-way expanded Horner form (aggregate results to defer the reduction)
  - $MM[X_i , Hx] \oplus MM[X_{i-1} , {(Hx)}^2] \oplus MM[X_{i-2} , {(Hx)}^3] \oplus MM[(X_{i-3} \oplus Y_{i-4}, {(Hx)}^4] $
  - Can be expanded to N > 4 blocks, we use 8 blocks now.
  - Overhead: pre-calculate the powers of $H \cdot x$ (amortized for reasonably long buffer)
  - The gain: reduction deferred to once per "N" blocks