09 Long Short Term Memory (LSTM) Recurrent Network

Module: Long Short-Term Memory (LSTM)

1. Overview: The Motivation for LSTMs

Standard RNNs suffer from Vanishing Gradients. When backpropagating through many timesteps, the gradient is repeatedly multiplied by the same recurrent weights and activation derivatives; when these factors are small, the gradient shrinks towards zero, meaning the model "forgets" information from the beginning of the sequence.

LSTMs mitigate this by using a Cell State ($C_t$) and a series of Gates to carefully regulate the flow of information.

2. The LSTM Cell Architecture

  • The LSTM's power lies in its ability to selectively add or remove information from the Cell State ($C_t$), which acts as a long-term memory conveyor belt.

  • Referring to the LSTM Cell diagram below, think of the Cell State ($C_t$) as a long-term conveyor belt (Long-Term Memory) and the Hidden State ($h_t$) as the specific information being put to work for the current prediction (Short-Term Memory).

2.1 The LSTM Cell Diagram

The diagram below represents a single LSTM cell. Unlike a standard RNN, which has a single tanh layer, this cell contains four interacting components.

graph TD
    %% Inputs
    H_prev[h_t-1] --> CONCAT["[ , ] Concatenate"]
    X_in[x_t] --> CONCAT
    C_prev[C_t-1] -- "Cell State (Long-term)" --> MULT1((x))

    %% Gate Calculations
    CONCAT -.-> F_GATE[Forget Gate σ]
    CONCAT -.-> I_GATE[Input Gate σ]
    CONCAT -.-> C_CAND[Candidate tanh]
    CONCAT -.-> O_GATE[Output Gate σ]

    %% Main Flow Logic
    F_GATE --> MULT1
    I_GATE --> MULT2((x))
    C_CAND --> MULT2
    MULT1 --> ADD1((+))
    MULT2 --> ADD1
    
    %% Cell State to Output
    ADD1 --> C_curr[C_t]
    ADD1 --> TANH_OUT[tanh]
    TANH_OUT --> MULT3((x))
    O_GATE --> MULT3
    MULT3 --> H_curr[h_t]

    %% Styling
    style CONCAT fill:#f9f,stroke:#333,stroke-width:2px
    style MULT1 fill:#333,color:#fff
    style MULT2 fill:#333,color:#fff
    style MULT3 fill:#333,color:#fff
    style ADD1 fill:#333,color:#fff

Operations & Symbols

  • [ , ] (Concatenation Layer): This merges the previous hidden state $h_{t-1}$ and the current input $x_t$ into a single vector. This combined vector is the "raw material" fed into all four gates.

  • x (Pointwise Multiplication / Gating): This represents the "Gating" action.

    • By multiplying the data by a value from the Sigmoid layer ($0$ to $1$), the cell decides how much information to let through (a minimal sketch of these pointwise operations follows this list).
  • + (Pointwise Addition): This is the linear update of the Cell State.

    • Key Insight: In standard RNNs, we multiply by weights at every step, which causes gradients to vanish. In LSTMs, we add the new info, creating a "highway" to preserve the gradient travelling back in time.
  • Dashed/Connecting Lines: These show the distribution of the $[h_{t-1}, x_t]$ signals to the four internal LSTM cell components (Forget, Input, Candidate, and Output).
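
The pointwise operations above can be sketched directly in NumPy. The snippet below is a minimal, self-contained illustration (the vector sizes and values are made up); it only demonstrates the primitive operations, while the full gate equations are covered in Section 2.2.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy vectors: hidden size h = 4, input size d = 2 (arbitrary values).
h_prev = np.array([0.1, -0.3, 0.5, 0.0])   # previous hidden state h_{t-1}
x_t    = np.array([0.7, 0.2])              # current input x_t
concat = np.concatenate([h_prev, x_t])     # [h_{t-1}, x_t], length h + d = 6

gate      = sigmoid(np.array([2.0, -2.0, 0.0, 4.0]))   # gate vector, each entry in (0, 1)
candidate = np.tanh(np.array([0.5, -1.0, 0.8, 0.3]))   # value vector, each entry in (-1, 1)
c_prev    = np.ones(4)                                 # old cell state

# "x": pointwise multiplication — the gate decides how much of each value passes.
gated = gate * candidate
# "+": pointwise addition — the gated update is merged into the cell state linearly.
c_new = c_prev + gated

print(concat.shape, gated.round(3), c_new.round(3))
```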

2.2 LSTM Gates in a Cell

A. Forget Gate ($f_t$)

Function: Decides what information is no longer useful and should be discarded.

  • Formula: $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$
  • Logic: Outputs a vector of values between $0$ (forget completely) and $1$ (keep entirely).
  • Example: When a new subject appears in a sentence, the gate may "forget" the gender of the previous subject.

B. Input Gate ($i_t$) & Candidate Values ($\tilde{C}_t$)

Function: Decides what new information is worth storing in the long-term memory.

  1. Input Gate ($i_t$): $i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$ — Determines which dimensions of the cell state to update.
  2. Candidate ($\tilde{C}_t$): $\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$ — Creates a vector of new values that could be added.
  • Update Calculation: The actual new info added is $i_t \odot \tilde{C}_t$.

C. Updating Cell State ($C_t$)

Function: Merges the old memory with the new filtered input.

  • Formula: $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$
  • Key Insight: Because this update is additive rather than purely multiplicative, the gradient flowing backwards along the cell state is scaled by the forget-gate values at each timestep instead of being repeatedly multiplied by the same weight matrix, so the additive structure keeps a clear path open for the gradients (a toy numeric illustration follows below). This is the "secret sauce" of LSTMs.
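
A toy numeric illustration of this point (a sketch, not a derivation): along the cell-state path the backward signal is scaled by the forget-gate values, whereas a plain RNN keeps multiplying by the same small factor.

```python
import numpy as np

T = 100  # number of timesteps to backpropagate through

# Plain RNN-style path: the gradient is repeatedly scaled by a small factor
# (standing in for recurrent weight * activation derivative at each step).
rnn_factor = 0.5
rnn_scale = rnn_factor ** T

# LSTM cell-state path: the gradient is scaled by the forget gate at each step.
# If the network learns to keep f_t close to 1, the product stays usable.
forget_gates = np.full(T, 0.97)
lstm_scale = np.prod(forget_gates)

print(f"RNN-style gradient scale after {T} steps: {rnn_scale:.3e}")   # ~7.9e-31
print(f"LSTM cell-state scale with f_t = 0.97:    {lstm_scale:.3e}")  # ~4.8e-02
```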

D. Output Gate ($o_t$)

Function: Decides what parts of the Cell State ($C_t$) to export to the Hidden State ($h_t$).

  • Gate Formula: $o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$
  • Hidden State Formula: $h_t = o_t \odot \tanh(C_t)$
  • Usage: $h_t$ is the "visible" state used for predictions and passed to the next cell.

Mathematical Note: Tensors and Shapes

The variables above represent different mathematical objects:

  • $x_t$: Input vector of size $(d \times 1)$.
  • $h_t, C_t$: Hidden and Cell state vectors of size $(h \times 1)$.
  • $W_f, W_i, W_C, W_o$: Learnable weight Matrices of size $(h \times [h + d])$.
  • $b_f, b_i, b_C, b_o$: Bias vectors of size $(h \times 1)$.

The Flow: $\text{Matrix } (h \times [h+d]) \cdot \text{Vector } ([h+d] \times 1) = \text{Gate Vector } (h \times 1)$
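
These shapes can be checked with a minimal NumPy implementation of one LSTM cell step. This is a sketch following the equations on this page, with randomly initialised (untrained) weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x_t, h_prev, c_prev, W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o):
    """One LSTM timestep, following Sections 2.2 A-D."""
    concat = np.concatenate([h_prev, x_t])      # ([h + d] x 1)
    f_t = sigmoid(W_f @ concat + b_f)           # forget gate      (h x 1)
    i_t = sigmoid(W_i @ concat + b_i)           # input gate       (h x 1)
    c_tilde = np.tanh(W_C @ concat + b_C)       # candidate values (h x 1)
    o_t = sigmoid(W_o @ concat + b_o)           # output gate      (h x 1)
    c_t = f_t * c_prev + i_t * c_tilde          # new cell state   (h x 1)
    h_t = o_t * np.tanh(c_t)                    # new hidden state (h x 1)
    return h_t, c_t

d, h = 3, 5                                     # input and hidden sizes (illustrative)
rng = np.random.default_rng(0)
W_f, W_i, W_C, W_o = (rng.normal(size=(h, h + d)) for _ in range(4))
b_f, b_i, b_C, b_o = (np.zeros(h) for _ in range(4))

x_t = rng.normal(size=d)
h_prev, c_prev = np.zeros(h), np.zeros(h)
h_t, c_t = lstm_cell_step(x_t, h_prev, c_prev, W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o)
print(h_t.shape, c_t.shape)                     # (5,) (5,) — i.e. (h x 1) vectors
```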

E. Summary Table: LSTM Equations

| Component    | Equation                                               | Activation | Purpose                         |
|--------------|--------------------------------------------------------|------------|---------------------------------|
| Forget Gate  | $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$         | Sigmoid    | Erase irrelevant history        |
| Input Gate   | $i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$         | Sigmoid    | Select which new info to keep   |
| Candidate    | $\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$  | Tanh       | Generate potential new memories |
| Cell State   | $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$      | Linear     | The "Conveyor Belt" (Long-term) |
| Output Gate  | $o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$         | Sigmoid    | Filter for the hidden state     |
| Hidden State | $h_t = o_t \odot \tanh(C_t)$                           | Tanh       | Current context / Prediction    |

[!NOTE] Tutor Deep Dive: Sigmoid vs. Tanh

  • Sigmoid is used for gates because it outputs values in $(0, 1)$, acting like a soft on/off switch.
  • Tanh is used for values/candidates because it outputs values in $(-1, 1)$, which keeps them zero-centered and allows the cell state to be increased or decreased (a quick numeric check of the two ranges follows below).
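
A quick numeric check of the two ranges (illustrative inputs only):

```python
import numpy as np

z = np.linspace(-6, 6, 7)           # a few sample pre-activations
sig = 1.0 / (1.0 + np.exp(-z))      # gate activation: squashed into (0, 1)
tanh = np.tanh(z)                   # value activation: squashed into (-1, 1)

print(np.round(sig, 3))    # approx. [0.002 0.018 0.119 0.5 0.881 0.982 0.998]
print(np.round(tanh, 3))   # approx. [-1. -0.999 -0.964 0. 0.964 0.999 1.]
```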

3. Deep (Stacked) LSTMs

To increase the representational power of the model, we can vertically stack multiple LSTM cells on top of each other.

Key Mechanics:

  • Vertical Flow: The hidden state $h_t^{(l-1)}$ from layer $l-1$ serves as the input $x_t^{(l)}$ for layer $l$.
  • State Isolation: Each layer maintains its own unique Cell State ($C_t$) and Hidden State ($h_t$). They do not share memory.
  • Final Prediction: Only the hidden state of the top-most layer is typically used to calculate the output $\hat{y}_t$.
graph TD
    subgraph Layer2 [Layer 2: High-Level Features]
    L2_C1[h_t-1] --> L2_Cell[LSTM Cell t]
    L2_Cell --> L2_C2[h_t]
    end

    subgraph Layer1 [Layer 1: Low-Level Features]
    L1_C1[h_t-1] --> L1_Cell[LSTM Cell t]
    L1_Cell --> L1_C2[h_t]
    end

    %% Vertical Connections
    Input[Input x_t] --> L1_Cell
    L1_Cell -- "h_t becomes input" --> L2_Cell
    L2_Cell -- "Final Hidden State" --> Prediction[Softmax / Output y_t]

    %% Styling
    style L2_Cell fill:#fff59d,stroke:#333
    style L1_Cell fill:#80deea,stroke:#333
    style Prediction fill:#f48fb1,stroke:#333

Mathematical Mapping of Stacked Layers

When stacking LSTMs, we use superscript $(l)$ to track the vertical position:

  1. Input to Layer 1: $x_t^{(1)}$ is the raw data (e.g., word embedding).
  2. Input to Layer $l$: $x_t^{(l)} = h_t^{(l-1)}$. The hidden state of the lower layer becomes the input for the layer above.
  3. Hidden State of Layer $l$: $h_t^{(l)}$ is calculated using the gates within that specific layer.
  4. Final Output: Usually derived from the last layer $L$: $\hat{y}_t = f(h_t^{(L)})$.

[!CAUTION] Common Misconception: Each layer has its own unique weights ($W^{(l)}$) and biases ($b^{(l)}$). Layer 1 and Layer 2 do not share the same parameters, even though they share the same LSTM logic.

[!NOTE] Tutor Insight: Stacking LSTMs allows the model to build a hierarchy of features. However, adding too many layers can make the model difficult to train and prone to overfitting. Most production models use 2 to 4 layers.
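
If PyTorch is available (an assumption made only for this sketch; the library is not otherwise used on this page), a stacked LSTM can be built in one line, and the per-layer parameter shapes confirm both points above: each layer has its own weights, and layer 2 consumes the hidden-state output of layer 1 rather than the raw input.

```python
import torch
import torch.nn as nn

input_dim, hidden_dim, num_layers = 10, 32, 2    # illustrative sizes
stacked = nn.LSTM(input_size=input_dim, hidden_size=hidden_dim,
                  num_layers=num_layers, batch_first=True)

# PyTorch packs the four gates into one matrix, hence the 4*hidden_dim rows.
print(stacked.weight_ih_l0.shape)   # torch.Size([128, 10]) -> layer 1 reads x_t
print(stacked.weight_ih_l1.shape)   # torch.Size([128, 32]) -> layer 2 reads h_t^(1)

x = torch.randn(4, 7, input_dim)                 # (batch, timesteps, features)
output, (h_n, c_n) = stacked(x)
print(output.shape)   # torch.Size([4, 7, 32]) -> h_t of the TOP layer at every t
print(h_n.shape)      # torch.Size([2, 4, 32]) -> final h_t of each layer
print(c_n.shape)      # torch.Size([2, 4, 32]) -> final C_t of each layer (kept per layer)
```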

3.1 Summary of Prediction Logic

The LSTM cell itself does not output a probability or a class. Instead, it provides a "context-rich" hidden state vector ($h_t$). To get a prediction ($\hat{y}_t$), we pass this vector through a standard linear transformation.

The Prediction Equation

For a classification task, the final output is calculated as:

$$\hat{y}_t = \text{Softmax}(W_y h_t^{(L)} + b_y)$$

Where:

  • $h_t^{(L)}$: The hidden state from the top-most layer ($L$) at the current timestep.

  • $W_y$: A learnable Weight matrix of size $(\text{output}_{dim} \times \text{hidden}_{dim})$.

  • $b_y$: A learnable Bias vector of size $(\text{output}_{dim} \times 1)$.

  • Softmax: An activation function that turns the "logits" into a probability distribution (summing to 1).
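
A minimal NumPy sketch of this prediction head (the sizes and weights are illustrative, not trained):

```python
import numpy as np

def softmax(z):
    z = z - z.max()                  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

hidden_dim, output_dim = 5, 3        # illustrative sizes
rng = np.random.default_rng(1)

h_top = rng.normal(size=hidden_dim)              # h_t^(L): top-layer hidden state
W_y = rng.normal(size=(output_dim, hidden_dim))  # (output_dim x hidden_dim)
b_y = np.zeros(output_dim)                       # (output_dim x 1)

logits = W_y @ h_top + b_y           # linear transformation to class scores
y_hat = softmax(logits)              # probability distribution over classes

print(y_hat, y_hat.sum())            # the probabilities sum to 1.0
```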

Workflow Diagram

graph LR
    C[Cell State C_t] -- "Filtered by Output Gate" --> H[Hidden State h_t]
    H -- "Matrix Mul (W_y)" --> Z[Logits / Score]
    Z -- "Softmax / Sigmoid" --> Y[Prediction y_hat]
    
    style H fill:#fff59d,stroke:#333
    style Y fill:#f48fb1,stroke:#333

3.2 The Vertical vs. Horizontal Flow

When stacking LSTMs, it is vital to distinguish between how the two states ($h$ and $C$) travel.

  • $C_t$ (Horizontal Only): The Cell State remains inside its original layer. It only moves from $t-1 \to t$.
  • $h_t$ (Horizontal & Vertical): The Hidden State moves forward in time ($t \to t+1$) and upward through the stack (Layer $l \to l+1$).

Deep LSTM State Flow Diagram

graph TD
    %% Timestep t-1
    subgraph T_prev [Timestep t-1]
        L2_Ct_prev[Layer 2 of C_t-1]
        L1_Ct_prev[Layer 1 of C_t-1]
    end

    %% Timestep t
    subgraph T_curr [Timestep t]
        direction TB
        L2_Cell[LSTM Cell Layer 2]
        L1_Cell[LSTM Cell Layer 1]
    end

    %% Horizontal Flow (Internal Memory)
    L1_Ct_prev -- "Internal" --> L1_Cell
    L2_Ct_prev -- "Internal" --> L2_Cell

    %% Vertical Flow (Inter-layer Communication)
    Input[Input x_t] --> L1_Cell
    L1_Cell -- "h_t ^1" --> L2_Cell
    L2_Cell -- "h_t ^2" --> Output[Prediction y_t]

    %% Styling
    style L1_Cell fill:#80deea,stroke:#333
    style L2_Cell fill:#fff59d,stroke:#333
    style L1_Ct_prev fill:none,stroke-dasharray: 5 5
    style L2_Ct_prev fill:none,stroke-dasharray: 5 5
  • $C_t^{(l)}$ stays in Layer $l$.
  • $h_t^{(l)}$ is passed to Layer $l+1$.
  • If you have a 3-layer LSTM, you actually have 3 separate "conveyor belts" running in parallel, one for each level of abstraction.

3.3 Visualizing the 2D Grid (Time vs. Depth)

To understand stacking, visualize the model as a grid where information flows in two directions:

  1. Horizontally: Through Time (Timestep $t-1 \to t \to t+1$).
  2. Vertically: Through Layers (Layer $l-1 \to l \to l+1$).
graph TD
    subgraph Timestep_T [Timestep t]
        direction TB
        L2_Cell[Layer 2 Cell]
        L1_Cell[Layer 1 Cell]
    end

    subgraph Timestep_T_Plus [Timestep t+1]
        direction TB
        L2_Next[Layer 2 Cell]
        L1_Next[Layer 1 Cell]
    end

    %% Horizontal Flow (Time)
    L1_Cell -- "h_t, C_t" --> L1_Next
    L2_Cell -- "h_t, C_t" --> L2_Next

    %% Vertical Flow (Stacking)
    Input[Input x_t] --> L1_Cell
    L1_Cell -- "h_t" --> L2_Cell
    L2_Cell --> Prediction[y_t]

    %% Axis Labels
    style L1_Cell fill:#80deea
    style L2_Cell fill:#fff59d

| Direction  | Movement                                | Data Shared     | Purpose                        |
|------------|-----------------------------------------|-----------------|--------------------------------|
| Horizontal | $t \to t+1$                             | $h_t$ and $C_t$ | Sequence memory (Context)      |
| Vertical   | $\text{Layer}_l \to \text{Layer}_{l+1}$ | $h_t$ only      | Feature extraction (Hierarchy) |
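
The table above can be traced explicitly by unrolling two LSTM cells by hand. The sketch below assumes PyTorch's LSTMCell (the same assumption as the stacking example in Section 3); note which tensors cross each arrow: $(h, C)$ move horizontally within a layer, and only $h$ moves vertically.

```python
import torch
import torch.nn as nn

input_dim, hidden_dim, T, batch = 10, 32, 5, 4
layer1 = nn.LSTMCell(input_dim, hidden_dim)      # Layer 1: reads the raw input x_t
layer2 = nn.LSTMCell(hidden_dim, hidden_dim)     # Layer 2: reads layer 1's hidden state

h1 = torch.zeros(batch, hidden_dim); c1 = torch.zeros(batch, hidden_dim)  # Layer 1 "conveyor belt"
h2 = torch.zeros(batch, hidden_dim); c2 = torch.zeros(batch, hidden_dim)  # Layer 2 "conveyor belt"

xs = torch.randn(T, batch, input_dim)            # toy input sequence

for t in range(T):
    # Horizontal flow: (h1, c1) from timestep t-1 re-enter layer 1 at timestep t.
    h1, c1 = layer1(xs[t], (h1, c1))
    # Vertical flow: only h1 is passed upward; c1 never leaves layer 1.
    h2, c2 = layer2(h1, (h2, c2))

print(h2.shape)   # torch.Size([4, 32]) -> top-layer h_t, the vector used for prediction
```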