08 Recurrent Neural Networks (RNN) - chanchishing/Introduction-to-Deep-Learning GitHub Wiki

Module: Introduction to Recurrent Neural Networks (RNNs)

1. Understanding Sequential Data

Unlike standard tabular data, sequential data relies on the order of elements to convey meaning.

  • Text: Words depend on previous context. "I love machine learning" is meaningful; "learning machine love I" is gibberish.
  • Time Series: Stock prices, weather, and sensor readings. The current value is a function of historical data.
  • Speech: Audio waveforms where meaning emerges from the sequence of sounds over time.

2. Limitations of Feedforward Networks

Traditional Fully Connected (FC) or Feedforward networks struggle with sequences for two primary reasons:

| Problem | Description |
|---|---|
| Fixed Input Size | FC networks require a specific number of inputs. They can't natively handle a 5-word sentence and a 50-word sentence using the same architecture. |
| Lack of Memory | Inputs are processed independently. The network "forgets" the first word by the time it reaches the third, losing all contextual flow. |

3. The RNN Solution: Persistent State & Computation

The core innovation of an RNN is the Hidden State ($h_t$). This acts as the network's memory, carrying information from one time step to the next.

Unrolling Through Time

While we often see RNNs drawn as a single cell with a loop, it is easier to visualize them unrolled. In this view, we see the same RNN cell applied at each time step, passing the hidden state forward like a baton in a relay race.

Shared Weights: Crucially, the weights ($W$'s) are SHARED across every time step.

  • $W_{xh}$ (Input-to-Hidden): The same matrix is used to process $x_1, x_2, \dots, x_T$.
  • $W_{hh}$ (Hidden-to-Hidden): The same matrix is used to transition from $h_{t-1}$ to $h_t$ at every step.
  • $W_{hy}$ (Hidden-to-Output): The same matrix is used to calculate every $y_t$.

Why do we do this?

  1. Efficiency: It drastically reduces the number of parameters the model needs to learn.
  2. Generalization: It allows the model to look for the same "patterns" regardless of where they appear in the sequence (e.g., a "verb" is a "verb" whether it's the 2nd word or the 20th word).
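
A minimal PyTorch sketch (sizes are illustrative, not from the notes) of the efficiency point: an `nn.RNN` layer owns a single set of shared weight matrices, so its parameter count is fixed regardless of whether it processes a 5-step or a 50-step sequence.

```python
import torch
import torch.nn as nn

# One RNN layer: one W_xh, one W_hh, and their biases, shared across all time steps.
rnn = nn.RNN(input_size=10, hidden_size=20, batch_first=True)

n_params = sum(p.numel() for p in rnn.parameters())
print(n_params)  # 640 = 20*10 + 20*20 + 20 + 20, independent of sequence length

# The same layer (same weights) handles a 5-step and a 50-step sequence.
short_seq = torch.randn(1, 5, 10)    # (batch, time, features)
long_seq = torch.randn(1, 50, 10)

out_short, _ = rnn(short_seq)        # (1, 5, 20)
out_long, _ = rnn(long_seq)          # (1, 50, 20)
print(out_short.shape, out_long.shape)
```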

Mathematical Definition

For every timestep $t$, the cell performs two main calculations:

  1. The Hidden State (Memory update): $$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$$

  2. The Raw Output (Logits): $$y_t = W_{hy} h_t + b_y$$

Variable Definitions:

  • $x_t$: Input at the current time.
  • $h_{t-1}$: Hidden state from the previous time step.
  • $W_{hh}, W_{xh}, W_{hy}$: Weight matrices.
  • $\tanh$: Activation function used to regulate information flow (squashing values between -1 and 1).
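
A minimal NumPy sketch of these two equations (dimensions chosen here purely for illustration), looping over a sequence and carrying the hidden state forward from one step to the next:

```python
import numpy as np

input_dim, hidden_dim, output_dim, T = 4, 8, 3, 5
rng = np.random.default_rng(0)

# Shared weights, reused at every time step.
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_hy = rng.normal(scale=0.1, size=(output_dim, hidden_dim))
b_h = np.zeros(hidden_dim)
b_y = np.zeros(output_dim)

x = rng.normal(size=(T, input_dim))   # one sequence of T inputs
h = np.zeros(hidden_dim)              # h_0: initial hidden state (memory starts empty)

for t in range(T):
    h = np.tanh(W_hh @ h + W_xh @ x[t] + b_h)   # memory update
    y = W_hy @ h + b_y                          # raw output (logits)
    print(t, y)
```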

Computational Flow Diagram

```mermaid
graph LR
    %% This defines the first cell
    subgraph Cell_t_minus_1 [Time Step t-1]
    xt1((x_t-1)) --> RNN1[RNN Cell]
    ht2[h_t-2] --> RNN1
    RNN1 --> yt1((y_t-1))
    end

    %% This connects the first cell to the second via the hidden state
    RNN1 -- "h_t-1 (Memory)" --> RNN2

    %% This defines the second cell
    subgraph Cell_t [Time Step t]
    xt((x_t)) --> RNN2[RNN Cell]
    RNN2 --> yt((y_t))
    RNN2 --> ht[h_t]
    end

    %% Adding some color to make it look like your slides
    style RNN1 fill:#ffcc99,stroke:#333,stroke-width:2px
    style RNN2 fill:#ffcc99,stroke:#333,stroke-width:2px
```

4. RNN Architectures & Output Utilization

RNNs are highly flexible. While the "Internal Memory" ($h_t$) is computed the same way throughout, we utilize the "External Output" ($y_t$) differently and apply different activation functions depending on the specific goal of the model.

Mapping Task Types to Outputs

In many cases, the network generates an output $y_t$ at every single time step, but we only "listen" to the ones relevant to our task.

| Architecture | How $y_t$ is utilized | Logic | Example |
|---|---|---|---|
| Many-to-One | Only the final $y_T$ is used. | We ignore all intermediate outputs until the network has seen the entire sequence. | Sentiment Analysis (Positive/Negative) |
| Many-to-Many | Every $y_t$ is used. | For every input (e.g., a word), we need a corresponding label. | Part-of-Speech (POS) Tagging (Noun, Verb, etc.) |
| Encoder-Decoder | $y_t$ is ignored in the first phase. | The "Encoder" builds a summary ($h_T$), then the "Decoder" starts generating outputs. | Machine Translation |
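
A small PyTorch sketch (shapes and names are illustrative) of how the same per-step output tensor is sliced differently depending on the task. Note that `nn.RNN` returns the hidden state at every step; a linear head (playing the role of $W_{hy}$) would then turn those into logits.

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=10, hidden_size=20, batch_first=True)
x = torch.randn(32, 15, 10)      # (batch=32, time=15, features=10)

outputs, h_T = rnn(x)            # outputs: (32, 15, 20), one vector per time step

# Many-to-One (e.g. sentiment): keep only the final step, ignore the rest.
last_step = outputs[:, -1, :]    # (32, 20) -> feed to a classification head

# Many-to-Many (e.g. POS tagging): keep every step, one label per word.
per_step = outputs               # (32, 15, 20) -> feed each step to the same head
```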

The Role of Activation Functions on $y_t$

While the hidden state $h_t$ almost always uses tanh for stability, the output $y_t$ uses an activation function to format the data for human (or machine) use:

  1. Softmax: Used for Classification. It turns the raw $y_t$ vector into probabilities that sum to 1.
    • Equation: $P(i) = \frac{e^{y_i}}{\sum_j e^{y_j}}$
  2. Sigmoid: Used for Binary Classification. It squashes the output to a value between 0 and 1.
  3. Linear (None): Used for Regression. When predicting a continuous value (like a stock price), we leave $y_t$ as a raw number.
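
Continuing the sketch above (the logits here are random placeholders), each task format corresponds to one line of PyTorch:

```python
import torch

y_t = torch.randn(32, 5)              # illustrative raw logits for 5 classes

probs = torch.softmax(y_t, dim=-1)    # classification: each row sums to 1
binary = torch.sigmoid(y_t[:, :1])    # binary classification: values in (0, 1)
value = y_t[:, :1]                    # regression: leave the raw number as-is
```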

5. Practical Implementation: $t$ and $T$

In real-world applications (e.g., training on Newspaper Articles), sequences are rarely the same length. We must standardize them for the model while maintaining the sequential logic.

Defining Time and Horizon

  • $t$ (The Time Step): Represents the specific index of a token (word) in the sequence. If you are at $t=5$, the RNN is processing the 5th word.
  • $T$ (The Horizon): The total length of the specific sequence (e.g., the total word count of an article).

Handling Variable Lengths

Because modern hardware (GPUs) requires data to be in uniform "batches," we use the following techniques:

  1. Maximum Sequence Length ($T_{max}$): We set a limit (e.g., 200 words).
  2. Padding: If an article is shorter than $T_{max}$, we add <PAD> tokens (zeros) to the end. The RNN is usually configured to ignore these during loss calculation.
  3. Truncation: If an article is longer than $T_{max}$, we cut it off to fit the limit.
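
A small Python sketch of these three steps (the token IDs and $T_{max}$ value are illustrative):

```python
T_MAX = 200   # maximum sequence length
PAD_ID = 0    # id reserved for the <PAD> token

def standardize(token_ids, t_max=T_MAX, pad_id=PAD_ID):
    """Truncate sequences longer than t_max; pad shorter ones with <PAD>."""
    if len(token_ids) > t_max:
        return token_ids[:t_max]                             # truncation
    return token_ids + [pad_id] * (t_max - len(token_ids))   # padding

short_headline = list(range(1, 13))    # 12 tokens  -> padded with 188 zeros
long_editorial = list(range(1, 451))   # 450 tokens -> cut off at word 200

print(len(standardize(short_headline)))   # 200
print(len(standardize(long_editorial)))   # 200
```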

The Importance of the Hidden State Reset

In practice, when you finish processing one article and move to the next, you must reset the hidden state ($h_0$) to zeros.

Why? If you don't reset $h_t$, the "memory" from the end of a Sports article will bleed into the beginning of a Politics article, causing the model to see a false relationship between them.
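
A hedged PyTorch sketch of this reset: each time a new article (sequence) starts, the initial hidden state is created fresh as zeros instead of being carried over from the previous article.

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=10, hidden_size=20, batch_first=True)
articles = [torch.randn(1, 30, 10), torch.randn(1, 45, 10)]  # two unrelated sequences

for article in articles:
    h0 = torch.zeros(1, 1, 20)       # reset memory: (num_layers, batch, hidden)
    outputs, hT = rnn(article, h0)   # hT from one article never leaks into the next
```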

| Scenario | $T$ (Actual) | $t$ (Processed) | Result |
|---|---|---|---|
| Short Headline | 12 | $1 \dots 200$ | Padded with 188 zeros |
| Long Editorial | 450 | $1 \dots 200$ | Truncated at word 200 |

6. Training Challenges: Gradients and Memory

While the RNN architecture is theoretically sound for sequences, training them in practice is difficult due to how information is updated over many time steps.

Backpropagation Through Time (BPTT)

To train an RNN, we use a specialized version of backpropagation called Backpropagation Through Time (BPTT).

  • The Forward Pass: Information flows from $t=1$ to $T$ to compute the loss.
  • The Backward Pass: Gradients flow backward from the final loss through every previous time step to update the shared weights ($W_{hh}, W_{xh}, W_{hy}$).
  • The Core Issue: Because the same $W_{hh}$ is multiplied repeatedly at every step, the gradients can become extremely unstable as the sequence length ($T$) increases.
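
A minimal sketch of one BPTT training step with autograd (the model, data, and task here are placeholders): the forward pass runs over the whole sequence, and `loss.backward()` sends gradients back through every time step to the shared weights.

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=10, hidden_size=20, batch_first=True)
head = nn.Linear(20, 2)    # plays the role of W_hy for a 2-class task
opt = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()), lr=0.01)

x = torch.randn(8, 100, 10)            # batch of 8 sequences, T = 100
labels = torch.randint(0, 2, (8,))

outputs, _ = rnn(x)                    # forward pass: t = 1 ... T
logits = head(outputs[:, -1, :])       # many-to-one: use only the final step
loss = nn.functional.cross_entropy(logits, labels)

opt.zero_grad()
loss.backward()                        # gradients flow back through all 100 steps
opt.step()
```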

The Vanishing Gradient Problem

This is the most common reason basic RNNs fail on long sequences.

  • The Mathematics: The gradient at early time steps depends on a chain of multiplications of $W_{hh}$ and the derivative of the $\tanh$ activation.
  • The Result: If these values are small (less than 1), the gradient shrinks exponentially: $\text{Gradient} \propto (\text{small value})^T \to 0$.
  • Consequence: The model "stops learning" from early inputs because the update signal never reaches them.
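
A rough NumPy illustration of this effect (ignoring the $\tanh$ derivative and using a random, small $W_{hh}$): the backpropagated signal is multiplied by $W_{hh}^\top$ once per step it travels back, and after many steps almost nothing reaches $t = 1$.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, T = 64, 100

W_hh = rng.normal(scale=0.05, size=(hidden_dim, hidden_dim))  # "small" weights
grad = rng.normal(size=hidden_dim)                            # gradient at the last step

for t in range(T):
    grad = W_hh.T @ grad     # one multiplication per step we move backward in time

print(np.linalg.norm(grad))  # ~0: the signal reaching the earliest steps has vanished
```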

The Exploding Gradient Problem

The exact opposite happens if weights or gradient factors are large.

  • The Result: If factors are greater than 1, the gradient grows exponentially.
  • Consequence: Weights become so large that they overflow, often resulting in NaN (Not a Number) values, which crashes the training process.
  • The Solution: Gradient Clipping. If the gradient norm exceeds a specific threshold, we manually scale it down (see the sketch below):   $$g \leftarrow \min\left(1, \frac{\text{threshold}}{\|g\|}\right) \cdot g$$
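
In PyTorch this is a one-liner applied between `backward()` and the optimizer step; a hedged sketch with a placeholder loss just to produce gradients:

```python
import torch
import torch.nn as nn

model = nn.RNN(input_size=10, hidden_size=20, batch_first=True)
x = torch.randn(4, 50, 10)

out, _ = model(x)
loss = out.pow(2).mean()     # placeholder loss purely to generate gradients
loss.backward()

# If the total gradient norm exceeds the threshold, rescale it (the formula above).
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
```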

Consequence: Short-Term Memory

Due to the vanishing gradient, basic RNNs suffer from a Short-Term Memory limitation.

  • Long-Range Dependencies: Basic RNNs can typically only learn patterns over 10–20 time steps.
  • Context Loss: Information from the beginning of a long sequence is "overwritten" or lost by the time the model reaches the end.
  • Examples of Failure:
    • Grammar: "The cat, which sat on the mat... was happy." (The RNN forgets "cat" is singular by the time it reaches the verb).
    • Logic: "I grew up in France... I speak French." (The RNN forgets the country context across a long paragraph).

Note: The Hidden State $h_t$ is a "Fixed-size vector" (e.g., 128 dimensions). Trying to cram the entire history of a long article into one small vector is like trying to summarize a whole book into a single sentence—eventually, the early details are forced out.

To address these fundamental limitations, the Long Short-Term Memory (LSTM) architecture was developed. It uses specialized "gates" to allow gradients to flow through long sequences without vanishing.
