# 08 Recurrent Neural Networks (RNN) - chanchishing/Introduction-to-Deep-Learning GitHub Wiki
Unlike standard tabular data, sequential data relies on the order of elements to convey meaning.
- Text: Words depend on previous context. "I love machine learning" is meaningful; "learning machine love I" is gibberish.
- Time Series: Stock prices, weather, and sensor readings. The current value is a function of historical data.
- Speech: Audio waveforms where meaning emerges from the sequence of sounds over time.
Traditional Fully Connected (FC) or Feedforward networks struggle with sequences for two primary reasons:
| Problem | Description |
|---|---|
| Fixed Input Size | FC networks require a specific number of inputs. They can't natively handle a 5-word sentence and a 50-word sentence using the same architecture. |
| Lack of Memory | Inputs are processed independently. The network "forgets" the first word by the time it reaches the third, losing all contextual flow. |
The core innovation of an RNN is the Hidden State ($h_t$): a vector that acts as the network's memory, carrying information from previous time steps forward into the current one.
While we often see RNNs drawn as a single cell with a loop, it is easier to visualize them unrolled. In this view, we see the same RNN cell applied at each time step, passing the hidden state forward like a baton in a relay race.
Shared Weights: Crucially, the same weight matrices are reused at every time step:

- $W_{xh}$ (Input-to-Hidden): The same matrix is used to process $x_1, x_2, \dots, x_T$.
- $W_{hh}$ (Hidden-to-Hidden): The same matrix is used to transition from $h_{t-1}$ to $h_t$ at every step.
- $W_{hy}$ (Hidden-to-Output): The same matrix is used to calculate every $y_t$.
Why do we do this?
- Efficiency: It drastically reduces the number of parameters the model needs to learn.
- Generalization: It allows the model to look for the same "patterns" regardless of where they appear in the sequence (e.g., a "verb" is a "verb" whether it's the 2nd word or the 20th word).
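To make the efficiency claim concrete, we can count parameters for a hypothetical RNN (the sizes below are illustrative assumptions, not from the text) and compare against a network that used a separate copy of the weights at each of $T$ steps:

```python
# Hypothetical sizes: 100-dim inputs, 128-dim hidden state, 10 output classes,
# sequences of T = 200 steps.
input_size, hidden_size, output_size, T = 100, 128, 10, 200

# Shared weights: W_xh, W_hh, W_hy plus the two biases, reused at every step.
shared = (input_size * hidden_size        # W_xh
          + hidden_size * hidden_size     # W_hh
          + hidden_size * output_size     # W_hy
          + hidden_size + output_size)    # b_h, b_y

# Without sharing, each of the T time steps would need its own copy.
unshared = shared * T

print(shared)    # 30602
print(unshared)  # 6120400
```

Weight sharing keeps the parameter count independent of sequence length; the unshared version grows linearly with $T$.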
For every time step $t$, the RNN computes two quantities:

- The Hidden State (Memory update):
  $$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$$
- The Raw Output (Logits):
  $$y_t = W_{hy} h_t + b_y$$
Variable Definitions:

- $x_t$: Input at the current time step.
- $h_{t-1}$: Hidden state from the previous time step.
- $W_{hh}, W_{xh}, W_{hy}$: Weight matrices.
- $\tanh$: Activation function used to regulate information flow (squashing values between -1 and 1).
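The two equations above translate directly into code. Here is a minimal NumPy sketch of one RNN cell unrolled over a toy sequence; the dimensions and random initialization are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, output_size = 4, 8, 3

# Shared weights (illustrative random initialization)
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
W_hy = rng.normal(scale=0.1, size=(output_size, hidden_size))
b_h = np.zeros(hidden_size)
b_y = np.zeros(output_size)

def rnn_step(x_t, h_prev):
    """One time step: h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h); y_t = W_hy h_t + b_y."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)
    y_t = W_hy @ h_t + b_y
    return h_t, y_t

# Unroll the same cell over T = 5 inputs, passing the hidden state forward
T = 5
h = np.zeros(hidden_size)  # h_0 initialized to zeros
for t in range(T):
    x_t = rng.normal(size=input_size)
    h, y = rnn_step(x_t, h)

print(h.shape, y.shape)  # (8,) (3,)
```

Note that `rnn_step` is called with the same weights at every iteration — the loop is exactly the "unrolled" view described above.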
```mermaid
graph LR
    %% First cell
    subgraph Cell_t_minus_1 [Time Step t-1]
        xt1((x_t-1)) --> RNN1[RNN Cell]
        ht2[h_t-2] --> RNN1
        RNN1 --> yt1((y_t-1))
    end
    %% The hidden state connects the first cell to the second
    RNN1 -- "h_t-1 (Memory)" --> RNN2
    %% Second cell
    subgraph Cell_t [Time Step t]
        xt((x_t)) --> RNN2[RNN Cell]
        RNN2 --> yt((y_t))
        RNN2 --> ht[h_t]
    end
    style RNN1 fill:#ffcc99,stroke:#333,stroke-width:2px
    style RNN2 fill:#ffcc99,stroke:#333,stroke-width:2px
```
RNNs are highly flexible. While the internal memory ($h_t$) is updated at every step, we are not obligated to use every output. In many cases the network generates an output $y_t$ at each step, but the task determines which of those outputs we actually read.
| Architecture | How | Logic | Example |
|---|---|---|---|
| Many-to-One | Only the final $y_T$ is used | We ignore all intermediate outputs until the network has seen the entire sequence. | Sentiment Analysis (Positive/Negative) |
| Many-to-Many | Every $y_t$ is used | For every input (e.g., a word), we need a corresponding label. | Part of Speech (POS) Tagging (Noun, Verb, etc.) |
| Encoder-Decoder | The final hidden state is handed to a second "decoder" RNN | The "Encoder" builds a summary ($h_T$), which the "Decoder" unrolls into a new sequence. | Machine Translation |
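The Many-to-One and Many-to-Many patterns differ only in which outputs we keep. A minimal NumPy sketch (sizes and random weights are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
hidden_size, output_size, T = 8, 3, 6

W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
W_xh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # input dim == hidden dim for brevity
W_hy = rng.normal(scale=0.1, size=(output_size, hidden_size))

xs = rng.normal(size=(T, hidden_size))  # a toy sequence of T inputs
h = np.zeros(hidden_size)
outputs = []
for x_t in xs:
    h = np.tanh(W_hh @ h + W_xh @ x_t)
    outputs.append(W_hy @ h)

# Many-to-One: keep only the final output (e.g., sentiment of the whole sentence)
sentiment_logits = outputs[-1]

# Many-to-Many: keep every output (e.g., one POS tag per word)
tag_logits = np.stack(outputs)

print(sentiment_logits.shape, tag_logits.shape)  # (3,) (6, 3)
```

The forward pass is identical in both cases; only the slice of `outputs` fed to the loss changes.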
While the hidden state uses $\tanh$ for stability, the output $y_t$ typically passes through a task-specific activation:

- Softmax: Used for Classification. It turns the raw $y_t$ vector into probabilities that sum to 100%.
  - Equation: $P(i) = \frac{e^{y_i}}{\sum_j e^{y_j}}$
- Sigmoid: Used for Binary Classification. It squashes the output to a value between 0 and 1.
  - Equation: $\sigma(y) = \frac{1}{1 + e^{-y}}$
- Linear (None): Used for Regression. When predicting a continuous value (like a stock price), we leave $y_t$ as a raw number.
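The three output heads are a few lines each in NumPy. A minimal sketch (the example logits are made up):

```python
import numpy as np

def softmax(y):
    """Classification: turn raw logits into probabilities that sum to 1."""
    e = np.exp(y - y.max())  # subtract the max for numerical stability
    return e / e.sum()

def sigmoid(y):
    """Binary classification: squash a logit into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-y))

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs.sum())   # 1.0

print(sigmoid(0.0))  # 0.5 (a logit of 0 means "no evidence either way")

# Regression ("Linear"): leave y_t as-is — no activation applied.
```

Subtracting `y.max()` before exponentiating does not change the result (it cancels in the ratio) but prevents overflow for large logits.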
In real-world applications (e.g., training on Newspaper Articles), sequences are rarely the same length. We must standardize them for the model while maintaining the sequential logic.
- $t$ (The Time Step): The index of a token (word) in the sequence. At $t=5$, the RNN is processing the 5th word.
- $T$ (The Horizon): The total length of the specific sequence (e.g., the total word count of an article).
Because modern hardware (GPUs) requires data to be in uniform "batches," we use the following techniques:
- Maximum Sequence Length ($T_{max}$): We set a limit (e.g., 200 words).
- Padding: If an article is shorter than $T_{max}$, we add `<PAD>` tokens (zeros) to the end. The RNN is usually configured to ignore these during loss calculation.
- Truncation: If an article is longer than $T_{max}$, we cut it off to fit the limit.
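Padding and truncation together reduce to a short helper. A sketch, assuming token id 0 is reserved for `<PAD>` and $T_{max} = 200$:

```python
PAD = 0      # assumption: id 0 is reserved for the <PAD> token
T_MAX = 200  # maximum sequence length

def pad_or_truncate(token_ids, t_max=T_MAX, pad_id=PAD):
    """Force every sequence to exactly t_max tokens."""
    if len(token_ids) >= t_max:
        return token_ids[:t_max]                            # truncation
    return token_ids + [pad_id] * (t_max - len(token_ids))  # padding

short_headline = list(range(1, 13))   # 12 tokens
long_editorial = list(range(1, 451))  # 450 tokens

print(len(pad_or_truncate(short_headline)))  # 200
print(len(pad_or_truncate(long_editorial)))  # 200
```

After this step every sequence in a batch has identical length, so the batch can be stored as one rectangular tensor for the GPU.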
### The Importance of the Hidden State Reset

In practice, when you finish processing one article and move to the next, you must reset the hidden state ($h_t \leftarrow 0$).

Why? If you don't reset $h_t$, the "memory" from the end of a Sports article will bleed into the beginning of a Politics article, causing the model to see a false relationship between them.
| Scenario | Length ($T$) | Action | Result |
|---|---|---|---|
| Short Headline | 12 | Padded with 188 zeros | Final length 200 |
| Long Editorial | 450 | Truncated at word 200 | Final length 200 |
While the RNN architecture is theoretically sound for sequences, training them in practice is difficult due to how information is updated over many time steps.
To train an RNN, we use a specialized version of backpropagation called BPTT.
- The Forward Pass: Information flows from $t=1$ to $T$ to compute the loss.
- The Backward Pass: Gradients flow backward from the final loss through every previous time step to update the shared weights ($W_{hh}, W_{xh}, W_{hy}$).
- The Core Issue: Because the same $W_{hh}$ is multiplied repeatedly at every step, the gradients can become extremely unstable as the sequence length ($T$) increases.
This is the most common reason basic RNNs fail on long sequences.
- The Mathematics: The gradient at early time steps depends on a chain of multiplications of $W_{hh}$ and the derivative of the $\tanh$ activation.
- The Result: If these values are small (less than 1), the gradient shrinks exponentially: $\text{Gradient} \propto (\text{small value})^T \to 0$.
- Consequence: The model "stops learning" from early inputs because the update signal never reaches them.
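The exponential shrinkage is easy to observe numerically. A sketch, assuming a small random $W_{hh}$ (one whose eigenvalues all have magnitude below 1) and ignoring the $\tanh$ derivative, which is at most 1 and would only shrink the gradient further:

```python
import numpy as np

rng = np.random.default_rng(2)
hidden_size = 16

# Small random W_hh: repeated multiplication contracts the gradient.
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))

grad = np.ones(hidden_size)  # gradient arriving at the last time step
norms = []
for t in range(50):
    # Simplified BPTT step: grad <- W_hh^T grad (tanh' factor omitted)
    grad = W_hh.T @ grad
    norms.append(np.linalg.norm(grad))

print(norms[0], norms[-1])  # the norm decays toward 0 going back in time
```

After 50 simulated steps the gradient norm is vanishingly small — the early time steps receive essentially no learning signal.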
The exact opposite happens if weights or gradient factors are large.
- The Result: If factors are greater than 1, the gradient grows exponentially.
- Consequence: Weights become so large that they overflow, often resulting in `NaN` (Not a Number) values, which crashes the training process.
- The Solution: Gradient Clipping. If the gradient norm exceeds a specific threshold, we manually scale it down:
  $$g \leftarrow \min\left(1, \frac{\text{threshold}}{\|g\|}\right) \cdot g$$
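The clipping rule above is a one-liner in practice. A minimal sketch (the threshold of 5.0 is an arbitrary illustrative choice):

```python
import numpy as np

def clip_gradient(g, threshold=5.0):
    """Rescale g so its norm never exceeds threshold: g <- min(1, threshold/||g||) * g."""
    norm = np.linalg.norm(g)
    if norm > threshold:
        g = (threshold / norm) * g
    return g

exploding = np.array([300.0, 400.0])  # norm = 500, far past the threshold
clipped = clip_gradient(exploding)
print(np.linalg.norm(clipped))        # 5.0

small = np.array([0.3, 0.4])          # norm = 0.5, left untouched
print(clip_gradient(small))           # [0.3 0.4]
```

Clipping preserves the gradient's direction and only caps its magnitude, so updates still point the right way even after a gradient spike.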
Due to the vanishing gradient, basic RNNs suffer from a Short-Term Memory limitation.
- Long-Range Dependencies: Basic RNNs can typically only learn patterns over 10–20 time steps.
- Context Loss: Information from the beginning of a long sequence is "overwritten" or lost by the time the model reaches the end.
- Examples of Failure:
  - Grammar: "The cat, which sat on the mat... was happy." (The RNN forgets "cat" is singular by the time it reaches the verb).
  - Logic: "I grew up in France... I speak French." (The RNN forgets the country context across a long paragraph).
Note: The Hidden State $h_t$ is a fixed-size vector (e.g., 128 dimensions). Trying to cram the entire history of a long article into one small vector is like trying to summarize a whole book into a single sentence: eventually, the early details are forced out.
To address these fundamental limitations, the Long Short-Term Memory (LSTM) architecture was developed. It uses specialized "gates" to allow gradients to flow through long sequences without vanishing.