08 Long Short Term Neural Networks - PAI-yoonsung/lstm-paper GitHub Wiki
8 Long Short-Term Neural Networks
One solution that addresses the vanishing error problem is a gradient-based method called long short-term memory (LSTM), published in [41], [42], [22] and [23].
LSTM can learn how to bridge minimal time lags of more than 1,000 discrete time steps.
The solution uses constant error carousels (CECs), which enforce a constant error flow within special cells.
Access to the cells is handled by multiplicative gate units, which learn when to grant access.
8.1 Constant Error Carousel
Suppose that we have only one unit u with a single connection to itself.
The local error backflow of u at a single time step τ follows from Equation 20 and is given by
From Equations 22 and 23 we see that, in order to ensure a constant error flow through u, we need to have
and by integration we have
From this, we learn that f_u must be linear, and that u's activation must remain constant over time; i.e.,
This is ensured by using the identity function f_u = id, and by setting W[u,u] = 1.0.
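Since Equations 20–23 are not reproduced on this page, the derivation can be sketched under assumed notation (ϑ_u for the backflowed error, z_u for u's net input, w_{uu} for the self-connection weight); this follows the standard CEC argument:

```latex
% Error backflow through u's self-connection (cf. Equation 20):
\vartheta_u(\tau) \;=\; f_u'\!\bigl(z_u(\tau)\bigr)\, w_{uu}\, \vartheta_u(\tau+1)

% Constant error flow through u therefore requires (cf. Equations 22 and 23):
f_u'\!\bigl(z_u(\tau)\bigr)\, w_{uu} \;=\; 1.0

% Integrating with respect to z_u(\tau):
f_u\bigl(z_u(\tau)\bigr) \;=\; \frac{z_u(\tau)}{w_{uu}}

% With f_u = \mathrm{id} and w_{uu} = 1.0, u's activation is preserved:
y_u(\tau+1) \;=\; f_u\bigl(w_{uu}\, y_u(\tau)\bigr) \;=\; y_u(\tau)
```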
This preservation of error is called the constant error carousel (CEC), and it is the central feature of LSTM, where short-term memory storage is achieved for extended periods of time.
Clearly, we still need to handle the connections from other units to the unit u, and this is where the different components of LSTM networks come into the picture.
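The effect of the CEC condition can be illustrated with a toy simulation (my own sketch, not code from the paper): the error flowing back through u's self-connection is multiplied by f'(z)·w once per time step.

```python
# Toy illustration (not from the paper): with f = id and w = 1.0 (the CEC
# condition) the error is preserved over many steps; with a squashing
# derivative such as 0.25 (the maximum of the logistic sigmoid's
# derivative) the error vanishes.

def backflow(error, f_prime, w, steps):
    """Scale an error signal by the factor f'(z) * w once per time step."""
    for _ in range(steps):
        error *= f_prime * w
    return error

cec = backflow(1.0, f_prime=1.0, w=1.0, steps=1000)        # CEC condition
vanished = backflow(1.0, f_prime=0.25, w=1.0, steps=1000)  # sigmoid-like unit

print(cec)       # 1.0 -- constant error flow over 1,000 steps
print(vanished)  # 0.0 -- the error has underflowed to zero
```

This mirrors the claim above that LSTM can bridge time lags of more than 1,000 discrete time steps: the backflowed error neither vanishes nor explodes.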
8.2 Memory blocks
In the absence of new inputs to the cell, we now know that the CECβs backflow remains constant.
However, as part of a neural network, the CEC is not only connected to itself, but also to other units in the neural network.
We need to take these additional weighted inputs and outputs into account.
Incoming connections to neuron u can have conflicting weight update signals, because the same weight is used for storing and ignoring inputs.
For weighted output connections from neuron u, the same weights can be used to both retrieve u's contents and prevent u's output flow to other neurons in the network.
To address the problem of conflicting weight updates, LSTM extends the CEC with input and output gates connected to the network input layer and to other memory cells.
This results in a more complex LSTM unit, called a memory block; its standard architecture is shown in Figure 11.
The input gates, which are simple sigmoid threshold units with an activation function range of [0, 1], control the signals from the network to the memory cell by scaling them appropriately; when the gate is closed, activation is close to zero.
Additionally, these can learn to protect the contents stored in u from disturbance by irrelevant signals.
The activation of a CEC by the input gate is defined as the cell state.
The output gates can learn how to control access to the memory cell contents, which protects other memory cells from disturbances originating from u.
So we can see that the basic function of multiplicative gate units is to either allow or deny access to constant error flow through the CEC.
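A minimal sketch of one memory block step, assuming tanh squashing functions for the cell input and output (the paper's exact choices of squashing functions may differ) and logistic sigmoid gates, as in the original forget-gate-free LSTM:

```python
import math

def sigmoid(x):
    """Logistic sigmoid, range (0, 1) -- the gate activation function."""
    return 1.0 / (1.0 + math.exp(-x))

def memory_block_step(s, net_in, net_in_gate, net_out_gate):
    """One step of an original-style LSTM memory block (no forget gate).

    s            -- current cell state (the CEC activation)
    net_in       -- net input to the memory cell from the network
    net_in_gate  -- net input to the input gate
    net_out_gate -- net input to the output gate
    """
    g = math.tanh(net_in)        # squashed cell input (tanh is an assumption)
    i = sigmoid(net_in_gate)     # input gate: scale or ignore incoming signals
    o = sigmoid(net_out_gate)    # output gate: grant or deny access to the cell
    s_new = s + i * g            # CEC: the state accumulates additively
    y = o * math.tanh(s_new)     # gated output to the rest of the network
    return s_new, y

# Closed gates (large negative net inputs) protect the stored contents:
s, y = memory_block_step(0.5, net_in=2.0, net_in_gate=-50.0, net_out_gate=-50.0)
print(round(s, 6), round(y, 6))  # 0.5 0.0 -- state undisturbed, output blocked
```

With the input gate closed, an irrelevant signal on `net_in` cannot disturb the stored state; with the output gate closed, u cannot disturb other memory cells, exactly the two protections described above.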