09 Training LSTM RNNs the Hybrid Learning Approach - PAI-yoonsung/lstm-paper GitHub Wiki
9 Training LSTM-RNNs - the Hybrid Learning Approach
In order to preserve the CEC in LSTM memory block cells, the original formulation of LSTM used a combination of two learning algorithms: BPTT to train network components located after cells, and RTRL to train network components located before and including cells.
The latter components are trained with RTRL because some partial derivatives (related to the state of the cell) must be computed at every step, whether or not a target value is given at that step.
For now, we only allow the gradient of the cell to be propagated through time, truncating the gradients of the other recurrent connections.
We define discrete time steps in the form Ο = 1, 2, 3, .... Each step has a forward pass and a backward pass; in the forward pass the outputs/activations of all units are calculated, whereas in the backward pass the error signals for all weights are calculated.
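The alternation of forward and backward passes per step can be sketched as follows; `forward` and `backward` here are placeholders for the per-step computations described in this section, and all names are illustrative rather than from the original:

```python
# Sketch of the per-step training loop: each discrete step tau runs a
# forward pass (compute all unit activations) and, when a target is
# available, a backward pass (compute error signals for the weights).
def train_sequence(inputs, targets, forward, backward, state):
    losses = []
    for tau, (x, t) in enumerate(zip(inputs, targets), start=1):
        state, y = forward(state, x)   # forward pass at step tau
        if t is not None:              # a target may or may not be given
            losses.append(backward(state, y, t))  # backward pass
    return losses
```

Note that the backward pass only runs at steps where a target exists, matching the remark above that some quantities must nevertheless be tracked at every step.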
9.1 The Forward Pass
Let M be the set of memory blocks. Let m_c be the c-th memory cell in the memory block m, and W[u,v] be a weight connecting unit u to unit v.
In the original formulation of LSTM, each memory block m is associated with one input gate in_m and one output gate out_m.
The internal state of a memory cell m_c at time Ο + 1 is updated from its previous state s_m_c(Ο) and from the weighted input z_m_c(Ο + 1) multiplied by the activation of the input gate y_in_m(Ο + 1).
Then, we use the activation of the output gate y_out_m(Ο + 1) to calculate the activation of the cell y_m_c(Ο + 1).
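The cell update just described can be sketched as follows; `g` is the input squashing function, and the gate activations `y_in` and `y_out` are assumed to have been computed already (the function and argument names are illustrative):

```python
import numpy as np

def cell_forward(s_prev, z_c, y_in, y_out, g=np.tanh):
    """One forward step of a memory cell.

    s_prev : s_m_c(tau), previous internal state
    z_c    : z_m_c(tau + 1), weighted input to the cell
    y_in   : input-gate activation y_in_m(tau + 1)
    y_out  : output-gate activation y_out_m(tau + 1)
    """
    # CEC: the state carries over with weight 1, plus the gated,
    # squashed input.
    s = s_prev + y_in * g(z_c)
    # Cell output: the state multiplied by the output-gate activation
    # (as in Figure 10; some formulations squash s again first).
    y_c = y_out * s
    return s, y_c
```

With the input gate near 0 the state is simply carried forward unchanged, which is exactly the constant-error-carousel behaviour the CEC is meant to provide.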
The activation y_in_m of the input gate in_m is computed as

y_in_m(Ο + 1) = f_in_m(z_in_m(Ο + 1)),  where  z_in_m(Ο + 1) = Ξ£_u W[in_m, u] y_u(Ο),

f_in_m is the gate's squashing function (typically the logistic sigmoid), and u ranges over all units feeding the gate.
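In code, the gate activation is a squashed weighted sum of the incoming unit activations from the previous step; a minimal sketch, assuming the logistic sigmoid as the gate's squashing function:

```python
import numpy as np

def gate_activation(W_gate, y_prev):
    """y_gate(tau+1) = f(sum_u W[gate, u] * y_u(tau)), f = logistic sigmoid."""
    z = W_gate @ y_prev              # weighted input z_gate(tau + 1)
    return 1.0 / (1.0 + np.exp(-z))  # squashed into (0, 1)
```

The sigmoid keeps the gate's output in (0, 1), so it acts as a soft switch scaling how much of the squashed input is written into the cell state.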
Figure 10: A standard LSTM memory block. The block contains (at least) one cell with a recurrent self-connection (CEC) and weight of '1'. The state of the cell is denoted as s_c. Read and write access is regulated by the input gate, y_in, and the output gate, y_out. The internal cell state is calculated by multiplying the result of the squashed input, g, by the result of the input gate, y_in, and then adding the state of the last time step, s_c(t - 1). Finally, the cell output is calculated by multiplying the cell state, s_c, by the activation of the output gate, y_out.
Figure 11: A standard LSTM memory block. The block contains (at least) one cell with a recurrent self-connection (CEC) and weight of '1'. The state of the cell is denoted as s_c. Read and write access is regulated by the input gate, y_in, and the output gate, y_out. The internal cell state is calculated by multiplying the squashed input, g(x), by the result of the input gate, and adding the state of the current time step, s_m_c(t), into the next state, s_m_c(t + 1). Finally, the cell output is calculated by multiplying the cell state by the activation of the output gate.