07 Solving the Vanishing Error Problem
Standard RNNs cannot bridge more than 5–10 time steps ([22]).
This is because back-propagated error signals tend to either grow or shrink with every time step.
Over many time steps the error therefore typically blows up or vanishes ([5, 42]).
Blown-up error signals lead straight to oscillating weights, whereas with a vanishing error, learning takes an unacceptable amount of time, or does not work at all.
The following explains how the gradients are computed by the conventional back-propagation algorithm, together with a basic analysis of the back-propagated error: we train the network from time-step t' up to time-step t and then update the weights with the corresponding update formula. (In the original page, these formulas appear as the first and second equation images.)
The error signal propagated back to time τ (with t' ≤ τ < t) of a unit u is given by Equation 20.
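The equation images did not survive in this copy of the page. The following is a plausible reconstruction of Equation 20, following the standard BPTT error analysis this passage summarizes (cf. Hochreiter & Schmidhuber's LSTM paper); here f_u is the activation function of unit u, z_u(τ) its net input at time τ, and w_{vu} the weight on the connection from unit u to unit v:

$$
\vartheta_u(\tau) = f'_u\big(z_u(\tau)\big) \sum_{v \in U} w_{vu}\,\vartheta_v(\tau + 1)
$$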
Consequently, given a fully recurrent neural network with a set of non-input units U, the error signal that occurs at any chosen output-layer neuron o ∈ O, at time-step t, is propagated back through time for t-t' time-steps, with t' < t, to an arbitrary neuron v.
This causes the error to be scaled by the following factor:
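A plausible reconstruction of the missing scaling factor, written recursively in the same notation (the base case is a single back-propagation step; the recursive case chains through the intermediate neurons):

$$
\frac{\partial \vartheta_v(t')}{\partial \vartheta_o(t)} =
\begin{cases}
f'_v\big(z_v(t')\big)\,w_{ov} & \text{if } t - t' = 1,\\
f'_v\big(z_v(t')\big)\displaystyle\sum_{u \in U} \frac{\partial \vartheta_u(t'+1)}{\partial \vartheta_o(t)}\,w_{uv} & \text{if } t - t' > 1.
\end{cases}
$$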
To solve the above equation, we unroll it over time. For t' ≤ τ ≤ t, let u_τ be a non-input-layer neuron in one of the replicas in the unrolled network at time τ. Now, by setting u_t = v and u_{t'} = o, we obtain Equation 21.
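A plausible reconstruction of the missing Equation 21: unrolling the recursion turns the scaling factor into a sum over all paths of neurons u_{t'+1}, …, u_{t-1} through the replicas, with one f'·w factor per time step:

$$
\frac{\partial \vartheta_v(t')}{\partial \vartheta_o(t)} =
\sum_{u_{t'+1} \in U} \cdots \sum_{u_{t-1} \in U}\;
\prod_{\tau = t'+1}^{t} f'_{u_\tau}\big(z_{u_\tau}(\tau)\big)\, w_{u_\tau u_{\tau-1}}
$$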
Observing Equation 21, it follows that if
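*(a plausible reconstruction of the missing Equation 22, in the notation of Equation 21)*

$$
\left| f'_{u_\tau}\big(z_{u_\tau}(\tau)\big)\, w_{u_\tau u_{\tau-1}} \right| > 1
$$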
for all Ο , then the product will grow exponentially, causing the error to blow-up; moreover, conflicting error signals arriving at neuron v can lead to oscillating weights and unstable learning. If now
λͺ¨λ Ο μ λν΄μ, μμ°λ¬Όμ΄ κΈ°νκΈμμ μΌλ‘ 컀μ§κ²λκ³ , μλ¬κ° νλ°νκ² λλ€; λ λμκ°, λ΄λ° v μ λμ°©νλ μλ‘ λ°λλλ μλ¬ μ νΈλ€μ μ§λνλ κ°μ€μΉμ λΆμμ ν νμ΅μ μΌκΈ°νκ² λλ€.
λ§μ½ λ€μκ³Ό κ°μ κ²½μ°λ (23λ²)
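*(a plausible reconstruction of the missing Equation 23)*

$$
\left| f'_{u_\tau}\big(z_{u_\tau}(\tau)\big)\, w_{u_\tau u_{\tau-1}} \right| < 1
$$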
for all τ, then the product decreases exponentially, causing the error to vanish, preventing the network from learning within an acceptable time period. Finally, the equation
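*(a plausible reconstruction of the missing equation: the global error arriving at v is the sum of the scaled local errors from every output neuron)*

$$
\vartheta_v(t') = \sum_{o \in O} \frac{\partial \vartheta_v(t')}{\partial \vartheta_o(t)}\,\vartheta_o(t)
$$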
shows that if the local error vanishes, then the global error also vanishes. A more detailed theoretical analysis of the problem with long-term dependencies is presented in [39]. The paper also briefly outlines several proposals on how to address this problem.
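The dichotomy in Equations 22 and 23 is easy to check numerically. The sketch below (not part of the original page; the function and variable names are ours) evaluates the product from Equation 21 along a single path in a one-neuron tanh network with recurrent weight w, for a given sequence of net inputs:

```python
import numpy as np

def error_scaling(w, zs):
    """Product of |f'(z_tau) * w| over one unrolled path (cf. Equation 21),
    for f = tanh and a single recurrent weight w."""
    dfs = 1.0 - np.tanh(zs) ** 2      # f'(z) = 1 - tanh(z)^2
    return np.prod(np.abs(dfs * w))

zs = np.linspace(-0.5, 0.5, 50)       # 50 time steps of mildly saturated net inputs
for w in (0.5, 1.0, 2.0):
    print(f"w = {w}: error scaled by {error_scaling(w, zs):.3e} after 50 steps")

# Roughly: w = 0.5 shrinks the error to ~1e-17 (vanishing, Equation 23),
# while w = 2.0 inflates it to ~1e+13 (blow-up, Equation 22); with strongly
# saturated units f' -> 0, so even large weights often fail to preserve it.
```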