07 Solving the Vanishing Error Problem
Standard RNNs cannot bridge more than 5–10 time steps ([22]).
This is because back-propagated error signals tend to either grow or shrink with every time step.
Over many time steps the error therefore typically blows up or vanishes ([5, 42]).
Blown-up error signals lead straight to oscillating weights, whereas with a vanishing error, learning takes an unacceptable amount of time, or does not work at all.
The following explains how the gradients are computed by the conventional back-propagation algorithm, together with a basic analysis of the back-propagated error: we train the network from time-step t' up to time-step t and then update the weights with the corresponding update formula. (In the original page, these formulas appear as the first and second equation images.)
The error signal propagated back to time τ (with t' ≤ τ < t) of a unit u is given by Equation 20.
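The equation images did not survive in this copy of the page. The following is a plausible reconstruction of Equation 20, following the standard BPTT error analysis this passage summarizes (cf. Hochreiter & Schmidhuber's LSTM paper); here f_u is the activation function of unit u, z_u(τ) its net input at time τ, and w_{vu} the weight on the connection from unit u to unit v:

$$
\vartheta_u(\tau) = f'_u\big(z_u(\tau)\big) \sum_{v \in U} w_{vu}\,\vartheta_v(\tau + 1)
$$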
Consequently, given a fully recurrent neural network with a set of non-input units U, the error signal that occurs at any chosen output-layer neuron o ∈ O, at time-step t, is propagated back through time for t-t' time-steps, with t' < t, to an arbitrary neuron v.
This causes the error to be scaled by the following factor:
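A plausible reconstruction of the missing scaling factor, written recursively in the same notation (the base case is a single back-propagation step; the recursive case chains through the intermediate neurons):

$$
\frac{\partial \vartheta_v(t')}{\partial \vartheta_o(t)} =
\begin{cases}
f'_v\big(z_v(t')\big)\,w_{ov} & \text{if } t - t' = 1,\\
f'_v\big(z_v(t')\big)\displaystyle\sum_{u \in U} \frac{\partial \vartheta_u(t'+1)}{\partial \vartheta_o(t)}\,w_{uv} & \text{if } t - t' > 1.
\end{cases}
$$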
To solve the above equation, we unroll it over time. For t' ≤ τ ≤ t, let u_τ be a non-input-layer neuron in one of the replicas in the unrolled network at time τ. Now, by setting u_t = v and u_{t'} = o, we obtain Equation 21.
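A plausible reconstruction of the missing Equation 21: unrolling the recursion turns the scaling factor into a sum over all paths of neurons u_{t'+1}, …, u_{t-1} through the replicas, with one f'·w factor per time step:

$$
\frac{\partial \vartheta_v(t')}{\partial \vartheta_o(t)} =
\sum_{u_{t'+1} \in U} \cdots \sum_{u_{t-1} \in U}\;
\prod_{\tau = t'+1}^{t} f'_{u_\tau}\big(z_{u_\tau}(\tau)\big)\, w_{u_\tau u_{\tau-1}}
$$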
Observing Equation 21, it follows that if
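*(a plausible reconstruction of the missing Equation 22, in the notation of Equation 21)*

$$
\left| f'_{u_\tau}\big(z_{u_\tau}(\tau)\big)\, w_{u_\tau u_{\tau-1}} \right| > 1
$$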
for all Ο , then the product will grow exponentially, causing the error to blow-up; moreover, conflicting error signals arriving at neuron v can lead to oscillating weights and unstable learning. If now
λͺ¨λ Ο μ λν΄μ, μμ°λ¬Όμ΄ κΈ°νκΈμμ μΌλ‘ 컀μ§κ²λκ³ , μλ¬κ° νλ°νκ² λλ€; λ λμκ°, λ΄λ° v μ λμ°©νλ μλ‘ λ°λλλ μλ¬ μ νΈλ€μ μ§λνλ κ°μ€μΉμ λΆμμ ν νμ΅μ μΌκΈ°νκ² λλ€.
λ§μ½ λ€μκ³Ό κ°μ κ²½μ°λ (23λ²)
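*(a plausible reconstruction of the missing Equation 23)*

$$
\left| f'_{u_\tau}\big(z_{u_\tau}(\tau)\big)\, w_{u_\tau u_{\tau-1}} \right| < 1
$$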
for all τ, then the product decreases exponentially, causing the error to vanish, preventing the network from learning within an acceptable time period. Finally, the equation
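*(a plausible reconstruction of the missing equation: the global error arriving at v is the sum of the scaled local errors from every output neuron)*

$$
\vartheta_v(t') = \sum_{o \in O} \frac{\partial \vartheta_v(t')}{\partial \vartheta_o(t)}\,\vartheta_o(t)
$$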
shows that if the local error vanishes, then the global error also vanishes. A more detailed theoretical analysis of the problem with long-term dependencies is presented in [39]. The paper also briefly outlines several proposals on how to address this problem.
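The dichotomy in Equations 22 and 23 is easy to check numerically. The sketch below (not part of the original page; the function and variable names are ours) evaluates the product from Equation 21 along a single path in a one-neuron tanh network with recurrent weight w, for a given sequence of net inputs:

```python
import numpy as np

def error_scaling(w, zs):
    """Product of |f'(z_tau) * w| over one unrolled path (cf. Equation 21),
    for f = tanh and a single recurrent weight w."""
    dfs = 1.0 - np.tanh(zs) ** 2      # f'(z) = 1 - tanh(z)^2
    return np.prod(np.abs(dfs * w))

zs = np.linspace(-0.5, 0.5, 50)       # 50 time steps of mildly saturated net inputs
for w in (0.5, 1.0, 2.0):
    print(f"w = {w}: error scaled by {error_scaling(w, zs):.3e} after 50 steps")

# Roughly: w = 0.5 shrinks the error to ~1e-17 (vanishing, Equation 23),
# while w = 2.0 inflates it to ~1e+13 (blow-up, Equation 22); with strongly
# saturated units f' -> 0, so even large weights often fail to preserve it.
```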