06 Training Recurrent Neural Networks - PAI-yoonsung/lstm-paper GitHub Wiki
The most common methods to train recurrent neural networks are Backpropagation Through Time (BPTT) [62, 74, 75] and Real-Time Recurrent Learning (RTRL) [75, 76], with BPTT being the more widely used of the two.
The main difference between BPTT and RTRL is the way the weight changes are calculated.
The original formulation of LSTM-RNNs used a combination of BPTT and RTRL.
Therefore we cover both learning algorithms in short.
6.1 Backpropagation Through Time
The BPTT algorithm makes use of the fact that, for a finite period of time, there is an FFNN with identical behaviour for every RNN.
To obtain this FFNN, we need to unfold the RNN in time.
Figure 9a shows a simple, fully recurrent neural network with a single two-neuron layer.
The corresponding feed-forward neural network, shown in Figure 9b, requires a separate layer for each time step with the same weights for all layers.
If the weights are identical to those of the RNN, both networks show the same behaviour.
The unfolded network can be trained using the backpropagation algorithm described in Section 4.
At the end of a training sequence, the network is unfolded in time.
The error is calculated for the output units with existing target values using some chosen error measure.
Then, the error is injected backwards into the network and the weight updates for all time steps are calculated.
The weights in the recurrent version of the network are updated with the sum of their deltas over all time steps.
Figure 9 shows a simple fully recurrent neural network with a single two-neuron layer (a) and, unfolded in time with a separate layer for each time step, the corresponding feed-forward neural network (b).
The output of a unit at time τ + 1 is given by Equation 6, and its net input, including the weighted inputs, by Equation 7,
where v ∈ U ∩ Pre(u) and i ∈ I, the set of input units.
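Equations 6 and 7 are not reproduced in this text; a reconstruction consistent with the surrounding definitions (standard notation for a unit's output and its net input) would read:

```latex
y_u(\tau+1) = f_u\!\left(\mathrm{net}_u(\tau+1)\right),
\qquad
\mathrm{net}_u(\tau+1)
  = \sum_{v \in U \cap \mathrm{Pre}(u)} W[u,v]\, y_v(\tau)
  + \sum_{i \in I \cap \mathrm{Pre}(u)} W[u,i]\, y_i(\tau+1)
```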
Note that the inputs to u at time τ + 1 are of two types: the environmental input that arrives at time τ + 1 via the input units, and the recurrent output from all non-input units in the network produced at time τ.
If the network is fully connected, then U ∩ Pre(u) is equal to the set U of non-input units.
Let T(τ) be the set of non-input units for which, at time τ, the output value y_u(τ) of the unit u ∈ T(τ) should match some target value d_u(τ).
The cost function is the summed error E_total(t', t) for the epoch t', t'+1, ..., t, which we want to minimise using a learning algorithm.
The summed error is defined by Equation 8. The error E(τ) at time step τ is defined using the squared error as the objective function (Equation 9), and the error e_u(τ) of a non-input unit u at time step τ is defined by Equation 10.
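Equations 8 to 10 are not reproduced here; with the definitions above, the usual squared-error formulation is:

```latex
E_{\mathrm{total}}(t', t) = \sum_{\tau = t'}^{t} E(\tau),
\qquad
E(\tau) = \tfrac{1}{2} \sum_{u \in T(\tau)} \bigl(e_u(\tau)\bigr)^2,
\qquad
e_u(\tau) = d_u(\tau) - y_u(\tau)
```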
To adapt the weights, the error signal ϑ_u(τ) of a non-input unit u at time step τ is defined as in Equation 11.
Unfolding ϑ_u in time, we obtain Equation 12.
After the backpropagation computation reaches time t', the weights of the recurrent version of the network are updated with ΔW[u,v], which sums the weight updates of all time steps.
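The unfold–backpropagate–sum procedure described above can be sketched in a few lines of NumPy. This is only a toy illustration, not the paper's exact formulation: it assumes a single fully recurrent tanh layer, a target value at every time step, and made-up names such as `W_rec` and `W_in`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_units, n_inputs, T = 2, 3, 5

W_in = rng.normal(scale=0.5, size=(n_units, n_inputs))   # input weights
W_rec = rng.normal(scale=0.5, size=(n_units, n_units))   # recurrent weights

x = rng.normal(size=(T, n_inputs))        # input sequence
d = rng.normal(size=(T, n_units))         # target values for each step

# Forward pass: unfold the network in time, storing all activations.
y = np.zeros((T + 1, n_units))            # y[0] is the initial state
for t in range(T):
    y[t + 1] = np.tanh(W_rec @ y[t] + W_in @ x[t])

# Backward pass: inject the error and propagate it back through the
# unfolded layers, accumulating one weight delta per time step.
dW_rec = np.zeros_like(W_rec)
dW_in = np.zeros_like(W_in)
carry = np.zeros(n_units)                 # error flowing back from step t+1
for t in reversed(range(T)):
    e = (d[t] - y[t + 1]) + carry         # injected + backpropagated error
    delta = e * (1.0 - y[t + 1] ** 2)     # tanh'(net) = 1 - tanh(net)^2
    dW_rec += np.outer(delta, y[t])       # same weights, summed over steps
    dW_in += np.outer(delta, x[t])
    carry = W_rec.T @ delta

alpha = 0.1                               # learning rate
W_rec += alpha * dW_rec                   # one update with the summed deltas
W_in += alpha * dW_in
```

Note that a single weight update is performed per sequence, using the deltas summed over all unfolded layers, as the recurrent version of the network requires.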
A more detailed description of BPTT can be found in [74], [62] and [76].
6.2 Real-Time Recurrent Learning
The RTRL algorithm does not require error propagation.
All the information necessary to compute the gradient is collected as the input stream is presented to the network.
This makes a dedicated training interval obsolete.
The algorithm comes at significant computational cost per update cycle, and the stored information is non-local; i.e., we need an additional notion called the sensitivity of the output, which we'll explain later.
Nevertheless, the memory required depends only on the size of the network and not on the size of the input.
Following the notation from the previous section, we will now give definitions for the network units v ∈ I ∪ U and u, k ∈ U, and the time steps t' ≤ τ ≤ t.
Unlike BPTT, in RTRL we assume the existence of a label d_k(τ) at every time τ (given that it is an online algorithm) for every non-input unit k, so the training objective is to minimise the overall network error, which is given at time step τ by
We conclude from Equation 8 that the gradient of the total error is also the sum of the gradient for all previous time steps and the current time step:
During presentation of the time series to the network, we need to accumulate the values of the gradient at each time step. Thus, we can also keep track of the weight changes ΔW[u,v]. After presentation, the overall weight change for W[u,v] is then given by
To obtain the weight changes, we need to calculate the second equation above: for each time step t, expanding the equation according to gradient descent and applying Equation 9 yields Equation 14.
Since the error e_k(τ) = d_k(τ) − y_k(τ) is always known, we need to find a way to calculate the second factor only. We define the quantity
which measures the sensitivity of the output of unit k at time τ to a small change in the weight W[u,v], in due consideration of the effect of such a change in the weight over the entire network trajectory from time t' to t.
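Equations 14 and 15 are not reproduced in this text; consistent with the surrounding derivation, they would read:

```latex
-\frac{\partial E(\tau)}{\partial W[u,v]}
  = \sum_{k \in U} e_k(\tau)\, \frac{\partial y_k(\tau)}{\partial W[u,v]},
\qquad
p^{k}_{uv}(\tau) \equiv \frac{\partial y_k(\tau)}{\partial W[u,v]}
```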
The weight W[u,v] does not have to be connected to unit k, which makes the algorithm non-local.
Local changes in the network can have an effect anywhere in the network.
In RTRL, the gradient information is forward-propagated. Using Equations 6 and 7, the output y_k(t + 1) at time step t + 1 is given by
Applying the weighted inputs as well gives Equation 17.
Taking the derivative of Equations 15, 16 and 17, we can calculate the result for all time steps ≥ t + 1 as follows:
where δ_uk is the Kronecker delta, i.e. δ_uk = 1 if u = k and 0 otherwise.
Here, under the assumption that the initial state of the network has no functional dependence on the weights, the derivative at the first time step is:
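Equations 18 and 19 are not shown in this text; a reconstruction consistent with the description above (the sensitivity recursion and its zero initialisation) would be:

```latex
p^{k}_{uv}(t+1) = f'_k\!\left(\mathrm{net}_k(t+1)\right)
  \left[ \sum_{l \in U} W[k,l]\, p^{l}_{uv}(t) + \delta_{uk}\, y_v(t) \right],
\qquad
p^{k}_{uv}(t') = 0
```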
Equation 18 shows how p^k_uv(t + 1) can be calculated in terms of p^k_uv(t).
In this sense, the learning algorithm becomes incremental, so that we can learn as we receive new inputs (in real time), and we no longer need to perform back-propagation through time.
Knowing the initial value for p^k_uv at time t' from Equation 19, we can recursively calculate the quantities p^k_uv for the first and all subsequent time steps using Equation 18.
Note that p^k_uv(τ) uses the values of W[u,v] at t', and not the values in-between t' and τ.
Combining these values with the error vector e(τ) for that time step, using Equation 14, we can finally calculate the negative error gradient ∇_W E(τ).
The final weight change for W[u,v] can be calculated using Equations 14 and 13.
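Putting the steps above together, a whole RTRL pass can be sketched in NumPy. This is a hedged toy implementation under assumed conventions (a single fully connected tanh layer, weight columns ordered as non-input units followed by inputs, a learning rate `alpha`); the array `p[k, u, v]` plays the role of the sensitivities p^k_uv.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, T = 2, 3, 5                            # non-input units, inputs, steps
W = rng.normal(scale=0.5, size=(n, n + m))   # W[u, v]: v ranges over U then I

x = rng.normal(size=(T, m))                  # input stream
d = rng.normal(size=(T, n))                  # a label at every time step

y = np.zeros(n)                              # unit outputs
p = np.zeros((n, n, n + m))                  # p[k, u, v]: zero at t' (Eq. 19)
dW = np.zeros_like(W)                        # accumulated gradient (Eq. 13)
alpha = 0.1

for t in range(T):
    z = np.concatenate([y, x[t]])            # recurrent outputs + fresh input
    net = W @ z
    y_new = np.tanh(net)
    fprime = 1.0 - y_new ** 2                # tanh'

    # Sensitivity recursion (Eq. 18): forward-propagate gradient information.
    p_new = np.einsum('kl,luv->kuv', W[:, :n], p)  # sum_l W[k,l] p^l_uv(t)
    for u in range(n):
        p_new[u, u, :] += z                  # Kronecker delta term d_uk y_v(t)
    p_new *= fprime[:, None, None]

    # Combine with the error vector (Eq. 14) and accumulate over time.
    e = d[t] - y_new
    dW += np.einsum('k,kuv->uv', e, p_new)

    y, p = y_new, p_new

W += alpha * dW                              # final weight change for W[u,v]
```

The memory cost is visible here: `p` has one entry per (unit, weight) pair, which depends only on the network size, not on the length of the input stream.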
A more detailed description of the RTRL algorithm is given in [75] and [76].