06 Training Recurrent Neural Networks - PAI-yoonsung/lstm-paper GitHub Wiki

The most common methods to train recurrent neural networks are Backpropagation Through Time (BPTT) [62, 74, 75] and Real-Time Recurrent Learning (RTRL) [75, 76], with BPTT being the more widely used of the two.


The main difference between BPTT and RTRL is the way the weight changes are calculated.


The original formulation of LSTM-RNNs used a combination of BPTT and RTRL.


Therefore, we briefly cover both learning algorithms.


6.1 Backpropagation Through Time

The BPTT algorithm makes use of the fact that, for a finite period of time, there is an FFNN with identical behaviour for every RNN.

BPTT μ•Œκ³ λ¦¬μ¦˜μ€ μ œν•œλœ μ‹œκ°„λ™μ•ˆ 맀번의 RNNμ—μ„œ λ™μΌν•œ 행동을 ν•˜λŠ” μˆœμ „νŒŒ 신경망이 μžˆλ‹€λŠ” 점을 μ΄μš©ν•œλ‹€.

To obtain this FFNN, we need to unfold the RNN in time.


Figure 9a shows a simple, fully recurrent neural network with a single two-neuron layer.


The corresponding feed-forward neural network, shown in Figure 9b, requires a separate layer for each time step with the same weights for all layers.

κ·Έλ¦Ό 9bμ—μ„œ λ³΄μ΄λŠ” 이에 μƒμ‘ν•˜λŠ” μˆœμ „νŒŒ 신경망은, λ§€ μŠ€νƒ­λ§ˆλ‹€ λͺ¨λ“  λ ˆμ΄μ–΄λ“€μ— λŒ€ν•΄ 같은 κ°€μ€‘μΉ˜λ₯Ό κ°–λŠ” λΆ„λ¦¬λœ λ ˆμ΄μ–΄λ₯Ό μš”κ΅¬ν•œλ‹€.

If the weights are identical to those of the RNN, both networks show the same behaviour.

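
This equivalence can be sketched directly: running the recurrence step by step produces the same outputs as applying T copies of the same layer in sequence. The names below (`W_in`, `W_rec`) are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_in, n_hid = 5, 3, 2                 # time steps, input size, layer size

W_in = rng.normal(size=(n_hid, n_in))    # input weights
W_rec = rng.normal(size=(n_hid, n_hid))  # recurrent weights
x = rng.normal(size=(T, n_in))           # input sequence

# Recurrent formulation: one layer applied repeatedly over time.
y = np.zeros(n_hid)
outputs_rnn = []
for t in range(T):
    y = np.tanh(W_in @ x[t] + W_rec @ y)
    outputs_rnn.append(y)

# Unfolded formulation: a separate "layer" per time step, every layer
# carrying an identical copy of the same weights.
layers = [(W_in.copy(), W_rec.copy()) for _ in range(T)]
y = np.zeros(n_hid)
outputs_ffnn = []
for t, (Wi, Wr) in enumerate(layers):
    y = np.tanh(Wi @ x[t] + Wr @ y)
    outputs_ffnn.append(y)

assert np.allclose(outputs_rnn, outputs_ffnn)  # identical behaviour
```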

The unfolded network can be trained using the backpropagation algorithm described in Section 4.

νŽΌμ³μ§„ λ ˆμ΄μ–΄λŠ” μ„Ήμ…˜ 4에 λ¬˜μ‚¬λœ μ—­μ „νŒŒ μ•Œκ³ λ¦¬μ¦˜μ„ μ‚¬μš©ν•˜μ—¬ ν›ˆλ ¨λ  수 μžˆλ‹€.

At the end of a training sequence, the network is unfolded in time.


The error is calculated for the output units with existing target values using some chosen error measure.

μ—λŸ¬λŠ” λͺ‡λͺ‡ μ„ νƒλœ μ—λŸ¬ μΈ‘μ • 방식을 μ‚¬μš©ν•˜μ—¬ μ‘΄μž¬ν•˜λŠ” λͺ©ν‘œ λ³€μˆ˜λ“€κ³Ό 좜λ ₯ μœ λ‹›μ— λŒ€ν•˜μ—¬ κ³„μ‚°λœλ‹€.

Then, the error is injected backwards into the network and the weight updates for all time steps are calculated.


The weights in the recurrent version of the network are updated with the sum of its deltas over all time steps.

μ‹ κ²½λ§μ˜ μˆœν™˜ λ²„μ „μ˜ κ°€μ€‘μΉ˜λ“€μ€ λͺ¨λ“  νƒ€μž„μŠ€νƒ­μ΄ μ’…λ£Œλœ μ΄ν›„μ˜ λΈνƒ€κ°’λ“€μ˜ 합에 λŒ€ν•˜μ—¬ κ°±μ‹ λœλ‹€.

Figure 9: (a) shows a simple, fully connected recurrent neural network with a two-neuron layer. (b) shows the same network unfolded in time, with a separate layer for each time step; the unfolded network is a feed-forward neural network.

t λŠ” μœ„μ˜ 6번 곡식에 μ˜ν•΄ μ£Όμ–΄μ§€κ²Œ λœλ‹€. λ˜ν•œ 7번 κ³΅μ‹μ˜ κ°€μ€‘μΉ˜κ°€ 적용된 μž…λ ₯값도 ν•¨κ»˜ μ μš©λœλ‹€.

where v ∈ U ∩ Pre(u) and i ∈ I, the set of input units.


Note that the inputs to u at time Ο„ +1 are of two types: the environmental input that arrives at time Ο„ +1 via the input units, and the recurrent output from all non-input units in the network produced at time Ο„ .

Ο„ +1 νƒ€μž„μΌ λ•Œ, u λ₯Ό ν–₯ν•œ μž…λ ₯값듀은 두 κ°€μ§€ νƒ€μž…μ΄ μžˆλ‹€: μž…λ ₯ μœ λ‹›μ— μ˜ν•΄ Ο„ +1 νƒ€μž„μ— λ„μ°©ν•˜λŠ” ν™˜κ²½ μž…λ ₯κ³Ό Ο„ μ‹œνƒ€μž„μ— λ„€νŠΈμ›Œν¬μ—μ„œ μƒμ„±λœ λͺ¨λ“  non-μž…λ ₯ μœ λ‹›λ“€λ‘œλΆ€ν„° μ˜€λŠ” μˆœν™˜ 좜λ ₯이닀.

If the network is fully connected, then U ∩ Pre (u) is equal to the set U of non-input units.

λ§Œμ•½ λ„€νŠΈμ›Œν¬κ°€ μ™„μ „ 연결이라면, U ∩ Pre(u) λŠ” non-μž…λ ₯ μœ λ‹›μ˜ μ§‘ν•© U와 λ™μΌν•˜λ‹€.

Let T(Ο„ ) be the set of non-input units for which, at time Ο„ , the output value yu(Ο„ ) of the unit u ∈ T(Ο„ ) should match some target value du(Ο„ ).


The cost function is the summed error E_total(t', t) for the epoch t', t' + 1, . . . , t, which we want to minimise using a learning algorithm.


μ—λŸ¬μ˜ 총합은 8번 곡식에 μ˜ν•΄ μ •μ˜λœλ‹€. νƒ€μž„ Ο„ μ—μ„œμ˜ μ—λŸ¬ E(Ο„) λŠ” λͺ©μ  ν•¨μˆ˜λ‘œ squared error λ₯Ό μ‚¬μš©ν•˜μ—¬ μ •μ˜λœλ‹€. (9번 곡식) νƒ€μž„ Ο„ μ—μ„œμ˜ non-μž…λ ₯ μœ λ‹› u 의 μ—λŸ¬ e_u(Ο„ ) λŠ” 10번 곡식을 톡해 μ •μ˜λœλ‹€.

To adapt the weights, the error signal Ο‘_u(Ο„) of a non-input unit u at time Ο„ is defined as in Equation 11.
Expanding Ο‘_u, we obtain Equation 12.
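
A standard form of this error signal, consistent with the backpropagation of Section 4 (writing z_u(Ο„) for the weighted input of Equation 7 and f'_u for the derivative of the activation function; the paper's exact notation may differ):

    Ο‘_u(Ο„) = βˆ’ βˆ‚E_total(t', t) / βˆ‚z_u(Ο„)                               (11)
    Ο‘_u(Ο„) = f'_u(z_u(Ο„)) Β· e_u(Ο„)                        for Ο„ = t
    Ο‘_u(Ο„) = f'_u(z_u(Ο„)) Β· (e_u(Ο„) + Ξ£_{v ∈ U} W[v,u] Ο‘_v(Ο„ + 1))
                                                          for t' ≀ Ο„ < t  (12)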

t' μ‹œκ°„μ˜ μ—­μ „νŒŒ 계산 이후, κ°€μ€‘μΉ˜λŠ” μˆœν™˜ 버전 λ„€νŠΈμ›Œν¬μ˜ βˆ†W[u,v]λ₯Ό κ°±μ‹ ν•œλ‹€. μ΄λŠ” λͺ¨λ“  νƒ€μž„ μŠ€νƒ­μ— λŒ€ν•΄ μƒμ‘ν•˜λŠ” κ°€μ€‘μΉ˜ 갱신듀을 ν•©μΉ˜λŠ” κ²ƒμœΌλ‘œ 마치게 λœλ‹€.

A more detailed description of BPTT can be found in [74], [62] and [76].

6.2 Real-Time Recurrent Learning

The RTRL algorithm does not require error propagation.

RTRL μ•Œκ³ λ¦¬μ¦˜μ—μ„œλŠ” μ—λŸ¬ μ „νŒŒκ°€ ν•„μš”ν•˜μ§€ μ•Šλ‹€.

All the information necessary to compute the gradient is collected as the input stream is presented to the network.


This makes a dedicated training interval obsolete.

μ΄λŠ” νŠΉμ •ν•œ ν›ˆλ ¨ 간격을 ν•„μš”ν•˜μ§€ μ•Šκ²Œ ν•©λ‹ˆλ‹€.

The algorithm comes at significant computational cost per update cycle, and the stored information is non-local; i.e., we need an additional notion called sensitivity of the output, which we’ll explain later.

이 μ•Œκ³ λ¦¬μ¦˜μ€ λ§€ κ°±μ‹  μ‚¬μ΄ν΄λ§ˆλ‹€ μƒλ‹Ήν•œ 계산 μžμ›μ΄ λ“€κ³ , μ €μž₯λ˜λŠ” 정보가 non-둜컬 ν•˜λ‹€. 즉, 좜λ ₯의 민감도라 λΆˆλ¦¬μš°λŠ” 좔가적인 κ°œλ…μ΄ ν•„μš”ν•˜κ³ , μ΄λŠ” λ‚˜μ€‘μ— μ„€λͺ…ν•  것이닀.

Nevertheless, the memory required depends only on the size of the network and not on the size of the input.

λ°˜λ©΄μ—, λ©”λͺ¨λ¦¬ μš”κ΅¬λŸ‰μ€ μž…λ ₯의 크기가 μ•„λ‹Œ 였직 λ„€νŠΈμ›Œν¬μ˜ 크기에 μ˜μ‘΄ν•œλ‹€.

Following the notation from the previous section, we now give definitions for the network units v ∈ I βˆͺ U and u, k ∈ U, and the time steps t' ≀ Ο„ ≀ t.

λ‹€μŒμ€ 이전 μ„Ήμ…˜μ— λ‚˜μ™”λ˜ κ°œλ…μœΌλ‘œ, λ„€νŠΈμ›Œν¬ μœ λ‹› v ∈ I βˆͺ U and u, k ∈ U 그리고 νƒ€μž„ μŠ€νƒ­ t' ≀ Ο„ ≀ t λ₯Ό μ •μ˜ν•œλ‹€.

Unlike BPTT, in RTRL we assume the existence of a label d_k(Ο„ ) at every time Ο„ (given that it is an online algorithm) for every non-input unit k, so the training objective is to minimise the overall network error, which is given at time step Ο„ by


We conclude from Equation 8 that the gradient of the total error is also the sum of the gradient for all previous time steps and the current time step:

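
In symbols, with βˆ‡_W denoting the gradient with respect to all weights:

    βˆ‡_W E_total(t', t) = βˆ‡_W E_total(t', t βˆ’ 1) + βˆ‡_W E(t) = Ξ£_{Ο„ = t'}^{t} βˆ‡_W E(Ο„)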

During presentation of the time series to the network, we need to accumulate the values of the gradient at each time step. Thus, we can also keep track of the weight changes βˆ†W[u,v]. After presentation, the overall weight change for W[u,v] is then given by

λ„€νŠΈμ›Œν¬μ— μ‹œκ³„μ—΄μ΄ μ‘΄μž¬ν•˜λŠ” λ™μ•ˆ, 각 νƒ€μž„μŠ€νƒ­μ˜ 기울기 값듀을 λˆ„μ μ‹œμΌœμ•Ό ν•œλ‹€. κ·ΈλŸ¬λ―€λ‘œ, 기울기의 λ³€ν™” βˆ†Wu,v λ₯Ό μ«“μ•„κ°€μ•Ό ν•œλ‹€. 쑴재 이후, μ „λ°˜μ μΈ W[u,v] 에 λŒ€ν•œ κ°€μ€‘μΉ˜ λ³€ν™”λŠ” λ‹€μŒκ³Ό 같이 μ£Όμ–΄μ§„λ‹€.

To obtain the weight changes, the second factor above has to be calculated. For each time step t, unfolding the expression according to gradient descent and applying Equation 9 yields Equation 14 below.

Since the error ek(Ο„ ) = dk(Ο„ ) βˆ’ yk(Ο„ ) is always known, we need to find a way to calculate the second factor only. We define the quantity

μ—λŸ¬ e_k(Ο„ ) = d_k(Ο„ ) βˆ’ y_k(Ο„ ) λŠ” μ–Έμ œλ‚˜ μ•Œλ €μ ΈμžˆκΈ° λ•Œλ¬Έμ—, 두 번째 μš”μ†Œ(?) 만 찾으면 λœλ‹€. κ·Έ 방정식은 λ‹€μŒ 15번 곡식과 같이 μ •μ˜λœλ‹€.

which measures the sensitivity of the output of unit k at time Ο„ to a small change in the weight W[u,v], in due consideration of the effect of such a change in the weight over the entire network trajectory from time t' to t.


The weight W[u,v] does not have to be connected to unit k, which makes the algorithm non-local.


Local changes in the network can have an effect anywhere in the network.

λ„€νŠΈμ›Œν¬μ—μ„œ 둜컬 λ³€ν™”λŠ” λ„€νŠΈμ›Œν¬ μ–΄λ””μ„œλ“  영ν–₯을 λ―ΈμΉ  수 μžˆλ‹€.

In RTRL, the gradient information is forward-propagated. Using Equations 6 and 7, the output y_k(t + 1) at time step t + 1 is given by

RTRL μ—μ„œ 기울기 μ •λ³΄λŠ” μ „λ°©μœΌλ‘œ μ „νŒŒλœλ‹€. 방정식 6κ³Ό 7을 μ‚¬μš©ν•˜μ—¬, νƒ€μž„ μŠ€νƒ­ t + 1 μ—μ„œμ˜ 좜λ ₯ y_k(t + 1) λŠ” 16번과 같이 μ£Όμ–΄μ§„λ‹€.

Written with the weighted input, this becomes Equation 17.

By differentiating Equations 15, 16 and 17, the result for all time steps β‰₯ t + 1 can be calculated as follows.

Here Ξ΄_uk denotes the Kronecker delta:

    Ξ΄_uk = 1 if u = k
           0 otherwise

Here, under the assumption that the initial state of the network has no functional dependency on the weights, the derivative at the first time step is given by the following.

Equation 18 shows how p^k_uv(t + 1) can be calculated in terms of p^k_uv(t).

방정식 18λ²ˆμ€ p^k_uv(t) 의 κ΄€μ μ—μ„œ μ–΄λ–»κ²Œ p^k_uv(t + 1) κ°€ κ³„μ‚°λ˜λŠ” μ§€λ₯Ό 보여쀀닀.

In this sense, the learning algorithm becomes incremental, so that we can learn as we receive new inputs (in real time), and we no longer need to perform back-propagation through time.

이 κ°œλ…μ—μ„œ, ν•™μŠ΅ μ•Œκ³ λ¦¬μ¦˜μ€ 증뢄이 되고(?), μƒˆλ‘œμš΄ μž…λ ₯을 μ‹€μ‹œκ°„μœΌλ‘œ λ°›κΈ° λ•Œλ¬Έμ— ν•™μŠ΅μ‹œν‚¬ 수 μžˆλ‹€. κ·ΈλŸ¬λ―€λ‘œ, 더 이상 μ‹œκ°„ λ‚΄λ‚΄(?) μ—­μ „νŒŒλ₯Ό μˆ˜ν–‰ν•  ν•„μš”κ°€ μ—†μ–΄μ§„λ‹€.

Knowing the initial value for p^k_uv at time t' from Equation 19, we can recursively calculate the quantities p^k_uv for the first and all subsequent time steps using Equation 18.


Note that p^k_uv(Ο„ ) uses the values of W[u,v] at t', and not values in-between t' and Ο„.

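
The forward-propagated recursion can be sketched as follows, assuming a tiny fully recurrent tanh network with identity input weights and a target at the final step (simplifications chosen here for illustration). The sensitivities p^k_uv are carried forward alongside the state, combined with the error vector to obtain the gradient, and checked against a numerical gradient; since W is held fixed throughout, the sensitivities use the weights from the start of the epoch, as noted above.

```python
import numpy as np

rng = np.random.default_rng(2)
T, n = 4, 3
W = rng.normal(scale=0.5, size=(n, n))   # recurrent weights
x = rng.normal(size=(T, n))              # input sequence
d = rng.normal(size=n)                   # target at the final step

# Forward pass, carrying the sensitivities p[k, u, v] = dy_k/dW[u, v]
# alongside the state -- no backward pass is needed.
h = np.zeros(n)
p = np.zeros((n, n, n))                  # Equation 19: p = 0 at the start
for t in range(T):
    h_prev, p_prev = h, p
    h = np.tanh(x[t] + W @ h_prev)
    fprime = 1.0 - h ** 2
    p = np.zeros_like(p_prev)
    for k in range(n):
        for u in range(n):
            for v in range(n):
                rec = W[k] @ p_prev[:, u, v]          # sum_l W[k,l] p^l_uv(t)
                p[k, u, v] = fprime[k] * ((k == u) * h_prev[v] + rec)

# Gradient of E(t) = 0.5*||d - h(t)||^2 from the sensitivities.
e = h - d
dW = np.einsum('k,kuv->uv', e, p)

# Numerical check.
eps = 1e-6
num = np.zeros_like(W)
for u in range(n):
    for v in range(n):
        g = []
        for s in (eps, -eps):
            Wp = W.copy(); Wp[u, v] += s
            hh = np.zeros(n)
            for t in range(T):
                hh = np.tanh(x[t] + Wp @ hh)
            g.append(0.5 * np.sum((d - hh) ** 2))
        num[u, v] = (g[0] - g[1]) / (2 * eps)
assert np.allclose(dW, num, atol=1e-6)
```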

Combining these values with the error vector e(Ο„ ) for that time step, using Equation 14, we can finally calculate the negative error gradient βˆ‡WE(Ο„ ).


The final weight change for W[u,v] can be calculated using Equations 14 and 13.


A more detailed description of the RTRL algorithm is given in [75] and [76].

