08 Long Short Term Neural Networks - PAI-yoonsung/lstm-paper GitHub Wiki

8 Long Short-Term Neural Networks

One solution that addresses the vanishing error problem is a gradient-based method called long short-term memory (LSTM), published in [41], [42], [22], and [23].

LSTM can learn how to bridge minimal time lags of more than 1,000 discrete time steps.

The solution uses constant error carousels (CECs), which enforce a constant error flow within special cells.

Access to the cells is handled by multiplicative gate units, which learn when to grant access.

μ…€λ“€μ—κ²Œ μ ‘μ†ν•˜λŠ” 것은 접속 ν—ˆμš©μ„ μ–Έμ œν•  μ§€λ₯Ό ν•™μŠ΅ν•˜λŠ” κ³±μ…‰ 게이트 μœ λ‹›λ“€μ— μ˜ν•΄ μ‘°μ ˆλœλ‹€

8.1 Constant Error Carousel

Suppose that we have only one unit u with a single connection to itself.

The local error backflow of u at a single time step τ follows from Equation 20 and is given by

From Equations 22 and 23 we see that, in order to ensure a constant error flow through u, we need to have

and by integration we have

λ˜ν•œ 적뢄을 톡해, μš°λ¦¬λŠ” λ‹€μŒμ„ κ°€μ§€κ²Œ λœλ‹€.

From this, we learn that f_u must be linear, and that u's activation must remain constant over time; i.e.,

This is ensured by using the identity function f_u = id, and by setting W[u,u] = 1.0.

μ΄λŠ” ν•­λ“± ν•¨μˆ˜ f_u = id λ₯Ό μ‚¬μš©ν•˜λŠ” 것과 W[u,u] = 1.0 λ₯Ό μ„€μ •ν•˜λŠ” κ²ƒμœΌλ‘œ 확보될 수 μžˆλ‹€.

This preservation of error is called the constant error carousel (CEC), and it is the central feature of LSTM, where short-term memory storage is achieved for extended periods of time.

μ΄λŸ¬ν•œ μ—λŸ¬ μ˜ˆλ°©μ„ constant error carousel (CEC) 라고 λΆ€λ₯΄κ³ , μ΄λŠ” short-term ν•œ λ©”λͺ¨λ¦¬ μ €μž₯곡간이 μ—°μž₯된 μ‹œκ°„μ„ κ°€μ§ˆ 수 μžˆκ²Œν•΄μ£ΌλŠ” LSTM 의 쀑심적인 νŠΉμ§•μ΄λ‹€.

Clearly, we still need to handle the connections from other units to the unit u, and this is where the different components of LSTM networks come into the picture.

8.2 Memory blocks

In the absence of new inputs to the cell, we now know that the CEC's backflow remains constant.

However, as part of a neural network, the CEC is not only connected to itself, but also to other units in the neural network.

We need to take these additional weighted inputs and outputs into account.

μ΄λŸ¬ν•œ 좔가적인 κ°€μ€‘μΉ˜κ°€ 적용된 μž…λ ₯κ³Ό 좜λ ₯을 κ³ λ €ν•΄μ•Ό ν•œλ‹€.

Incoming connections to neuron u can have conflicting weight update signals, because the same weight is used for storing and ignoring inputs.

For weighted output connections from neuron u, the same weights can be used to both retrieve u's contents and prevent u's output flow to other neurons in the network.

μœ λ‹› u λ‘œλΆ€ν„°μ˜ κ°€μ€‘μΉ˜κ°€ 적용된 좜λ ₯ 연결듀에 λŒ€ν•΄μ„œ, 같은 κ°€μ€‘μΉ˜λ“€μ€ u 의 λ‚΄μš©μ„ λ³΅κ΅¬ν•˜κ±°λ‚˜ y의 좜λ ₯을 신경망 λ‚΄ λ‹€λ₯Έ λ‰΄λŸ°λ“€λ‘œ 흐λ₯΄κ²Œ ν•  λ•Œ μ‚¬μš©λ  수 μžˆλ‹€.

To address the problem of conflicting weight updates, LSTM extends the CEC with input and output gates connected to the network input layer and to other memory cells.

μΆ©λŒν•˜λŠ” κ°€μ€‘μΉ˜ κ°±μ‹  문제λ₯Ό 닀루기 μœ„ν•΄, LSTM 은 λ„€νŠΈμ›Œν¬ μž…λ ₯ λ ˆμ΄μ–΄μ™€ λ‹€λ₯Έ λ©”λͺ¨λ¦¬ 셀듀에 μ—°κ²°λœ μž…λ ₯, 좜λ ₯ κ²Œμ΄νŠΈλ“€λ‘œ CEC λ₯Ό μ—°μž₯ν•œλ‹€.

This results in a more complex LSTM unit, called a memory block; its standard architecture is shown in Figure 11.

The input gates, which are simple sigmoid threshold units with an activation function range of [0, 1], control the signals from the network to the memory cell by scaling them appropriately; when the gate is closed, activation is close to zero.

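Numerically, gating is just multiplication by a sigmoid activation in [0, 1]; a toy illustration (the values are made up):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

signal = 0.8                  # value arriving from the network
open_gate = sigmoid(6.0)      # ~0.998: gate almost fully open
closed_gate = sigmoid(-6.0)   # ~0.002: gate almost fully closed

print(signal * open_gate)     # ~0.8 -> the signal reaches the cell
print(signal * closed_gate)   # ~0.0 -> the cell is protected from it
```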
Additionally, these can learn to protect the contents stored in u from disturbance by irrelevant signals.

μΆ”κ°€μ μœΌλ‘œ, 이듀은 u 에 μ €μž₯된 λ‚΄μš©λ“€μ„ μƒκ΄€μ—†λŠ” μ‹ ν˜Έμ˜ λ°©ν•΄λ‘œλΆ€ν„° μ§€ν‚€λŠ” 것을 배울 수 μžˆλ‹€.

The activation of a CEC by the input gate is defined as the cell state.

μž…λ ₯ κ²Œμ΄λ“œμ— μ˜ν•œ CEC 의 ν™œμ„±ν™”λŠ” μ…€ μƒνƒœμ— μ˜ν•΄ μ •μ˜λœλ‹€.

The output gates can learn how to control access to the memory cell contents, which protects other memory cells from disturbances originating from u.

좜λ ₯ κ²Œμ΄νŠΈλ“€ λ©”λͺ¨λ¦¬ μ…€ λ‚΄μš©μ„ ν–₯ν•œ 접근을 μ–΄λ–»κ²Œ μ»¨νŠΈλ‘€ν•˜λŠ”μ§€ 배울 수 μžˆλ‹€. 이λ₯Ό 톡해, u λ‘œλΆ€ν„° μ‹œμž‘λ˜λŠ” λ°©ν•΄λ“€λ‘œλΆ€ν„° λ‹€λ₯Έ λ©”λͺ¨λ¦¬μ…€λ“€μ„ 지킨닀.

So we can see that the basic function of multiplicative gate units is to either allow or deny access to constant error flow through the CEC.

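The memory block described above can be sketched as a few lines of code. This is a minimal illustration in the spirit of the original formulation (one cell, no forget gate, no peepholes); all weight names and values are illustrative, not from the paper:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def memory_block_step(x, c_prev, p):
    """One forward step of a single memory block: a CEC (the cell state c)
    guarded by multiplicative input and output gates. `p` holds illustrative
    scalar weights."""
    i = sigmoid(p["w_ix"] * x + p["b_i"])    # input gate in [0, 1]
    o = sigmoid(p["w_ox"] * x + p["b_o"])    # output gate in [0, 1]
    g = math.tanh(p["w_cx"] * x + p["b_c"])  # candidate input to the cell
    c = c_prev + i * g    # CEC: f_u = id and W[u,u] = 1.0
    y = o * math.tanh(c)  # gated output of the block
    return c, y

params = {"w_ix": 1.0, "b_i": -4.0,  # gate mostly closed unless x is large
          "w_ox": 1.0, "b_o": 0.0,
          "w_cx": 1.0, "b_c": 0.0}

c, outputs = 0.0, []
for x in [5.0, 0.5, 0.5, 0.5]:  # store once, then weak irrelevant inputs
    c, y = memory_block_step(x, c, params)
    outputs.append(y)
# After the first step the input gate stays nearly closed (i ~ 0.03 for
# x = 0.5), so the stored cell state barely changes on later steps.
print(c)
```

The first, large input opens the input gate and writes into the cell; the subsequent small inputs are almost entirely blocked, so the cell state persists, which is exactly the protection from irrelevant signals described above.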