01 Introduction - PAI-yoonsung/lstm-paper GitHub Wiki

1 Introduction

This article is a tutorial-like introduction, initially developed as supplementary material for lectures focused on Artificial Intelligence.

Interested readers can deepen their knowledge of Long Short-Term Memory Recurrent Neural Networks (LSTM-RNN) by tracing their evolution since the early nineties.

Today's publications on LSTM-RNN use slightly different notation and a much more condensed presentation of the derivations.

Nevertheless, the authors found the presented approach very helpful, and we are confident this publication will find its audience.

Machine learning is concerned with the development of algorithms that automatically improve with practice. Ideally, the more the learning algorithm is run, the better it becomes.

It is the task of the learning algorithm to create a classifier function from the presented training data.

The performance of the built classifier is then measured by applying it to previously unseen data.

Artificial Neural Networks (ANN) are inspired by biological learning systems and loosely model their basic functions.

Biological learning systems are complex webs of interconnected neurons.

Neurons are simple units that accept a vector of real-valued inputs and produce a single real-valued output.

The most common standard neural network type is the feed-forward neural network.

Here, sets of neurons are organised in layers: one input layer, one output layer, and at least one intermediate hidden layer.

Feed-forward neural networks are limited to static classification tasks.

Therefore, they are limited to providing a static mapping between input and output.

To model time prediction tasks, we need a so-called dynamic classifier.

We can extend feed-forward neural networks towards dynamic classification.

To gain this property, we need to feed signals from previous timesteps back into the network.

These networks with recurrent connections are called Recurrent Neural Networks (RNN) [74], [75].

RNNs are limited to looking back in time for approximately ten timesteps [38], [56].

This is because the fed-back signal either vanishes or explodes.

This issue was addressed with Long Short-Term Memory Recurrent Neural Networks (LSTM-RNN) [22], [41], [23], [60].

LSTM networks are, to a certain extent, biologically plausible [58] and capable of learning more than 1,000 timesteps, depending on the complexity of the built network [41].

In the early, ground-breaking papers by Hochreiter [41] and Graves [34], the authors used different notations, which made further development error-prone and inconvenient to follow.

To address this, we developed a unified notation and drew descriptive figures to support the interested reader in understanding the related equations of the early publications.

In the following, we slowly dive into the world of neural networks, and specifically LSTM-RNNs, with a selection of their most promising extensions documented so far.

We successively explain how neural networks evolved from a single perceptron to something as powerful as LSTM.

This includes vanilla LSTM which, although no longer used in practice, is covered as the fundamental evolutionary step.

With this article, we support beginners in the machine learning community in understanding how LSTM works, with the intention of motivating its further development.

This is the first document that covers LSTM and its extensions in such great detail.
