04 Feed Forward Neural Networks and Backpropagation - PAI-yoonsung/lstm-paper GitHub Wiki
In feed-forward neural networks (FFNNs), sets of neurons are organised in layers, where each neuron computes a weighted sum of its inputs.
Input neurons take signals from the environment, and output neurons present signals to the environment.
Neurons that are not directly connected to the environment, but which are connected to other neurons, are called hidden neurons.
Feed-forward neural networks are loop-free and fully connected.
This means that each neuron provides an input to each neuron in the following layer, and that none of the weights give an input to a neuron in a previous layer.
The simplest type of neural feed-forward networks are single-layer perceptron networks.
Single-layer neural networks consist of a set of input neurons, defined as the input layer, and a set of output neurons, defined as the output layer.
The outputs of the input-layer neurons are directly connected to the neurons of the output layer.
The weights are applied to the connections between the input and output layer.
In the single-layer perceptron network, every single perceptron calculates the sum of the products of the weights and the inputs.
The perceptron fires β1β if the value is above the threshold value;
otherwise, the perceptron takes the deactivated value, which is usually β-1β.
The threshold value is typically zero.
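As a minimal sketch of this rule (the weights, inputs and function name below are illustrative, not taken from the text):

```python
def perceptron(inputs, weights, threshold=0.0):
    # Weighted sum of the inputs, as computed by a single perceptron.
    s = sum(w * x for w, x in zip(weights, inputs))
    # Fire "1" above the threshold, otherwise take the deactivated value "-1".
    return 1 if s > threshold else -1

# Example: 1.0 * 0.8 + (-0.5) * 0.4 = 0.6 > 0, so the perceptron fires.
print(perceptron([1.0, -0.5], [0.8, 0.4]))  # -> 1
```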
Sets of neurons organised in several layers can form multilayer, forward-connected networks.
The input and output layers are connected via at least one hidden layer, built from sets of hidden neurons.
The multilayer feed-forward neural network sketched in Figure 4, with one input layer and three further layers (two hidden and one output), is classified as a 3-layer feed-forward neural network.
For most problems, feed-forward neural networks with more than two layers offer no advantage.
Multilayer feed-forward networks using sigmoid threshold functions are able to express non-linear decision surfaces.
Any function can be closely approximated by these networks, given enough hidden units.
Figure 4: A multilayer feed-forward neural network with one input layer, two hidden layers, and an output layer. Using neurons with sigmoid threshold functions, these neural networks are able to express non-linear decision surfaces.
The most common neural network learning technique is the error backpropagation algorithm.
It uses gradient descent to learn the weights in multilayer networks.
It works in small iterative steps, starting backwards from the output layer towards the input layer.
A requirement is that the activation function of the neuron is differentiable.
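The commonly used sigmoid satisfies this requirement: its derivative can be expressed in terms of the function value itself, a standard identity sketched here for illustration:

```python
import math

def sigmoid(x):
    # Sigmoid threshold function: maps any real input into (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    # Derivative sigma'(x) = sigma(x) * (1 - sigma(x)); this closed form is
    # what makes sigmoid units convenient for error backpropagation.
    s = sigmoid(x)
    return s * (1.0 - s)
```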
Usually, the weights of a feed-forward neural network are initialised to small, normalised random numbers using bias values.
Then, error backpropagation applies all training samples to the neural network and computes the input and output of each unit for all (hidden and) output layers.
The set of neural network units is given by the formula above: U is the disjoint union of I, H and O, the sets of input, hidden and output units, respectively.
We denote input units by i, hidden units by h and output units by o.
For convenience, we define the set of non-input units U′ = H ⊔ O.
For a non-input unit u ∈ U′, the input to u is denoted by x_u, its state by s_u, its bias by b_u and its output by y_u.
Given units u, v β U, the weight that connects u with v is denoted by Wuv.
To model the external input that the neural network receives, we use the external input vector x = ⟨x_1, . . . , x_n⟩.
For each component of the external input vector we find a corresponding input unit that models it, so the output of the i-th input unit should be equal to the i-th component of the input to the network (i.e., x_i), and consequently |I| = n.
For a non-input unit u ∈ U′, the output of u, denoted y_u, is defined with the sigmoid activation function as in Equation (1): y_u = σ(s_u). Here s_u denotes the state of u, defined as in Equation (2): s_u = z_u + b_u, where b_u is the bias of u and z_u is the weighted input to u, which is expressed as in Equation (3): z_u = Σ_{v ∈ Pre(u)} X[v,u],
where X[v,u] is the information that v passes as input to u, and Pre(u) is the set of units v that precede u; that is, the input units and hidden units that feed their outputs y_v (see Equation (1)), multiplied by the corresponding weight W[v,u], to the unit u.
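Equations (1)-(3) for a single non-input unit could be sketched as follows (the function and variable names are mine, chosen to mirror the notation above):

```python
import math

def sigmoid(x):
    # Sigmoid activation function used in Equation (1).
    return 1.0 / (1.0 + math.exp(-x))

def unit_output(pre_outputs, weights, bias):
    # z_u: weighted input, summing W[v,u] * y_v over the units v in Pre(u).
    z = sum(w * y for w, y in zip(weights, pre_outputs))
    # s_u: state of the unit, the weighted input plus the bias b_u.
    s = z + bias
    # y_u: output of the unit through the sigmoid activation.
    return sigmoid(s)
```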
Starting from the input layer, the inputs are propagated forwards through the network until the output units are reached at the output layer.
Then, the output units produce an observable output (the network output) y.
More precisely, for o ∈ O, its output y_o corresponds to the o-th component of y.
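The forward propagation described here can be sketched layer by layer (a minimal illustration; representing each layer as a (weights, biases) pair is an assumption of this sketch):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, layers):
    # Propagate the external input vector x forwards through the network.
    # Each layer is a pair (W, b): W[i][j] connects unit i of the previous
    # layer to unit j of this layer, and b[j] is the bias of unit j.
    y = x
    for W, b in layers:
        y = [sigmoid(sum(W[i][j] * y[i] for i in range(len(y))) + b[j])
             for j in range(len(b))]
    return y  # the observable network output
```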
Next, the backpropagation learning algorithm propagates the error backwards, and the weights and biases are updated such that we reduce the error with respect to the present training sample.
Starting from the output layer, the algorithm compares the network output yo with the corresponding desired target output do.
It calculates the error e_o for each output neuron using some error function to be minimised.
The error e_o is computed with the first formula in the figure above; the second formula is used to compute the overall error of the network.
To update the weight W[u,v], we use the first formula in the figure above, where η denotes the learning rate. We then apply the chain rule (ŷ and ŝ in the figure) to differentiate the error with respect to the activation, and to compute the partial derivatives with respect to the state and the weights.
The partial derivative of the error with respect to the activation of an output unit is given by the first equation above; the partial derivative of the activation with respect to the state of an output unit is given by the second.
For an output unit o, the error signal is given by Equation (4); for the output units we obtain Equation (5). The weight between a hidden unit h and an output unit o can then be updated as follows.
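Assuming sigmoid output units and a squared-error function, the standard forms of the output-unit error signal and the hidden-to-output weight update can be sketched as follows (a reconstruction of the usual textbook rule, since Equations (4) and (5) themselves appear as figures in the text):

```python
def output_delta(y_o, d_o):
    # Error signal of output unit o for sigmoid units with squared error:
    # delta_o = (d_o - y_o) * y_o * (1 - y_o).
    return (d_o - y_o) * y_o * (1.0 - y_o)

def update_weight(w_ho, y_h, delta_o, eta):
    # Gradient-descent step for the weight between hidden unit h and
    # output unit o, with learning rate eta.
    return w_ho + eta * delta_o * y_h
```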
dictionary
organised: arranged into a structure; iterative: proceeding in repeated steps; differentiable: having a derivative; precede: to come before; propagate: to pass along