Deep Learning
Mini-batch gradient descent
- Vectorization allows you to efficiently compute on m examples.
- X = [x^(1) x^(2) ... x^(m)] / size: (n_x, m)
- Y = [y^(1) y^(2) ... y^(m)] / size: (1, m)
- Let's say that you split up your training set into smaller, little baby training sets; these baby training sets are called mini-batches.
- X = [ X^{1} X^{2} ... X^{5000} ] where X^{1} = [ x^(1) ... x^(M) ], X^{2} = [ x^(M+1) ... x^(2M) ], ..., M: mini-batch size, so each X^{t} has shape (n_x, M) (a splitting sketch follows the notation note below)
※ superscript round brackets x^(i) : i-th training example
  superscript square brackets z^[l] : l-th layer of the neural network
  superscript curly brackets X^{t} : t-th mini-batch
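As a concrete illustration of the split, here is a minimal numpy sketch (the function name `random_mini_batches` and the shuffling step are assumptions for illustration, not code from the course):

```python
import numpy as np

def random_mini_batches(X, Y, batch_size=64, seed=0):
    """Split X of shape (n_x, m) and Y of shape (1, m) into shuffled mini-batches (X^{t}, Y^{t})."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    perm = rng.permutation(m)                 # shuffle examples before partitioning
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]

    mini_batches = []
    for start in range(0, m, batch_size):     # the last mini-batch may be smaller than batch_size
        end = start + batch_size
        mini_batches.append((X_shuf[:, start:end], Y_shuf[:, start:end]))
    return mini_batches
```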
Understanding mini-batch gradient descent
- Training with mini-batch gradient descent
- compute the cost function J^{t} using just X^{t}, Y^{t}
- plotted over iterations, J^{t} should trend downwards, but it's also going to be a little bit noisier than the batch-gradient-descent cost (see the sketch below).
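To make the noisy J^{t} curve concrete, below is a minimal sketch of one epoch of mini-batch gradient descent, using logistic regression as a stand-in model (the model choice and the function names are illustrative assumptions; it consumes the list produced by the `random_mini_batches` sketch above):

```python
import numpy as np

def minibatch_gd_epoch(X, Y, w, b, mini_batches, lr=0.01):
    """One epoch of mini-batch gradient descent for logistic regression.

    Each step uses only one mini-batch (X_t, Y_t) and records the per-batch cost J^{t},
    which trends downwards but is noisier than the full-batch cost.
    """
    costs = []
    for X_t, Y_t in mini_batches:
        M = X_t.shape[1]
        A = 1.0 / (1.0 + np.exp(-(w.T @ X_t + b)))   # forward pass on the mini-batch only
        J_t = -np.mean(Y_t * np.log(A + 1e-8) + (1 - Y_t) * np.log(1 - A + 1e-8))
        dZ = A - Y_t                                  # gradient of the cost w.r.t. z
        dw = (X_t @ dZ.T) / M
        db = np.mean(dZ)
        w -= lr * dw                                  # one parameter update per mini-batch
        b -= lr * db
        costs.append(J_t)
    return w, b, costs

# usage with the sketch above:
#   w, b = np.zeros((X.shape[0], 1)), 0.0
#   w, b, costs = minibatch_gd_epoch(X, Y, w, b, random_mini_batches(X, Y, 64))
```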
- One of the parameters you need to choose is the size of your mini-batch.
- if mini-batch size = m : Batch gradient descent, (X^{1}, Y^{1}) = (X, Y)
- if mini-batch size = 1 : Stochastic gradient descent, but you lose almost all the speed-up from vectorization.
- In practice : somewhere in between 1 and m, getting a lot of vectorization and making progress without needing to wait until you process the entire training set.
- if you have a small training set (m <= 2000), just use batch gradient descent
- typical mini-batch sizes are powers of 2, anything from 64 up to maybe 512 (64, 128, 256, 512)
- one last tip is to make sure that each mini-batch X^{t}, Y^{t} fits in CPU/GPU memory.
Understanding exponentially weighted averages
- Exponentially weighted averages: v_t = β v_{t-1} + (1 - β) θ_t, which behaves roughly like an average over the last 1/(1 - β) values (sketch below).
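A minimal sketch of the recursion, assuming a simple 1-D sequence of values θ_t (the function name and the toy data are illustrative):

```python
import numpy as np

def exponentially_weighted_average(theta, beta=0.9):
    """Compute v_t = beta * v_{t-1} + (1 - beta) * theta_t over a 1-D sequence.

    With beta = 0.9 this behaves roughly like an average over the last
    1 / (1 - beta) = 10 values.
    """
    v = 0.0
    averages = []
    for theta_t in theta:
        v = beta * v + (1 - beta) * theta_t   # exponentially decaying weight on past values
        averages.append(v)
    return np.array(averages)

# usage: smooth a noisy sequence
noisy = np.sin(np.linspace(0, 6, 100)) + 0.3 * np.random.randn(100)
smooth = exponentially_weighted_average(noisy, beta=0.9)
```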
Notation
- x^<t> : the element at position t of a sequence
- T_x : length of the input sequence
- T_y : length of the output sequence
- x^(i) : the i-th training example
- x^(i)<t> : the t-th element of the i-th training example
- if the word "apple" is the 100th word in the dictionary, the corresponding x^<t> = [0 0 ... 1 ... 0 0]
  - i.e. a one-hot vector with a 1 only at the 100th position and 0 everywhere else (see the sketch below).
- T_x^(i) : length of the input sequence of the i-th training example
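A minimal sketch of building such a one-hot vector, assuming a 10,000-word vocabulary to match the W_ax example further down (the function name is illustrative):

```python
import numpy as np

def one_hot_word(index, vocab_size=10000):
    """Return the one-hot column vector x^<t> for the word at position `index` in the vocabulary."""
    x = np.zeros((vocab_size, 1))
    x[index] = 1.0
    return x

# if "apple" is the 100th word in the dictionary (index 99 with 0-based indexing)
x_t = one_hot_word(99)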
Recurrent Neural Network Model
- Problems with using a standard network on sequence data
- inputs, outputs can be different lengths in different examples.
- doesn't share features learned across different positions of text.
- an RNN uses only the inputs that come earlier in the sequence, so it does not use information from later in the sequence.
- this is addressed by the Bidirectional RNN (BRNN).
- a^<1> = g(W_aa * a^<0> + W_ax * x^<1> + b_a)
- y_hat^<1> = g(W_ya * a^<1> + b_y)
- Simplified RNN notation
- a^<t> = g(W_aa * a^<t-1> + W_ax * x^<t> + b_a)
- y_hat^<t> = g(W_ya * a^<t> + b_y)
- W_a = [ W_aa | W_ax ]
- if sizeof( a^<t> ) = 100, W_aa : (100, 100)
- if sizeof( x^<t> ) = 10000, W_ax : (100, 10000)
- ∴ W_a : (100, 10100)
- ∵ [ W_aa | W_ax ] [ a^<t-1> ; x^<t> ] = W_aa * a^<t-1> + W_ax * x^<t>, so a^<t> = g(W_a [ a^<t-1> ; x^<t> ] + b_a) (checked numerically in the sketch below)
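A minimal numpy sketch of one forward step, checking that the stacked form gives the same a^<t> as the separate-weights form (tanh and softmax are assumed here as the hidden and output activations; the names and shapes are illustrative):

```python
import numpy as np

def rnn_step(a_prev, x_t, W_aa, W_ax, W_ya, b_a, b_y):
    """One RNN forward step, written with separate weights and with the stacked W_a."""
    # separate-weights form: a^<t> = tanh(W_aa a^<t-1> + W_ax x^<t> + b_a)
    a_t = np.tanh(W_aa @ a_prev + W_ax @ x_t + b_a)

    # equivalent stacked form: W_a = [W_aa | W_ax], input = [a^<t-1> ; x^<t>]
    W_a = np.hstack([W_aa, W_ax])        # e.g. (100, 10100)
    stacked = np.vstack([a_prev, x_t])   # e.g. (10100, 1)
    a_t_check = np.tanh(W_a @ stacked + b_a)
    assert np.allclose(a_t, a_t_check)

    # output: y_hat^<t> = softmax(W_ya a^<t> + b_y), softmax assumed as the output activation
    z_y = W_ya @ a_t + b_y
    y_hat = np.exp(z_y - z_y.max()) / np.exp(z_y - z_y.max()).sum()
    return a_t, y_hat

# shapes matching the sizes above: a_prev (100, 1), x_t (10000, 1),
# W_aa (100, 100), W_ax (100, 10000), W_ya (n_y, 100), b_a (100, 1), b_y (n_y, 1)
```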