Deep Learning

Improving Deep Neural Networks: Hyperparameter tuning

Optimization algorithms

Mini-batch gradient descent

  • Vectorization allows you to efficiently compute on m examples.

    • X = [x^(1) x^(2) ... x^(m)] / size: (n_x, m)
    • Y = [y^(1) y^(2) ... y^(m)] / size: (1, m)
  • Split the training set into smaller "baby" training sets, called mini-batches (see the sketch after the notation note below).

    • X = [ X^{1} X^{2} ... X^{5000} ], where X^{t} = [ x^(1) x^(2) ... x^(M) ] and M is the mini-batch size (so m = 5000 · M here)

※ Notation
  • superscript round brackets x^(i) : i-th training example
  • superscript square brackets z^[l] : l-th layer of the neural network
  • superscript curly brackets X^{t} : t-th mini-batch
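
A minimal sketch of the split in NumPy, assuming X has shape (n_x, m) and Y has shape (1, m); the shuffle step and the helper name make_mini_batches are my own additions for illustration:

```python
import numpy as np

def make_mini_batches(X, Y, batch_size=1000, seed=0):
    """Split (X, Y) column-wise into a list of mini-batches (X^{t}, Y^{t})."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    perm = rng.permutation(m)                  # shuffle examples before splitting
    X, Y = X[:, perm], Y[:, perm]

    mini_batches = []
    for start in range(0, m, batch_size):
        end = min(start + batch_size, m)       # the last mini-batch may be smaller
        mini_batches.append((X[:, start:end], Y[:, start:end]))
    return mini_batches

# e.g. 5,000,000 examples with batch_size=1000 gives 5000 mini-batches X^{1} ... X^{5000}
```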

Understanding mini-batch gradient descent

  • Training with mini-batch gradient descent (see the sketch below)

    • the cost J^{t} is computed using only X^{t}, Y^{t}
    • the plot of J^{t} should still trend downwards, but it is noisier than with batch gradient descent.
  • the parameter you need to choose is the mini-batch size

    • if mini-batch size = m : Batch gradient descent, (X^{1}, Y^{1}) = (X, Y)
    • if mini-batch size = 1 : Stochastic gradient descent, but you lose almost all of the speedup from vectorization.
    • In practice : somewhere between 1 and m, so you keep most of the vectorization speedup and still make progress without waiting to process the entire training set.
      • if the training set is small (m <= 2000), just use batch gradient descent
      • typical mini-batch sizes range from 64 up to 512, usually a power of 2 (64, 128, 256, 512).
      • one last tip: make sure each mini-batch X^{t}, Y^{t} fits in CPU/GPU memory.
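
A sketch of the training loop, using logistic regression (a single sigmoid unit) as a stand-in for the full network so the example stays self-contained; it reuses the make_mini_batches helper sketched above, and the learning rate and epoch count are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, Y, learning_rate=0.1, epochs=10, batch_size=1000):
    """Mini-batch gradient descent for a single sigmoid unit (logistic regression)."""
    n_x = X.shape[0]
    w = np.zeros((1, n_x))
    b = 0.0
    costs = []                                       # J^{t}: trends down, but noisily
    for epoch in range(epochs):
        for X_t, Y_t in make_mini_batches(X, Y, batch_size, seed=epoch):
            m_t = X_t.shape[1]
            A = sigmoid(w @ X_t + b)                 # forward pass on X^{t} only
            J_t = -np.mean(Y_t * np.log(A) + (1 - Y_t) * np.log(1 - A))
            dZ = A - Y_t                             # backward pass on the same mini-batch
            dw = dZ @ X_t.T / m_t
            db = dZ.sum() / m_t
            w -= learning_rate * dw                  # one gradient step per mini-batch
            b -= learning_rate * db
            costs.append(J_t)
    return w, b, costs
```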

Understanding exponentially weighted averages

  • Exponentially weighted average: v_t = β · v_{t-1} + (1 − β) · θ_t, with v_0 = 0; v_t is roughly an average over the last 1 / (1 − β) values of θ (see the sketch below).
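
A minimal sketch of the recursion above; the noisy temperature-like series in the usage lines is synthetic, just to give something to smooth:

```python
import numpy as np

def exponentially_weighted_average(theta, beta=0.9):
    """v_t = beta * v_{t-1} + (1 - beta) * theta_t, with v_0 = 0."""
    v = 0.0
    averages = []
    for theta_t in theta:
        v = beta * v + (1 - beta) * theta_t    # roughly averages the last 1/(1-beta) values
        averages.append(v)
    return np.array(averages)

# usage: smooth a noisy (synthetic) daily-temperature-like series
rng = np.random.default_rng(0)
temps = 20 + 5 * np.sin(np.linspace(0, 6, 365)) + rng.standard_normal(365)
smoothed = exponentially_weighted_average(temps, beta=0.9)   # ~ average of the last 10 days
```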

Sequence models

Recurrent Neural Networks

Notation

  • x^<t> : the element at position t of the input sequence
  • T_x : length of the input sequence
  • T_y : length of the output sequence
  • x^(i) : the i-th training example
  • x^(i)<t> : the t-th element of the i-th training example
    • e.g. if the word "apple" is the 100th word in the dictionary, then x^(i)<t> = [0 0 ... 1 ... 0 0] (see the sketch below)
    • i.e. a one-hot vector with a 1 at the 100th position and 0 everywhere else
  • T_x^(i) : length of the input sequence of the i-th training example
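
A minimal sketch of the one-hot encoding, using a tiny invented vocabulary (the lecture's dictionary is on the order of 10,000 words):

```python
import numpy as np

# tiny invented vocabulary; the index of each word is its position in the dictionary
vocab = ["a", "aaron", "an", "and", "apple", "zulu"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return x^(i)<t> as a one-hot column vector of size (len(vocab), 1)."""
    x = np.zeros((len(vocab), 1))
    x[word_to_index[word]] = 1.0     # 1 at the word's dictionary position, 0 everywhere else
    return x

x_t = one_hot("apple")               # [[0] [0] [0] [0] [1] [0]] for this toy vocabulary
```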

Recurrent Neural Network Model

  • Problems
    • inputs, outputs can be different lengths in different examples.
    • doesn't share features learned across different positions of text.
  • An RNN uses only the inputs that come earlier in the sequence, so it cannot use information from later positions.
    • Bidirectional RNNs (BRNN) address this.
  • a^<1> = g(w_aa * a^<0> + w_ax * x^<1> + b_a)
  • y_hat^<1> = g(w_ya * a^<1> + b_y)
  • Simplified RNN notation (see the sketch below)
    • a^<t> = g(W_aa * a^<t-1> + W_ax * x^<t> + b_a) can be rewritten as a^<t> = g(W_a [a^<t-1> ; x^<t>] + b_a)
    • y_hat^<t> = g(W_ya * a^<t> + b_y)
    • W_a = [ W_aa | W_ax ]
      • if sizeof( a^<t> ) = 100, W_aa : (100, 100)
      • if sizeof( x^<t> ) = 10000, W_ax : (100, 10000)
      • ∴ W_a : (100, 10100)
      • ∵ [ W_aa | W_ax ] [ a^<t-1> ; x^<t> ] = W_aa * a^<t-1> + W_ax * x^<t>
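
A minimal sketch of one forward step, assuming tanh for the hidden activation g and softmax for the output (the notes leave g unspecified); the weight shapes match the sizes above, and the random initialization is arbitrary:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def rnn_cell_forward(x_t, a_prev, W_aa, W_ax, W_ya, b_a, b_y):
    """a^<t> = tanh(W_aa a^<t-1> + W_ax x^<t> + b_a), y_hat^<t> = softmax(W_ya a^<t> + b_y)."""
    a_t = np.tanh(W_aa @ a_prev + W_ax @ x_t + b_a)

    # equivalent stacked form: W_a = [W_aa | W_ax] applied to [a^<t-1> ; x^<t>]
    W_a = np.hstack([W_aa, W_ax])                       # (100, 10100) for the sizes above
    assert np.allclose(a_t, np.tanh(W_a @ np.vstack([a_prev, x_t]) + b_a))

    y_hat_t = softmax(W_ya @ a_t + b_y)
    return a_t, y_hat_t

# shapes from the notes: a^<t> has 100 units, x^<t> is a 10,000-dim one-hot vector
n_a, n_x = 100, 10_000
rng = np.random.default_rng(0)
W_aa = 0.01 * rng.standard_normal((n_a, n_a))
W_ax = 0.01 * rng.standard_normal((n_a, n_x))
W_ya = 0.01 * rng.standard_normal((n_x, n_a))           # output over the vocabulary
b_a, b_y = np.zeros((n_a, 1)), np.zeros((n_x, 1))
a_prev = np.zeros((n_a, 1))
x_t = np.zeros((n_x, 1)); x_t[99] = 1.0                 # e.g. the 100th dictionary word ("apple")
a_t, y_hat_t = rnn_cell_forward(x_t, a_prev, W_aa, W_ax, W_ya, b_a, b_y)
```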

Back propagation through time
