Deep Learning

Improving Deep Neural Networks: Hyperparameter tuning

Optimization algorithms

Mini-batch gradient descent

  • Vectorization allows you to efficiently compute on m examples.

    • X = [x^(1) x^(2) ... x^(m)] / size: (n_x, m)
    • Y = [y^(1) y^(2) ... y^(m)] / size: (1, m)
  • Split the training set into smaller "baby" training sets, called mini-batches (see the sketch after the notation note below).

    • X = [ X^{1} X^{2} ... X^{5000} ], where X^{t} = [ x^(1) x^(2) ... x^(M) ] and M is the mini-batch size (so m = 5000 · M here)

※ Notation
  • superscript round brackets x^(i) : i-th training example
  • superscript square brackets z^[l] : l-th layer of the neural network
  • superscript curly brackets X^{t} : t-th mini-batch
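
A minimal sketch of the split in NumPy, assuming X has shape (n_x, m) and Y has shape (1, m); the shuffle step and the helper name make_mini_batches are my own additions for illustration:

```python
import numpy as np

def make_mini_batches(X, Y, batch_size=1000, seed=0):
    """Split (X, Y) column-wise into a list of mini-batches (X^{t}, Y^{t})."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    perm = rng.permutation(m)                  # shuffle examples before splitting
    X, Y = X[:, perm], Y[:, perm]

    mini_batches = []
    for start in range(0, m, batch_size):
        end = min(start + batch_size, m)       # the last mini-batch may be smaller
        mini_batches.append((X[:, start:end], Y[:, start:end]))
    return mini_batches

# e.g. 5,000,000 examples with batch_size=1000 gives 5000 mini-batches X^{1} ... X^{5000}
```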

Understanding mini-batch gradient descent

  • Training with mini-batch gradient descent (see the sketch below)

    • the cost J^{t} is computed using only X^{t}, Y^{t}
    • the plot of J^{t} should still trend downwards, but it is noisier than with batch gradient descent.
  • the parameter you need to choose is the mini-batch size

    • if mini-batch size = m : Batch gradient descent, (X^{1}, Y^{1}) = (X, Y)
    • if mini-batch size = 1 : Stochastic gradient descent, but you lose almost all of the speedup from vectorization.
    • In practice : somewhere between 1 and m, so you keep most of the vectorization speedup and still make progress without waiting to process the entire training set.
      • if the training set is small (m <= 2000), just use batch gradient descent
      • typical mini-batch sizes range from 64 up to 512, usually a power of 2 (64, 128, 256, 512).
      • one last tip: make sure each mini-batch X^{t}, Y^{t} fits in CPU/GPU memory.
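
A sketch of the training loop, using logistic regression (a single sigmoid unit) as a stand-in for the full network so the example stays self-contained; it reuses the make_mini_batches helper sketched above, and the learning rate and epoch count are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, Y, learning_rate=0.1, epochs=10, batch_size=1000):
    """Mini-batch gradient descent for a single sigmoid unit (logistic regression)."""
    n_x = X.shape[0]
    w = np.zeros((1, n_x))
    b = 0.0
    costs = []                                       # J^{t}: trends down, but noisily
    for epoch in range(epochs):
        for X_t, Y_t in make_mini_batches(X, Y, batch_size, seed=epoch):
            m_t = X_t.shape[1]
            A = sigmoid(w @ X_t + b)                 # forward pass on X^{t} only
            J_t = -np.mean(Y_t * np.log(A) + (1 - Y_t) * np.log(1 - A))
            dZ = A - Y_t                             # backward pass on the same mini-batch
            dw = dZ @ X_t.T / m_t
            db = dZ.sum() / m_t
            w -= learning_rate * dw                  # one gradient step per mini-batch
            b -= learning_rate * db
            costs.append(J_t)
    return w, b, costs
```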

Understanding exponentially weighted averages

  • Exponentially weighted average: v_t = β · v_{t-1} + (1 − β) · θ_t, with v_0 = 0; v_t is roughly an average over the last 1 / (1 − β) values of θ (see the sketch below).
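
A minimal sketch of the recursion above; the noisy temperature-like series in the usage lines is synthetic, just to give something to smooth:

```python
import numpy as np

def exponentially_weighted_average(theta, beta=0.9):
    """v_t = beta * v_{t-1} + (1 - beta) * theta_t, with v_0 = 0."""
    v = 0.0
    averages = []
    for theta_t in theta:
        v = beta * v + (1 - beta) * theta_t    # roughly averages the last 1/(1-beta) values
        averages.append(v)
    return np.array(averages)

# usage: smooth a noisy (synthetic) daily-temperature-like series
rng = np.random.default_rng(0)
temps = 20 + 5 * np.sin(np.linspace(0, 6, 365)) + rng.standard_normal(365)
smoothed = exponentially_weighted_average(temps, beta=0.9)   # ~ average of the last 10 days
```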

Sequence models

Recurrent Neural Networks

Notation

  • x^<t> : the element at position t of the input sequence
  • T_x : length of the input sequence
  • T_y : length of the output sequence
  • x^(i) : the i-th training example
  • x^(i)<t> : the t-th element of the i-th training example
    • e.g. if the word "apple" is the 100th word in the dictionary, then x^(i)<t> = [0 0 ... 1 ... 0 0] (see the sketch below)
    • i.e. a one-hot vector with a 1 at the 100th position and 0 everywhere else
  • T_x^(i) : length of the input sequence of the i-th training example
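
A minimal sketch of the one-hot encoding, using a tiny invented vocabulary (the lecture's dictionary is on the order of 10,000 words):

```python
import numpy as np

# tiny invented vocabulary; the index of each word is its position in the dictionary
vocab = ["a", "aaron", "an", "and", "apple", "zulu"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return x^(i)<t> as a one-hot column vector of size (len(vocab), 1)."""
    x = np.zeros((len(vocab), 1))
    x[word_to_index[word]] = 1.0     # 1 at the word's dictionary position, 0 everywhere else
    return x

x_t = one_hot("apple")               # [[0] [0] [0] [0] [1] [0]] for this toy vocabulary
```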

Recurrent Neural Network Model

  • Problems
    • inputs, outputs can be different lengths in different examples.
    • doesn't share features learned across different positions of text.
  • An RNN uses only the inputs that come earlier in the sequence, so it cannot use information from later positions.
    • Bidirectional RNNs (BRNN) address this.
  • a^<1> = g(w_aa * a^<0> + w_ax * x^<1> + b_a)
  • y_hat^<1> = g(w_ya * a^<1> + b_y)
  • Simplified RNN notation (see the sketch below)
    • a^<t> = g(W_aa * a^<t-1> + W_ax * x^<t> + b_a) can be rewritten as a^<t> = g(W_a [a^<t-1> ; x^<t>] + b_a)
    • y_hat^<t> = g(W_ya * a^<t> + b_y)
    • W_a = [ W_aa | W_ax ]
      • if sizeof( a^<t> ) = 100, W_aa : (100, 100)
      • if sizeof( x^<t> ) = 10000, W_ax : (100, 10000)
      • ∴ W_a : (100, 10100)
      • ∵ [ W_aa | W_ax ] [ a^<t-1> ; x^<t> ] = W_aa * a^<t-1> + W_ax * x^<t>
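
A minimal sketch of one forward step, assuming tanh for the hidden activation g and softmax for the output (the notes leave g unspecified); the weight shapes match the sizes above, and the random initialization is arbitrary:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def rnn_cell_forward(x_t, a_prev, W_aa, W_ax, W_ya, b_a, b_y):
    """a^<t> = tanh(W_aa a^<t-1> + W_ax x^<t> + b_a), y_hat^<t> = softmax(W_ya a^<t> + b_y)."""
    a_t = np.tanh(W_aa @ a_prev + W_ax @ x_t + b_a)

    # equivalent stacked form: W_a = [W_aa | W_ax] applied to [a^<t-1> ; x^<t>]
    W_a = np.hstack([W_aa, W_ax])                       # (100, 10100) for the sizes above
    assert np.allclose(a_t, np.tanh(W_a @ np.vstack([a_prev, x_t]) + b_a))

    y_hat_t = softmax(W_ya @ a_t + b_y)
    return a_t, y_hat_t

# shapes from the notes: a^<t> has 100 units, x^<t> is a 10,000-dim one-hot vector
n_a, n_x = 100, 10_000
rng = np.random.default_rng(0)
W_aa = 0.01 * rng.standard_normal((n_a, n_a))
W_ax = 0.01 * rng.standard_normal((n_a, n_x))
W_ya = 0.01 * rng.standard_normal((n_x, n_a))           # output over the vocabulary
b_a, b_y = np.zeros((n_a, 1)), np.zeros((n_x, 1))
a_prev = np.zeros((n_a, 1))
x_t = np.zeros((n_x, 1)); x_t[99] = 1.0                 # e.g. the 100th dictionary word ("apple")
a_t, y_hat_t = rnn_cell_forward(x_t, a_prev, W_aa, W_ax, W_ya, b_a, b_y)
```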

Back propagation through time
