Ask Me Anything: Dynamic Memory Networks for Natural Language Processing

Resources

Background

Gated Recurrent Unit (GRU)

The GRU is a type of Recurrent Neural Network (RNN). At each timestep t it accepts an input x_t and updates its hidden state h_t according to the following formulas.
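With \sigma denoting the logistic sigmoid and \circ element-wise multiplication, the standard GRU update (the paper's notation may differ slightly in superscripts) is:

```
z_t        = \sigma(W^{(z)} x_t + U^{(z)} h_{t-1} + b^{(z)})
r_t        = \sigma(W^{(r)} x_t + U^{(r)} h_{t-1} + b^{(r)})
\tilde{h}_t = \tanh(W x_t + r_t \circ U h_{t-1} + b^{(h)})
h_t        = z_t \circ \tilde{h}_t + (1 - z_t) \circ h_{t-1}
```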

x_t is the input vector at timestep t. W, U, and b are learnable parameters. From these we compute the vectors z_t, r_t, the candidate state \tilde{h}_t, and the new hidden state h_t. A single timestep of a GRU can thus be summarized as h_t = GRU(x_t, h_{t-1}).

Now let us analyze the above mechanism in terms of z_t and r_t with a bit of intuition. Since both are outputs of the sigmoid function, their entries lie between 0 and 1. The reset gate vector r_t determines how much of the previous hidden state is used when forming the candidate state \tilde{h}_t. The update gate vector z_t, on the other hand, determines the ratio in which the previous hidden state and the candidate state are mixed to produce the new hidden state.
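As a concrete illustration, here is a minimal NumPy sketch of a single GRU step implementing the equations above; the parameter names and the packing of params are ours, purely for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU timestep: h_t = GRU(x_t, h_{t-1})."""
    W_z, U_z, b_z, W_r, U_r, b_r, W_h, U_h, b_h = params
    z_t = sigmoid(W_z @ x_t + U_z @ h_prev + b_z)                # update gate
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev + b_r)                # reset gate
    h_tilde = np.tanh(W_h @ x_t + r_t * (U_h @ h_prev) + b_h)    # candidate state
    return z_t * h_tilde + (1.0 - z_t) * h_prev                  # mix candidate and previous state
```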

Introduction

The importance of QA (Question Answering) problems in NLP:

Most, if not all, tasks in natural language processing can be cast as a question answering problem

Dynamic Memory Network (DMN):

... a neural network based framework for general question answering tasks that is trained using raw input-question-answer triplets.

The DMN first computes a representation for all inputs and the question. The question representation then triggers an iterative process that searches the inputs and retrieves relevant facts. The DMN memory module then reasons over retrieved facts and provides a vector representation of all relevant information to an answer module which generates the answer.

Dynamic Memory Networks

DMN detailed

The forward propagation algorithm is described below for each module.

Input Module

Takes T_C input sentences and converts them into fact representation vectors.

  1. Convert the words of the sentences into embedding vectors (pretrained GloVe).
  2. Append an end-of-sentence token to each sentence and concatenate all sentences into a single word sequence.
  3. h_t = GRU(w_t, h_{t-1})
  4. The hidden state h_t at each end-of-sentence token is taken as c_i.

=> Returns the T_C vectors c_1, c_2, ..., c_{T_C}
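A minimal PyTorch sketch of this module, under our own naming and input conventions (word_ids as a LongTensor of token ids, eos_positions as the indices of the end-of-sentence tokens); a real implementation would load the pretrained GloVe weights into the embedding and batch the loop:

```python
import torch
import torch.nn as nn

class InputModule(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)  # initialized from GloVe in practice
        self.gru = nn.GRUCell(embed_dim, hidden_dim)
        self.hidden_dim = hidden_dim

    def forward(self, word_ids, eos_positions):
        """word_ids: concatenated sentences with <EOS> tokens, LongTensor of shape (T,).
        eos_positions: indices of the <EOS> tokens.
        Returns the T_C fact vectors c_1, ..., c_{T_C}."""
        h = torch.zeros(1, self.hidden_dim)
        facts = []
        for t, w in enumerate(word_ids):
            h = self.gru(self.embed(w).unsqueeze(0), h)   # h_t = GRU(w_t, h_{t-1})
            if t in eos_positions:
                facts.append(h.squeeze(0))                # c_i = h_t at the end-of-sentence token
        return torch.stack(facts)
```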

Question Module

Takes a question consisting of T_Q words and converts it into a question representation vector.

  1. Convert the words into embedding vectors using the same embedding layer as the Input Module.
  2. q_t = GRU(w_t, q_{t-1})

=> Returns the last hidden state vector, q_{T_Q}
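The Question Module is the same GRU pass, keeping only the final hidden state. Continuing the sketch above, with the embedding layer shared with the Input Module:

```python
class QuestionModule(nn.Module):
    def __init__(self, embed, hidden_dim):
        super().__init__()
        self.embed = embed                                   # shared with the Input Module
        self.gru = nn.GRUCell(embed.embedding_dim, hidden_dim)
        self.hidden_dim = hidden_dim

    def forward(self, word_ids):
        q = torch.zeros(1, self.hidden_dim)
        for w in word_ids:
            q = self.gru(self.embed(w).unsqueeze(0), q)      # q_t = GRU(w_t, q_{t-1})
        return q.squeeze(0)                                  # q = q_{T_Q}
```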

Episodic Memory Module

i = 1
m^0 = q
while not end_condition:
    for t in range(1, T_C + 1):
        1. calculate the feature vector z(c_t, m^{i-1}, q), where
           z(c, m, q) = [c, m, q, c*m, c*q, |c-m|, |c-q|, c*W*m, c*W*q]
        2. calculate the degree of attention g_t^i by passing z through a
           two-layer fully connected network.
        3. h_t^i = g_t^i * GRU(c_t, h_{t-1}^i) + (1 - g_t^i) * h_{t-1}^i
    4. e^i = h_{T_C}^i
    5. m^i = GRU(e^i, m^{i-1})
    6. i = i + 1

=> Returns the final memory vector m^{T_M}
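Continuing the sketch, a condensed Episodic Memory Module; a fixed num_passes stands in for end_condition here, and the two-layer gate network and all names are our assumptions rather than the paper's exact hyperparameters:

```python
class EpisodicMemory(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.gru = nn.GRUCell(hidden_dim, hidden_dim)        # fact-level GRU
        self.mem_gru = nn.GRUCell(hidden_dim, hidden_dim)    # memory-update GRU
        self.W_b = nn.Parameter(torch.randn(hidden_dim, hidden_dim) * 0.01)
        self.score = nn.Sequential(                          # two-layer gate network
            nn.Linear(7 * hidden_dim + 2, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, 1), nn.Sigmoid(),
        )

    def interact(self, c, m, q):
        """Feature vector z(c, m, q)."""
        return torch.cat([c, m, q, c * m, c * q, (c - m).abs(), (c - q).abs(),
                          (c @ self.W_b @ m).view(1), (c @ self.W_b @ q).view(1)])

    def forward(self, facts, q, num_passes=3):
        m = q                                                # m^0 = q
        for _ in range(num_passes):                          # one episode per pass
            h = torch.zeros_like(q)
            for c in facts:                                  # t = 1 .. T_C
                g = self.score(self.interact(c, m, q))       # attention gate g_t^i
                h = g * self.gru(c.unsqueeze(0), h.unsqueeze(0)).squeeze(0) + (1 - g) * h
            m = self.mem_gru(h.unsqueeze(0), m.unsqueeze(0)).squeeze(0)  # m^i = GRU(e^i, m^{i-1})
        return m
```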

Answer Module

a_0 = m^{T_M}
while argmax(y_t) is not end_of_sequence:
    1. a_t = GRU([y_{t-1}, q], a_{t-1})
    2. y_t = softmax(W * a_t)

The token generated at the t-th timestep is argmax(y_t).
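Continuing the sketch, a hypothetical Answer Module decoder; the zero-initialized y_0, the max_len cutoff, and eos_id are our assumptions:

```python
class AnswerModule(nn.Module):
    def __init__(self, hidden_dim, vocab_size):
        super().__init__()
        self.gru = nn.GRUCell(vocab_size + hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, m, q, eos_id, max_len=20):
        a = m.unsqueeze(0)                                   # a_0 = m^{T_M}
        y = torch.zeros(1, self.out.out_features)            # y_0 (a start-token one-hot in practice)
        tokens = []
        for _ in range(max_len):
            a = self.gru(torch.cat([y, q.unsqueeze(0)], dim=1), a)  # a_t = GRU([y_{t-1}, q], a_{t-1})
            y = torch.softmax(self.out(a), dim=1)                   # y_t = softmax(W a_t)
            tok = int(y.argmax(dim=1))                              # generated token: argmax(y_t)
            tokens.append(tok)
            if tok == eos_id:
                break
        return tokens
```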

Training

The Facebook bAbI dataset provides training examples, each of which contains several information sentences, a question, the answer to the question, and a label indicating which information sentences support the answer. With these labels, we first train the episodic memory module that computes the degree of attention, g_t^i. Once the attention is reasonably trained, we combine the loss functions of the attention and answer components and continue training.
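Schematically, and assuming plain cross-entropy losses for both parts (the paper's exact losses and weighting are not reproduced here), the two-stage objective might look like:

```python
import torch
import torch.nn.functional as F

# Hypothetical two-stage objective sketch.
def gate_loss(gate_scores, supporting_fact):
    # gate_scores: (T_C,) per-fact attention scores for one pass, treated as logits over the facts;
    # supporting_fact: index of the labeled supporting sentence.
    return F.cross_entropy(gate_scores.unsqueeze(0), torch.tensor([supporting_fact]))

def answer_loss(answer_logits, answer_id):
    # answer_logits: (1, vocab_size) pre-softmax scores from the answer module.
    return F.cross_entropy(answer_logits, torch.tensor([answer_id]))

# Stage 1: train the attention with gate_loss alone.
# Stage 2: continue training with the joint loss gate_loss(...) + answer_loss(...).
```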

Performance

A separate DMN is trained for each task type (fact-based reasoning on bAbI, sentiment analysis, part-of-speech tagging, ...). It achieved state-of-the-art results (at the time) on most benchmarks; refer to the paper for details.

Discussion

The model reportedly does not work well when the number of given sentences is very large, presumably because it tries to summarize all question-relevant information into a single vector. For example, having this model read a book and then answer exam questions about it would likely be difficult. It might be worth applying it to domains where short-term memory suffices, or building a model that borrows the mechanism by which short-term memory is consolidated into long-term memory.