attention mechanism - beyondnlp/nlp GitHub Wiki

์–ดํ…์…˜(attention)์ด๋ž€

vanilla rnn์—์„œ Vanishing Gradient Problem๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ๋‚˜์˜จ lstm์—์„œ๋„ ๋ฌธ์ž์—ด์ด ๊ธธ์–ด์ง€๋ฉด ํšจ๊ณผ์ ์œผ๋กœ ์ •๋ณด๋ฅผ ์••์ถ•ํ•˜์ง€ ๋ชปํ•˜๋Š” ๋ฌธ์ œ๊ฐ€ ๋ฐœ์ƒํ•˜์—ฌ lstm์„ ์‚ฌ์šฉํ•œ seq2seq๋ชจ๋ธ์—์„œ๋„ ๋ฒˆ์—ญ์˜ ํ’ˆ์งˆ์ด ๋–จ์–ด์ง€๋Š” ๋ฌธ์ œ๊ฐ€ ๋ฐœ์ƒํ•œ๋‹ค. ๋ชจ๋“  ์ •๋ณด๋ฅผ ๋‹ค ์‚ฌ์šฉํ• ๋ ค๋Š”๋ฐ ๋ฌธ์ œ๊ฐ€ ์žˆ๋‹ค๊ณ ๋„ ๋ณผ์ˆ˜ ์žˆ๋‹ค. ์ด๋Ÿฐ ๋ฌธ์ œ๋ฅผ ํšจ๊ณผ์ ์œผ๋กœ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ๋‚˜์˜จ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด attention mechanism์ด๋‹ค.( attention๋ง๊ณ ๋„ ์œ„์—์„œ ์–ธ๊ธ‰ํ•œ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด์„œ bidirectional lstm์œผ๋กœ๋„ ์–ด๋А ์ •๋„ ํ•ด๊ฒฐ์ด ๋œ๋‹ค. )

์ด๋ฅผ ์ข€ ๋” ์ดํ•ด๊ฐ€ ์‰ฝ๊ฒŒ ์„ค๋ช…ํ•˜๋ฉด ๋‚˜๋Š” ๋งฅ์ฃผ๊ฐ€ ์ข‹๋‹ค๋Š” ์˜๋ฏธ์˜ ๋…์ผ์–ด ๋ฌธ์žฅ(Ich mochte ein bier)๊ณผ ์˜์–ด ๋ฌธ์žฅ(Iโ€™d like a beer)์„ seq2seq๋ฅผ ํ†ตํ•ด ๋ฒˆ์—ญํ•œ๋‹ค๊ณ  ํ•  ๋•Œ encoder2decoder ์‚ฌ๋žŒ์€ ์ง๊ด€์ ์œผ๋กœ beer์€ bier์—๋งŒ ์˜ํ–ฅ์„ ๋ฐ›๋Š” ๋‹ค๋Š” ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ๊ณ  ์ด๋ฅผ ์•Œ๊ณ ๋ฆฌ์ฆ˜ํ™” ํ•œ ๊ฒƒ์ด๋‹ค. beer๋ฅผ ์˜ˆ์ธกํ•  ๋•Œ bier์ด์™ธ์— ๊ฒƒ์€ ๋ณ„๋‹ค๋ฅธ ์‹ ๊ฒฝ์„ ์“ธ ํ•„์š”๊ฐ€ ์—†๊ณ  ์˜คํžˆ๋ ค ์„ฑ๋Šฅ์„ ์ €ํ•˜์‹œํ‚ค๋Š” ์š”์ธ์ด ๋˜๊ธฐ ๋•Œ๋ฌธ์— bier์— ์ง‘์ค‘ํ•˜๊ฒ ๋‹ค๋Š” ์˜๋ฏธ์ด๋‹ค. ๊ทธ๋Ÿผ ์–ด๋–ค ์ •๋ณด์— ๊ธฐ๋ฐ˜ํ•˜์—ฌ beer๋ฅผ ์˜ˆ์ธกํ•  ๋•Œ๋Š” bier์— ์˜ํ–ฅ๋ฐ›๋Š” ๋‹ค๋Š” ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ์„๊นŒ? "encoder์—์„œ bier๋ฅผ ์ž…๋ ฅ์œผ๋กœ ํ•ด์„œ ๋งŒ๋“  ์ถœ๋ ฅ ๋ฒกํ„ฐ์™€ decoder์—์„œ beer๋ฅผ ๋งŒ๋“ค ๋•Œ ์‚ฌ์šฉํ•˜๋Š” ๋ฒกํ„ฐ๊ฐ€ ์„œ๋กœ ์œ ์‚ฌํ•  ๊ฒƒ์ด๋‹ค"๋ผ๋Š” ๊ฐ€์ •์—์„œ ์ถœ๋ฐœํ•œ๋‹ค ( ์ด ๋ถ€๋ถ„์„ ๋ณด๋‹ค ๋ณด๋‹ˆ SMT์—์„œ word alignment์™€ ์œ ์‚ฌํ•œ ๊ฐœ๋…์ด ์•„๋‹Œ๊ฐ€ ํ•˜๋Š” ์ƒ๊ฐ์ด ๋“ค์—ˆ๋‹ค )

how to implement attention

๊ทธ๋Ÿฌ๋ฉด ์–ด๋–ป๊ฒŒ ์ด๋ฅผ ๊ตฌํ˜„ํ•  ๊ฒƒ์ธ๊ฐ€ attention attention

  • H(t)๋Š” ๋””์ฝ”๋”์˜ ๋ฒกํ„ฐ์ด๊ณ  H(s)๋Š” ์ธ์ฝ”๋”์˜ ๋ฒกํ„ฐ์ด๋‹ค ์œ„ ๊ณต์‹์ฒ˜๋Ÿผ
  1. Attention Weights๋Š” H(t)์™€ ๋ชจ๋“  H(s)๋ฅผ ๋‚ด์ ํ•˜์—ฌ ๋‚˜์˜จ ๋ฒกํ„ฐ๋ฅผ softmax๋ฅผ ์ทจํ•ด ํ™•๋ฅ ์„ ๊ตฌํ•œ๋‹ค.
  2. C(t)=Context(t)๋Š” ๋ชจ๋“  H(s)์™€ Attention Weights๋ฅผ ๊ณฑํ•ด์„œ ๋”ํ•œ๋‹ค. ( ์Šค์นผ๋ผ ๊ฐ’์ด ๋‚˜์˜ฌ ๋“ฏ )
  3. attention(H(t))์€ Context(t)์™€ input(H(t))์„ concatenationํ•˜๊ณ  tanh๋ฅผ ํ†ต๊ณผํ•œ ๊ฐ’์ด๋‹ค. ( ๋ฏธ๋ถ„ ๊ฐ€๋Šฅํ•˜๊ธฐ ๋•Œ๋ฌธ์— ํ•™์Šต์ด ๊ฐ€๋Šฅํ•  ๋“ฏ )
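The three steps above can be written out as a minimal NumPy sketch. This is my own illustration of Luong-style dot attention, not code from the original references; the function name luong_attention and the projection W_c are just placeholders for the learned parameters:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def luong_attention(h_t, H_s, W_c):
    """One decoder step of Luong-style (dot) attention.

    h_t : decoder hidden state, shape (d,)
    H_s : encoder hidden states, shape (T, d)
    W_c : learned projection, shape (d, 2d)
    """
    scores = H_s @ h_t                       # 1. dot product of H(t) with every H(s)
    attention_weights = softmax(scores)      # 1. softmax over the scores -> probabilities
    context = attention_weights @ H_s        # 2. weighted sum of encoder states (a vector)
    concat = np.concatenate([context, h_t])  # 3. concatenate Context(t) and H(t)
    attention_h_t = np.tanh(W_c @ concat)    # 3. tanh of a learned projection
    return attention_h_t, attention_weights

# toy example
d, T = 4, 3
rng = np.random.default_rng(0)
h_t = rng.normal(size=d)
H_s = rng.normal(size=(T, d))
W_c = rng.normal(size=(d, 2 * d))
out, weights = luong_attention(h_t, H_s, W_c)
print(weights, weights.sum())  # the attention weights sum to 1
```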

์—ฌ๊ธฐ์„œ stanford nlp์˜ ์ž๋ฃŒ๋ฅผ ์ฐธ๊ณ ํ•˜๋ฉด calcuate attention weight keras code๋ฅผ ๋ณด๋ฉด ์•„๋ž˜์™€ ๊ฐ™์ด ๊ตฌํ˜„๋ผ ์žˆ๋‹ค.

```python
from keras.layers import Input, Dense, Multiply
from keras.models import Model

input_dim = 32  # example input dimension

inputs = Input(shape=(input_dim,))
# ATTENTION PART STARTS HERE
attention_probs = Dense(input_dim, activation='softmax', name='attention_vec')(inputs)
# element-wise multiplication of the inputs and the attention probabilities
attention_mul = Multiply(name='attention_mul')([inputs, attention_probs])
# ATTENTION PART FINISHES HERE
attention_mul = Dense(64)(attention_mul)
output = Dense(1, activation='sigmoid')(attention_mul)
model = Model(inputs=[inputs], outputs=output)
```

A Dense layer with an input_dim * input_dim weight matrix is created and a softmax activation is applied; this output is named attention_probs. Next, inputs and attention_probs are merged by element-wise multiplication (not a matrix product) to produce attention_mul, which then goes through a fully connected layer producing a 64-dimensional output, and finally a sigmoid layer produces the 1-dimensional output.
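Continuing from the model above, here is a hedged usage sketch; the synthetic data, the training call, and the way the attention vector is read back are my own additions modeled on the typical demo for this kind of toy attention layer, not part of the original post:

```python
import numpy as np
from keras.models import Model

# synthetic data: the label is planted into column 5 only, so a trained model
# should concentrate its attention probabilities on that column
x = np.random.normal(size=(10000, input_dim))
y = (np.random.uniform(size=10000) > 0.5).astype('float32')
x[:, 5] = y

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(x, y, epochs=5, batch_size=64, validation_split=0.1, verbose=0)

# read back the learned attention distribution for some inputs
attention_model = Model(inputs=model.input,
                        outputs=model.get_layer('attention_vec').output)
print(attention_model.predict(x[:32]).mean(axis=0))  # expected to peak at index 5
```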

์–ดํ…์…˜์—์„œ ํ•™์Šต์ด ๋˜๋Š” ๋ถ€๋ถ„

๋งˆ์ง€๋ง‰์œผ๋กœ "A Brief Overview of Attention Mechanism" ์˜ ๋ฏธ๋””์—„ ๋ธ”๋กœ๊ทธ ๊ธ€ ๋งˆ์ง€๋ง‰์— ์•„๋ž˜์™€ ๊ฐ™์€ ๋ฌธ์žฅ์ด ์žˆ๋‹ค. There are many variants in the cutting-edge researches, and they basically differ in the choice of score function and attention function, or of soft attention and hard attention (whether differentiable). But basic concepts are all the same.

์ตœ์ฒจ๋‹จ์˜ ์—ฐ๊ตฌ ๋ถ„์•ผ์—๋Š” ๋งŽ์€ ๋ณ€ํ˜•๋“ค์ด ์žˆ์Šต๋‹ˆ๋‹ค. ๊ธฐ๋ณธ์ ์œผ๋กœ scoreํ•จ์ˆ˜์™€ attention ํ•จ์ˆ˜์˜ ์„ ํƒ์ด ๋‹ค๋ฅด๊ณ  ๋˜๋Š” soft attention๊ณผ hard attention( ๋ฏธ๋ถ„์ด ๊ฐ€๋Šฅํ•œ์ง€ )์ด ๋‹ค๋ฆ…๋‹ˆ๋‹ค. ๊ทธ๋Ÿฐ๋ฐ ๊ธฐ๋ณธ์ ์ธ ๊ฐœ๋…์€ ๋ชจ๋‘ ๋™์ผํ•ฉ๋‹ˆ๋‹ค.

  • score( Hi, ^Hi ) is learned as a fully connected network
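For reference, a small sketch of the common score-function variants mentioned in the quote above (the dot, general, and concat forms described by Luong et al.; the parameter names W_a and v_a are illustrative placeholders for learned weights):

```python
import numpy as np

def score_dot(h_t, h_s):
    # dot: score = h_t . h_s
    return h_t @ h_s

def score_general(h_t, h_s, W_a):
    # general: score = h_t^T W_a h_s, with W_a a learned (d x d) matrix
    return h_t @ W_a @ h_s

def score_concat(h_t, h_s, W_a, v_a):
    # concat / additive: score = v_a^T tanh(W_a [h_t; h_s])
    # W_a has shape (k, 2d) and v_a has shape (k,);
    # this is the "fully connected network" form mentioned above
    return v_a @ np.tanh(W_a @ np.concatenate([h_t, h_s]))
```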

attention ๊ณ„์‚ฐ ๋ฐฉ์‹