During pre-training, OpenAI GPT trains a language model with a multi-layer Transformer decoder (cite).
Transformer Decoder
The LM is trained on the concatenation of the input context and the output. When generating, the Transformer's self-attention therefore attends over both the input context and the previously generated output tokens.
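A minimal sketch of this decoder-only setup is below, assuming PyTorch; the class name, vocabulary size, and layer sizes are illustrative, not the actual GPT configuration. Context and output ids are concatenated into one sequence and a causally masked self-attention stack is trained as a language model, so each position attends to both the context and the output generated so far.

```python
import torch
import torch.nn as nn

class DecoderOnlyLM(nn.Module):
    """Hypothetical decoder-only Transformer LM over [context ; output]."""
    def __init__(self, vocab_size=32000, d_model=512, n_heads=8, n_layers=6, max_len=1024):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        # An "encoder" stack run with a causal mask behaves as a decoder-only
        # Transformer: masked self-attention only, no cross-attention.
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        # tokens: (batch, seq) -- context tokens followed by output tokens
        b, t = tokens.shape
        pos = torch.arange(t, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(t).to(tokens.device)
        h = self.blocks(x, mask=causal_mask)
        return self.lm_head(h)  # next-token logits for every position

# Usage: concatenate context and output ids, shift by one for the LM loss.
model = DecoderOnlyLM()
context = torch.randint(0, 32000, (2, 50))
output = torch.randint(0, 32000, (2, 30))
seq = torch.cat([context, output], dim=1)
logits = model(seq[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                   seq[:, 1:].reshape(-1))
```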
Transformer Decoder with Memory-Compressed Attention
To handle longer sequences, we modify the multi-head self-attention of the Transformer to reduce memory usage by limiting the dot products between Q and K.
- Local attention: Sequence tokens are divided into blocks of similar length and attention is performed in each block independently.
- Memory-compressed attention: After projecting the tokens into the query, key, and value embeddings, we reduce the number of keys and values by using a strided convolution. The number of queries remains unchanged. This modification allows us to divide the number of activations by a compression factor. In our experiments we use convolution kernels of size 3 with stride 3. In contrast to local attention layers, which only capture the local information within a block, the memory-compressed attention layers are able to exchange information globally on the entire sequence (see the sketch after this list).