Transformer based Encoder-Decoder

Pre-trained Language Representation (BERT, XLNet)

How can pretrained language representation models such as BERT and XLNet be applied to this task?

  • In the DGT task the input is not natural language, so these models are hard to use on the encoder side.
  • Pretrained word embeddings could probably be reused in the decoder (see the sketch after this list).
  • Using XLNet, an autoregressive model (Transformer-XL based), looks like the better fit.
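
As a rough illustration of the decoder-side idea, the sketch below initializes a decoder embedding table from BERT's pretrained word embeddings (assuming the decoder shares BERT's WordPiece vocabulary). The `nn.TransformerDecoder` stack and the random `memory` tensor are placeholders for this note, not a proposed architecture.

```python
# Sketch: initialize a decoder embedding layer from BERT's pretrained
# word embeddings (assumes the decoder shares BERT's WordPiece vocab).
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

bert = BertModel.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

vocab_size, d_model = bert.embeddings.word_embeddings.weight.shape

# Copy (and optionally freeze) the pretrained embedding table.
decoder_embedding = nn.Embedding(vocab_size, d_model)
decoder_embedding.weight.data.copy_(bert.embeddings.word_embeddings.weight.data)
decoder_embedding.weight.requires_grad = False  # or True to fine-tune

# A plain Transformer decoder on top of the pretrained embeddings.
decoder_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=4)

# Toy forward pass: target tokens attend over some encoded records (memory).
tokens = tokenizer("The Raptors defeated the Celtics", return_tensors="pt")["input_ids"]
memory = torch.randn(1, 600, d_model)          # stand-in for encoded records
out = decoder(decoder_embedding(tokens), memory)
print(out.shape)                               # (1, seq_len, d_model)
```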

Transformer based Encoder-Decoder

๋ฌธ์ œ์ : Rotowire ๋ฐ์ดํ„ฐ์—์„œ ํ•˜๋‚˜์˜ ๊ฒŒ์ž„ ๋‹น ๋Œ€๋žต 600๊ฐœ์˜ record๋กœ ๊ฐ€์ •ํ•˜๋ฉด, content selection ๊ฐ™์€ ๋ณ„๋„์˜ ์ „์ฒ˜๋ฆฌ(?), ํ•„ํ„ฐ๋ง(?) ๊ณผ์ •์„ ๊ฑฐ์น˜์ง€ ์•Š์œผ๋ฉด input์ด ๋„ˆ๋ฌด ๋งŽ์Œ

์ด์ „ ๋ชจ๋ธ์—์„œ์˜ ํ•ด๊ฒฐ ๋ฐฉ๋ฒ•: attention, copying mechanism, pointer network, gate ๊ฐ™์€ ํ…Œํฌ๋‹‰์„ ์ ์šฉํ•˜์—ฌ ๋ฌธ์ œ์ ์„ ๋ณด์™„ํ•˜๊ณ ์ž ํ•จ

Why transformer?

  • RotoWire summaries follow a rough template (game result, player stats, etc.).
  • We expect multi-head attention to pick up this template structure.

A summary consists of 8~10 sentences on average, and each sentence focuses on a different part of the data. We expect each head in multi-head attention to attend to a different part. (Positional encoding is probably unnecessary in the encoder, since the records form an unordered set.)

  • (๋…ผ์˜) input์ด (N, 600, d)๊ฐ€ ๋˜๋Š”๋ฐ, record filtering์ด ํ•„์š”ํ•œ๊ฐ€? ๋Œ€๋ถ€๋ถ„์˜ record๋Š” ์•ˆ์”€ (ํŠนํžˆ ์„ ์ˆ˜ stat)

Transformer vs Transformer-XL

  • Mitigates the long-term dependency problem of a fixed-length context via segment-level recurrence with state reuse (see the simplified sketch after this list).
  • Relative positional encoding.
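
To make the recurrence point concrete, here is a heavily simplified single-layer sketch of segment-level recurrence with state reuse (relative positional encoding is omitted): the previous segment's hidden states are cached with stop-gradient and concatenated to the keys/values of the current segment.

```python
# Simplified sketch of Transformer-XL style segment-level recurrence:
# keys/values attend over [cached previous segment ; current segment].
import torch
import torch.nn as nn

class RecurrentSegmentLayer(nn.Module):
    def __init__(self, d_model=256, nhead=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_model * 4), nn.ReLU(),
                                nn.Linear(d_model * 4, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, seg, memory=None):
        # seg: (B, L, d) current segment; memory: (B, M, d) cached states
        # from the previous segment, reused without backpropagating through them.
        kv = seg if memory is None else torch.cat([memory, seg], dim=1)
        h = self.norm1(seg + self.attn(seg, kv, kv, need_weights=False)[0])
        h = self.norm2(h + self.ff(h))
        return h, seg.detach()   # this segment becomes the next segment's memory

layer = RecurrentSegmentLayer()
memory = None
for segment in torch.randn(3, 2, 128, 256):   # 3 consecutive segments of length 128
    out, memory = layer(segment, memory)
print(out.shape, memory.shape)                # (2, 128, 256) each
```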

๋…ผ์˜์‚ฌํ•ญ

  • Data preprocessing (table -> triplet): it would be good to factor this out as a shared module (a rough sketch follows this list).
  • (Question) In the copy mechanism of [Wiseman et al, 2017], is there a separate vocab dictionary for the {e_j}? It seems a mapping like {e_j <-> entity} would be needed.
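
A rough sketch of what a shared table -> triplet module could look like, assuming the public RotoWire JSON layout (`box_score`, `home_line`, `vis_line`, `PLAYER_NAME`); the field names and the local file path are assumptions and should be checked against the actual dataset files.

```python
# Sketch of a shared table -> (entity, type, value) triplet module,
# assuming the public RotoWire JSON layout; verify field names against the data.
import json

def game_to_triplets(game):
    triplets = []

    # Team-level records, e.g. ("Raptors", "TEAM-PTS", "122").
    for side in ("home_line", "vis_line"):
        line = game.get(side, {})
        team = line.get("TEAM-NAME", side)
        for rec_type, value in line.items():
            triplets.append((team, rec_type, str(value)))

    # Player-level records from the box score, e.g. ("Kyle Lowry", "PTS", "22").
    box = game.get("box_score", {})
    names = box.get("PLAYER_NAME", {})
    for rec_type, per_player in box.items():
        if rec_type == "PLAYER_NAME":
            continue
        for player_id, value in per_player.items():
            entity = names.get(player_id, player_id)
            triplets.append((entity, rec_type, str(value)))

    return triplets

if __name__ == "__main__":
    with open("rotowire/train.json") as f:   # hypothetical local path
        games = json.load(f)
    records = game_to_triplets(games[0])
    print(len(records), records[:3])
```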