[Text Mining][Week 5, Part 1] Embeddings

Distributed representation

  • Vector representation that encodes information about the distribution of contexts a word appears in

  • Words that appear in similar contexts have similar representations (and similar meanings, by the distributional hypothesis)

Supplementary note:

Distributed representations address the problems with one-hot vectors: their high dimensionality and the fact that the dot product between any two different one-hot vectors is always 0.

A sparse vector's elements are binary values (0 or 1), while a dense vector's elements are continuous real values.

In other words, a distributed representation re-expresses a sparse vector as a dense one.
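As a point of reference, here is a minimal NumPy sketch (the vocabulary and values are made up for illustration) contrasting a sparse one-hot vector with a dense vector:

```python
import numpy as np

vocab = ["cat", "dog", "honey", "bear"]

# Sparse one-hot vector: |V|-dimensional, binary, and the dot product of two
# different one-hot vectors is always 0.
one_hot = np.zeros(len(vocab))
one_hot[vocab.index("bear")] = 1.0        # [0., 0., 0., 1.]

# Dense vector: low-dimensional, continuous real values; similar words can end
# up with similar vectors once the embeddings are trained.
embeddings = np.random.default_rng(0).normal(size=(len(vocab), 4))
dense = embeddings[vocab.index("bear")]   # 4 real numbers
```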

Word embeddings

  • Pre-trained word embeddings are great for words that appear frequently in the data

  • Unseen words are treated as UNKs and assigned zero or random vectors; everything unseen is assigned the same representation.

Shared structure

Even in languages like English that are not agglutinative (an agglutinative language sits between isolating and inflectional languages; a word's grammatical function is determined by its root and affixes) and aren't highly inflected, words share important structure.

Even if we never see the word "unfriendly" in our data, we should be able to reason about it as: un + friend + ly

Subword models

  • Rather than learning a single representation for each word type w, learn representations z for the set of ngrams that comprise it

That is, each word is split into ngram units and the representations are learned over those pieces.

  • The word itself is included among the ngrams (no matter its length).

  • A word's representation is the sum of those ngram representations

FastText

e(*) = embedding for *
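Below is a rough sketch of the FastText idea under stated assumptions: the hashed ngram table is a stand-in for learned parameters (the real library learns these), and `char_ngrams`/`e` are illustrative names. A word vector is the sum of the vectors of its character ngrams, plus the word itself.

```python
import numpy as np

DIM, BUCKETS = 8, 2**20
rng = np.random.default_rng(0)
ngram_table = rng.normal(size=(BUCKETS, DIM))    # stand-in for learned ngram vectors

def char_ngrams(word, n_min=3, n_max=6):
    w = f"<{word}>"                               # boundary markers, as in FastText
    grams = [w[i:i + n] for n in range(n_min, n_max + 1)
             for i in range(len(w) - n + 1)]
    return grams + [w]                            # the word itself is included

def e(word):
    """e(word) = sum of the embeddings of its ngrams."""
    return sum(ngram_table[hash(g) % BUCKETS] for g in char_ngrams(word))

print(e("unfriendly").shape)    # (8,) -- works even if "unfriendly" was never seen
```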

How do we use word embeddings for document classification?

sum, average, weighted sum, max
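A minimal NumPy sketch of those four pooling options (the embeddings and weights here are random placeholders):

```python
import numpy as np

X = np.random.default_rng(0).normal(size=(4, 5))   # 4 word embeddings of size 5

doc_sum = X.sum(axis=0)                  # sum
doc_avg = X.mean(axis=0)                 # average
weights = np.array([0.1, 0.4, 0.4, 0.1]) # e.g. from attention or tf-idf
doc_weighted = weights @ X               # weighted sum
doc_max = X.max(axis=0)                  # element-wise max over the words
```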

Attention

Judging which words are important and giving them weight accordingly.

  • Let's incorporate structure (and parameters) into a network that captures which elements in the input we should be attending to (and which we can ignore).

In other words, assigning weights that express which inputs x the model judges to be more important.

  • Define v to be a vector to be learned; think of it as an "important word" vector. The dot product r_i = x_i · v measures how similar each input vector x_i is to that "important word" vector.

  • Convert r into a vector of normalized weights that sum to 1 (see the sketch after this list):

a = softmax(r)

  • Lots of variations on attention:

    • Linear transformation of x before taking the dot product with v (this part is still unclear to me)

    • Non-linearities after each operation

    • "Multi-head attention": multiple v vectors to capture different phenomena that can be attended to in the input.

    • Hierarchical attention (sentence representation with attention over words + document representation with attention over sentences)

    -> can take both the word level and the sentence level into account at the same time

  • Attention gives us a normalized weight for every token in a sequence that tells us how important that word was for the prediction

  • This can be useful for visualization
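A minimal NumPy sketch of this kind of dot-product attention pooling; `attention_pool` and the "important word" vector v are illustrative names, and in a real model v would be learned along with the rest of the network:

```python
import numpy as np

def softmax(r):
    e = np.exp(r - r.max())
    return e / e.sum()

def attention_pool(X, v):
    """X: (seq_len, d) input vectors; v: (d,) learned 'important word' vector."""
    r = X @ v              # similarity of each input vector to v
    a = softmax(r)         # normalized weights that sum to 1
    return a @ X, a        # weighted sum of the inputs, plus the weights themselves

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))              # five token embeddings of size 8
v = rng.normal(size=8)                   # stand-in for a learned vector
doc_vec, weights = attention_pool(X, v)
print(weights.round(3), weights.sum())   # the weights sum to 1 and can be visualized
```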

RNN

  • With an RNN, we can generate a representation of the sequence as seen through time t (it learns from the information that came before each time step).

  • This encodes a representation of meaning specific to the local context a word is used in.

  • We can then swap that RNN time step output for the embeddings we used earlier

What about the future context?

Bidirectional RNN

  • A powerful alternative is to make predictions conditioned on both the past and the future.

  • Two RNNs

    • One running left-to-right

    • One right-to-left

  • Each produces an output vector at each time step, which we concatenate (forward RNN + backward RNN)

  • The forward RNN and backward RNN each output a vector of size H at each time step, which we concatenate into a vector of size 2H (see the sketch after this list).

  • The forward and backward RNN each have separate parameters to be learned during training.
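A minimal sketch, assuming PyTorch, showing the 2H-sized concatenated outputs of a bidirectional RNN (here an LSTM):

```python
import torch
import torch.nn as nn

H = 16
birnn = nn.LSTM(input_size=8, hidden_size=H, bidirectional=True, batch_first=True)

x = torch.randn(1, 5, 8)   # 1 sentence, 5 tokens, embedding size 8
out, _ = birnn(x)
print(out.shape)           # torch.Size([1, 5, 32]): forward H + backward H = 2H per step
```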

Training BiRNNs

  • Given this definition of a BiRNN:

  • We have 8 sets of parameters to learn (3 for each RNN: input weights, recurrent weights, and a bias term; plus 2 for the final layer: weights and a bias term)

- How exactly to configure the RNN is up to you.

Stacked RNN

  • Multiple RNNs, where the output of one layer becomes the input to the next.
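A minimal PyTorch sketch of stacking, with the second RNN reading the per-time-step outputs of the first:

```python
import torch
import torch.nn as nn

rnn1 = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
rnn2 = nn.LSTM(input_size=16, hidden_size=16, batch_first=True)

x = torch.randn(1, 5, 8)   # 1 sentence, 5 tokens, embedding size 8
h1, _ = rnn1(x)            # (1, 5, 16): output sequence of layer 1
h2, _ = rnn2(h1)           # (1, 5, 16): layer 2 takes layer 1's outputs as input
```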

Contextualized embeddings

  • Models for learning static embeddings learn a single representation for a word type.

Types and tokens

  • Type: bears

  • Tokens:

    • The bears ate the honey

    • We spotted bears from the highway

    • Yosemite has brown bears

    • The Chicago Bears didn't make the playoffs

Here the last "bears" refers to an American football team made up of people (the team's mascot), so a single static vector shouldn't be used for all of these tokens.

Contextualized word representations

  • Big idea: transform the representation of a token in a sentence (e.g. from a static word embedding) to be sensitive to its local context in a sentence and trainable to be optimized for a specific NLP task.
  1. Small models

ex. Word2Vec: only the word itself gets an embedding; its context is used as a bag of words, with no attention to order.

  2. Large models

Emerged around 2018-2019.

ex. ELMo, BERT, GPT: trained on encyclopedia-scale text (even including its historical background) and learn a word's context, flow, order, and linguistic structure (structural properties of the language, as well as properties of specific domains).

  • ELMo: Stacked BiRNN trained to predict the next word in a language modeling task (Peters et al. 2018)

  • BERT (Google): Transformer-based model trained to 1) predict masked words using bidirectional context and 2) do next sentence prediction (i.e., it learns to fill in words and to match sentences) (Devlin et al. 2019)

ELMo

  • Peters et al. (2018), "Deep Contextualized Word Representations" (NAACL: the North American Chapter of the Association for Computational Linguistics)

  • Big idea: transform the representation of a word (e.g. from a static word embedding) to be sensitive to its local context in a sentence and optimized for a specific NLP task.

  • Output = word representations that can be plugged into just about any architecture where a word embedding can be used.

  • Train a bidirectional RNN language model with L layers on a bunch of text.

  • Learn parameters to combine the RNN output across all layers for each word in a sentence for a specific task (NER, semantic role labeling, question answering, etc.). Large improvements over SOTA for lots of NLP problems.
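That layer combination can be sketched as follows (a toy NumPy version of the ELMo weighting with made-up sizes; s and gamma stand for the task-specific parameters that get learned):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

L, H = 2, 16                                                # 2 biLM layers, hidden size 16
layers = np.random.default_rng(0).normal(size=(L + 1, H))   # per-layer outputs for one token
s = softmax(np.zeros(L + 1))                                # learned mixing weights (uniform at init)
gamma = 1.0                                                 # learned scalar
elmo_vec = gamma * (s[:, None] * layers).sum(axis=0)        # task-specific combination, shape (16,)
```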

BERT

  • Learn the parameters of this model with two objectives:

    • Masked language modeling

    • Next sentence prediction

Masked LM

  • Mask one word from the input and try to predict that word as the output

  • More powerful than an RNN LM (or even a BiRNN LM) since it can reason about context on both sides of the word being predicted.

  • A BiRNN models context on both sides, but each RNN only has access to information from one direction
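A quick sketch of masked-word prediction using the Hugging Face transformers fill-mask pipeline (assuming the library is installed and the model can be downloaded; the sentence is just an example):

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("The dog [MASK] at the stranger."):
    # each prediction has the filled-in token and its probability
    print(pred["token_str"], round(pred["score"], 3))
```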

Next sentence prediction

  • For a pair of sentences, predict from the [CLS] representation whether they appeared sequentially in the training data:
  • [CLS] The dog bark ##ed [SEP] He was hungry (a real next sentence)
  • [CLS] The dog bark ##ed [SEP] Paris is in France (a random sentence)
  • Deep layers (12 for BERT base, 24 for BERT large)

  • Large representation sizes (768 per layer for BERT base, 1024 for BERT large)

  • Pretrained on English Wikipedia (2.5B words) and BooksCorpus (800M words)
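For reference, a small sketch (again assuming the transformers library) of how a sentence pair is packed into one input for next sentence prediction:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tok("The dog barked", "He was hungry")
print(tok.convert_ids_to_tokens(enc["input_ids"]))
# e.g. ['[CLS]', 'the', 'dog', 'bark', '##ed', '[SEP]', 'he', 'was', 'hungry', '[SEP]']
print(enc["token_type_ids"])   # 0s for sentence A, 1s for sentence B
```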

Summary

  • Word embeddings can be substituted for one-hot encodings in many models (MLP, CNN, RNN, logistic regression).

  • Subword embeddings allow you to create embeddings for words not present in the training data; they require much less data to train.

  • Attention gives us a mechanism to learn which parts of a sequence to pay more attention to when forming a representation of it.

  • BiLSTMs (bidirectional long short-term memory networks) can transform word embeddings to be sensitive to their use in context.

Supplementary note: Recurrent Neural Networks (RNNs) and Long Short-Term Memory models (LSTMs), a type of RNN

The biggest advantage of RNNs is that the network can take inputs and outputs regardless of sequence length, so the architecture can be arranged in many different and flexible ways as needed.

  • Static word embeddings (Word2Vec, GloVe) provide representations of word types; contextualized word representations (ELMo, BERT) provide representations of tokens in context.
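To make the last point concrete, here is a hedged sketch (assuming the transformers library, the bert-base-uncased checkpoint, and that "bears" remains a single WordPiece in its vocabulary) showing that the same word type gets different token vectors in different contexts:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def bears_vector(sentence):
    """Contextual vector of the token 'bears' in the given sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    idx = tokens.index("bears")                      # assumes "bears" is one token
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, seq_len, 768)
    return hidden[0, idx]

v1 = bears_vector("we spotted bears from the highway")
v2 = bears_vector("yosemite has brown bears")
v3 = bears_vector("the chicago bears didn't make the playoffs")

cos = torch.nn.functional.cosine_similarity
print(cos(v1, v2, dim=0))   # animal vs. animal: expected to be higher
print(cos(v1, v3, dim=0))   # animal vs. football team: expected to be lower
```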