[Text Mining][Week 5-2] Embeddings

Embeddings

  1. Static embeddings

word2vec (skip-gram): one and the same embedding for a word, regardless of context

  2. Contextualized word embeddings

ELMo, BERT (language models): different embeddings for the same word depending on the sentence context
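
To make the contrast concrete, here is a minimal sketch, assuming the HuggingFace `transformers` and `torch` packages and the `bert-base-uncased` checkpoint: the model's input embedding table plays the role of a static lookup (one row per word type), while the final hidden states change with the surrounding sentence.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

for sent in ["The bears ate the honey",
             "The Chicago Bears didn't make the playoffs"]:
    inputs = tokenizer(sent, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state      # (1, seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    i = tokens.index("bears")   # uncased vocab folds "Bears" into "bears";
                                # assumes "bears" survives as one WordPiece
    static_vec = model.embeddings.word_embeddings.weight[inputs["input_ids"][0, i]]
    contextual_vec = hidden[0, i]
    print(sent)
    print("  static    :", static_vec[:4].tolist())     # identical across sentences
    print("  contextual:", contextual_vec[:4].tolist()) # differs with context
```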

Contextualized embeddings (question to ask)

  • Models for learning static embeddings learn a single representation for a word type.

Types and tokens

  • Type: bears

  • Tokens:

    • The bears ate the honey

    • We spotted the bears from the highway

    • Yosemite has brown bears

  • The Chicago Bears didn't make the playoffs
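
A quick worked example of the distinction, counting with a rough whitespace split (lowercasing folds "Bears" into the same type):

```python
# A *type* is the vocabulary entry "bears"; a *token* is each occurrence in text.
sentences = [
    "The bears ate the honey",
    "We spotted the bears from the highway",
    "Yosemite has brown bears",
    "The Chicago Bears didn't make the playoffs",
]
tokens = [w.lower() for s in sentences for w in s.split()]
print(len(tokens), "tokens,", len(set(tokens)), "types")
print("tokens of the type 'bears':", tokens.count("bears"))  # 4
```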

Contextualized word representations

  • Big idea: transform the representation of a token in a sentence (e.g., from a static word embedding) to be sensitive to its local context in the sentence, and trainable to be optimized for a specific NLP task.
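
The "trainable" half of that idea amounts to stacking a task-specific head on top of the contextual token vectors and optimizing everything end to end; a minimal PyTorch sketch, where all shapes and the random encoder output are hypothetical stand-ins:

```python
import torch
import torch.nn as nn

hidden_size, num_labels = 768, 5              # assumed encoder width and tag-set size
contextual = torch.randn(1, 10, hidden_size)  # stand-in for an encoder's token vectors
head = nn.Linear(hidden_size, num_labels)     # task head, trained jointly with the encoder
logits = head(contextual)                     # (1, 10, num_labels): per-token task scores
print(logits.shape)
```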

BERT

  • [CLS]: classification token: a single embedding that compresses the whole input

  • [SEP]: marks the boundary between sentences

  • Certain words are split into subwords (WordPiece): the model recognizes both the base form and inflected forms such as the past tense
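
A minimal sketch of those three points, assuming HuggingFace `transformers` and the `bert-base-uncased` checkpoint: the tokenizer prepends [CLS], separates and terminates the sentence pair with [SEP], and WordPiece splits a rare word into subword pieces (the exact split depends on the vocabulary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Sentence pair: [CLS] opens the input, [SEP] separates and closes the pair.
encoded = tokenizer("The bears ate the honey", "We spotted the bears")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# -> ['[CLS]', 'the', 'bears', ..., '[SEP]', 'we', ..., '[SEP]']

# WordPiece: '##' marks a piece that continues the previous one, so the stem
# is shared between base and inflected forms.
print(tokenizer.tokenize("embeddings"))  # e.g. ['em', '##bed', '##ding', '##s']
```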