Natural Language Processing - newlife-js/Wiki GitHub Wiki

NLP(Natural Language Processing)

by Prof. Seung-won Hwang, Seoul National University

Language Modeling

N-gram model

Estimates the probability of a word given the preceding N-1 words (count-based)

Works well when the training corpus and the test corpus are similar (in everyday use they often are not)

smoothing

Generalizes by taking probability mass away from frequently observed words and redistributing it to unobserved ones

Add-one estimation

Adds 1 to every count, including the counts of words that were never observed, and then renormalizes

A constant k can be used in place of 1 (Add-k estimation)
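A minimal sketch of a count-based bigram model with add-k smoothing, to make the N-gram and smoothing notes above concrete. The function names (`train_bigram`, `bigram_prob`), the `<s>`/`</s>` padding symbols, and the toy corpus are assumptions made for illustration, not part of the lecture.

```python
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams over tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]  # hypothetical boundary symbols
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def bigram_prob(prev, word, unigrams, bigrams, k=1.0):
    """P(word | prev) with add-k smoothing; k=1 is add-one estimation."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + k) / (unigrams[prev] + k * vocab_size)

corpus = [["i", "like", "tea"], ["i", "like", "coffee"]]
unigrams, bigrams = train_bigram(corpus)
p_seen = bigram_prob("like", "tea", unigrams, bigrams)   # observed bigram
p_unseen = bigram_prob("like", "i", unigrams, bigrams)   # never observed, still > 0
```

With add-one smoothing the unseen bigram gets a small non-zero probability, and the probabilities of all words following "like" still sum to 1.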

※ A count-based LM does not account for words with similar meanings or for inflected variants of the same word

Word Embedding

Defines each word as a vector, embedding (mapping) it into a vector space
Similar words are mapped to nearby positions

Word2Vec

Builds vectors from the probability that word w appears near a given word
What is learned is the hidden-layer weights that map an input vector to a probability distribution over target words

■ Training methods

Skip-gram: predicts the surrounding words from the center word

CBOW(Continuous Bag of Words): predicts the center word from the surrounding words

Bag of Words model

The vector representation ignores word order
Negation is not handled properly

Term Frequency(tf)

Number of times term t occurs in document d

Document Frequency

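Number of documents in which a given term appears
A word that appears in every document (a, the, etc.) carries little importance
※ idf(inverse document frequency) weight: the inverse is taken so that the weight reflects importance [log(N/df)], where N is the number of documents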
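The tf and idf definitions above can be combined into a tf-idf weight. The sketch below assumes the plain product tf * log(N/df) with no extra smoothing or normalization; the function name `tf_idf` and the toy documents are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, weight = tf * log(N / df)."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # a document counts a term at most once
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
weights = tf_idf(docs)
```

"the" appears in every document, so its weight is log(3/3) = 0, while "dog", which appears in only one document, gets the largest weight.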

Bayes' Rule

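c: class(positive/negative), d: document
P(c|d): hard to estimate directly, since d is a document that has not been seen before
P(c) can be estimated from the ratio of positive to negative examples in the training set
P(d|c) can be estimated from the training corpus (each word has a probability of appearing given the class, and the product of these probabilities is taken as the probability of the document)
By Bayes' rule, P(c|d) is proportional to P(d|c)P(c), so the class with the largest product is chosen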
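A small sketch of the classifier these notes describe (usually called Naive Bayes): P(c) comes from class ratios, P(d|c) from a product of per-word probabilities. Working in log space and applying add-one smoothing to the word probabilities are choices made here, and the names `train_nb` and `classify` and the toy training set are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    """labeled_docs: list of (tokens, label). Returns priors, word counts, vocab."""
    class_counts = Counter(label for _, label in labeled_docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    priors = {c: n / len(labeled_docs) for c, n in class_counts.items()}
    return priors, word_counts, vocab

def classify(tokens, priors, word_counts, vocab):
    """argmax_c [log P(c) + sum_w log P(w|c)], add-one smoothing on P(w|c)."""
    scores = {}
    for c, prior in priors.items():
        total = sum(word_counts[c].values())
        score = math.log(prior)
        for w in tokens:
            if w in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)

train = [(["great", "fun", "movie"], "positive"),
         (["boring", "bad", "movie"], "negative"),
         (["great", "acting"], "positive")]
model = train_nb(train)
label = classify(["great", "movie"], *model)
```

Summing log probabilities instead of multiplying raw probabilities avoids numerical underflow on long documents.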

Language model evaluation

precision / recall: accuracy alone does not give a reliable evaluation, so precision and recall are used alongside it
perplexity: how confused the model is when predicting; the lower it is, the better the model
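As a rough illustration of the perplexity item, the sketch below uses the common definition: the exponential of the average negative log-probability the model assigns to each token. The function name and the probability lists are made up.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability per token."""
    avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log)

confident = perplexity([0.5, 0.5, 0.5, 0.5])   # model assigns 0.5 to each true token
uncertain = perplexity([0.1, 0.1, 0.1, 0.1])   # model assigns only 0.1
```

A model that gives every true token probability 0.5 is as "confused" as a uniform choice among 2 options. At 0.1 it is as confused as a choice among 10.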

backoff

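Even in an LM that uses N-grams, a shorter n-gram can be used instead when it gives a better estimate (for example, when the longer N-gram was rarely or never observed)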

interpolation

Uses a weighted sum of several N-gram estimates
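Backoff and interpolation can be contrasted in a few lines. This is a simplified sketch: the weights `lambdas` are arbitrary placeholders (in practice they are tuned on held-out data), and the backoff shown here just falls through to the next non-zero estimate without the discounting that a properly normalized backoff model needs.

```python
def interpolate(p_uni, p_bi, p_tri, lambdas=(0.2, 0.3, 0.5)):
    """Weighted sum of unigram, bigram, and trigram estimates."""
    l1, l2, l3 = lambdas
    if abs(l1 + l2 + l3 - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

def backoff(p_uni, p_bi, p_tri):
    """Use the longest n-gram estimate that is non-zero (no discounting)."""
    for p in (p_tri, p_bi, p_uni):
        if p > 0:
            return p
    return 0.0

# The trigram was never observed, so its estimate is 0.
p_interp = interpolate(0.01, 0.1, 0.0)
p_backoff = backoff(0.01, 0.1, 0.0)
```

Interpolation always mixes all orders, while backoff consults a shorter n-gram only when the longer one has no evidence.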

Word Meaning

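There have been efforts to model the meaning of words itself (synonyms, near-synonyms, antonyms, relatedness)
-> WordNet: an ontology that organizes the relations between words
synonym / antonym / hypernym / hyponym
pos(part of speech): the grammatical category of a word
lemma: the base form of a word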

Deep Learning for Natural Language Processing

by Prof. Kyomin Jung, Seoul National University

Recurrent Neural Network(RNN)

In a feed-forward neural network such as a CNN, information flows in one direction only (directed acyclic graph structure)
There is no notion of time
Introducing recurrence (directed cycles) adds the notions of time and memory

Bidirectional RNN(BRNN)

An RNN model that depends on both past and future states
(e.g. for problems such as finding a missing word, it helps to look at both the preceding and the following context)

Vanishing gradients

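The influence of earlier inputs shrinks as time passes, because the gradient gets smaller at every step it is propagated back through time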

Long Short-Term Memory(LSTM)

A way to preserve gradient information

LSTM Block(3 gates)

input / forget / output gate

Applications

Machine translation
Text sequence generation
Speech recognition
Question answering
Image to text
Semantic analysis

Seq2Seq Encoder-Decoder

Encoder: word sequence -> sentence representation(real-valued vector)
Decoder: representation -> word sequence distribution

Attention Model

The decoder decides which vectors in the source vector sequence to attend to
Context Vector(c_i): attention-weighted sum (∑_j α_ij*h_j)
The attention weight α is the softmax of the alignment score function (the probability that the target word and the source word are aligned)
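The context vector computation can be sketched without any library. A plain dot product stands in for the alignment score function here, which is an assumption (the notes do not say which score function the lecture used), and the vectors are toy values.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def context_vector(decoder_state, encoder_states):
    """alpha_ij = softmax_j(score(s_i, h_j)); c_i = sum_j alpha_ij * h_j."""
    scores = [dot(decoder_state, h) for h in encoder_states]  # assumed score function
    alphas = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(a * h[d] for a, h in zip(alphas, encoder_states))
               for d in range(dim)]
    return alphas, context

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # h_1..h_3
alphas, context = context_vector([2.0, 0.0], encoder_states)
```

The first and third source vectors score equally against this decoder state, so they receive equal weight, and the second receives less.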

Self-Attention

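Computes how much each word in a sentence attends to every other word in the same sentence (context-sensitive encodings)
Unlike an RNN, which has to pass through all the words in between, it computes the relation between distant words directly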

Evaluation Metrics

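BLEU score: measures the similarity of two sentences by the number of matching n-grams (n consecutive words)
Perplexity: measures how closely the learned probability distribution over words matches the distribution of the input text
METEOR: BLEU plus handling of synonyms, singular/plural differences, tense differences, etc. when measuring similarity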
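The n-gram matching behind BLEU can be sketched as a clipped ("modified") n-gram precision. This is only one ingredient: full BLEU also combines several n-gram orders with a geometric mean and applies a brevity penalty, both omitted here. The function names and example sentences are made up.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Each candidate n-gram counts at most as often as it occurs in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    overlap = sum(min(count, ref_counts[g]) for g, count in cand_counts.items())
    total = sum(cand_counts.values())
    return overlap / total if total else 0.0

reference = "the cat is on the mat".split()
candidate = "the cat sat on the mat".split()
p1 = modified_precision(candidate, reference, 1)  # 5 of 6 unigrams match
p2 = modified_precision(candidate, reference, 2)  # 3 of 5 bigrams match
```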

Word Embedding

A representation that maps words to real-valued vectors

Word2Vec

Dense representation
A word embedding method that places words with similar contexts close together in the vector space
Builds a window (context) around the target word (center word) and processes the text with a sliding window

CBOW(Continuous Bag-of-Words): predicts the center word given the context
Trains faster and is more accurate for frequent words
Skip-gram: given the center word, predicts a softmax probability vector over context words
Works better for infrequent words
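The difference between the two objectives shows up in how training examples are cut from the sliding window. The sketch below only generates the examples and does not train any vectors. The function names and the window size are illustrative assumptions.

```python
def skipgram_pairs(tokens, window=2):
    """One (center, context) pair per context word inside the window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """One (context words, center) example per position."""
    examples = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        if context:
            examples.append((context, center))
    return examples

tokens = "the quick brown fox".split()
pairs = skipgram_pairs(tokens, window=1)
examples = cbow_examples(tokens, window=1)
```

Skip-gram produces a separate training pair for every context word, so a rare center word still contributes several updates, which fits the note that it works better for infrequent words.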

W: word vector look-up table(word vectors)
※ GloVe: a word embedding that applies co-occurrence frequencies (counted over the whole corpus) to the Word2Vec idea

Subword Tokenization

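BPE(Byte Pair Encoding): subword tokenization used by models such as GPT-2 and RoBERTa (BERT itself uses WordPiece)
Addresses the unknown-word (out-of-vocabulary) problem by splitting words into meaningful patterns (subwords) (lowest -> low, est)
Starting from characters, repeatedly counts adjacent symbol pairs (bi-grams), merges the most frequent pair, and adds it to the vocab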

Languages that share an alphabet can share embeddings (this helps when training low-resource languages)
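The BPE merge loop can be sketched in plain Python. This is a toy version under stated assumptions: words are given with their frequencies, there is no end-of-word marker, and ties between equally frequent pairs are broken by first occurrence. The function names and the word list are hypothetical.

```python
from collections import Counter

def pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in vocab.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(pair, vocab):
    """Rewrite every word, replacing each occurrence of `pair` by one merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

def learn_bpe(word_freqs, num_merges):
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        counts = pair_counts(vocab)
        if not counts:
            break
        best = max(counts, key=counts.get)  # most frequent pair
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges, vocab

word_freqs = {"low": 5, "lower": 2, "lowest": 3}
merges, vocab = learn_bpe(word_freqs, num_merges=2)
```

After two merges "low" has become a single symbol, while "lowest" is still split as low + e + s + t. Further merges would go on to build larger subwords from the remaining pieces.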

WordPiece: replaces the BPE merge criterion (frequency) with the merge that increases corpus likelihood the most
SentencePiece: performs subword tokenization for languages where pre-tokenization is difficult

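Sentiment Analysis with a CNN

Convolution is performed over the embedding matrix
The filter does not slide along the x axis (embedding dimension);
it slides only along the y axis (the time axis of the words). Using filters of different heights along the y axis has the effect of using bigrams, trigrams, and 4-grams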
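A pure-Python sketch of the convolution described above: each filter covers the full embedding width and slides only along the time axis, and its height decides how many consecutive words it sees. The max-over-time pooling step, the filter values, and the 2-dimensional toy embeddings are assumptions added for illustration.

```python
def conv_over_time(embeddings, kernel):
    """Slide an (h x d) filter down the time axis; it spans the full width d."""
    h = len(kernel)
    features = []
    for start in range(len(embeddings) - h + 1):
        window = embeddings[start:start + h]
        features.append(sum(w * x
                            for k_row, e_row in zip(kernel, window)
                            for w, x in zip(k_row, e_row)))
    return features

def max_over_time(features):
    """Keep the strongest response of a filter across all positions."""
    return max(features)

sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]  # 4 words, d = 2
bigram_filter = [[1.0, 0.0], [0.0, 1.0]]                      # height 2
trigram_filter = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]         # height 3
bigram_features = conv_over_time(sentence, bigram_filter)
trigram_features = conv_over_time(sentence, trigram_filter)
pooled = [max_over_time(bigram_features), max_over_time(trigram_features)]
```

The pooled values form a fixed-length feature vector regardless of sentence length, which can then be fed to a classifier.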

History of LM

MASS(MAsked Sequence to Sequence)

BERT: AutoEncoding(masked word prediction)
GPT: AutoRegressive
Generating good sentences calls for an AutoRegressive objective
Contextual representation (understanding a sentence) is better served by an AutoEncoding objective

BART(Bidirectional and Auto-Regressive Transformers)

Bidirectional Encoder + AutoRegressive Decoder

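Trained by adding noise to the input and predicting the original input

Token masking: masks tokens and predicts them, as in BERT
Token deletion: deletes random tokens; the model has to work out the positions of the deleted tokens
Text infilling: replaces text spans of random length (Poisson-distributed) with a mask; the model has to work out the length of the replaced text
Sentence permutation: shuffles the order of the sentences; the model has to recover the original order
Document rotation: picks a random token and rotates the document so that it starts with that token -> the model has to identify the original start token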
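The five noising schemes can be sketched as small functions over token lists. This is an illustration only: the `<mask>` string, the probability `p`, and the function names are assumptions. For text infilling the span start and length are passed in explicitly rather than sampled from a Poisson distribution, since the standard library has no Poisson sampler.

```python
import random

MASK = "<mask>"  # placeholder mask symbol

def token_masking(tokens, rng, p=0.3):
    """Replace each token by the mask symbol with probability p."""
    return [MASK if rng.random() < p else t for t in tokens]

def token_deletion(tokens, rng, p=0.3):
    """Drop each token with probability p, leaving no trace of where it was."""
    return [t for t in tokens if rng.random() >= p]

def text_infilling(tokens, start, length):
    """Replace tokens[start:start+length] by a single mask (length 0 inserts one)."""
    return tokens[:start] + [MASK] + tokens[start + length:]

def sentence_permutation(sentences, rng):
    """Return the sentences in a random order."""
    shuffled = sentences[:]
    rng.shuffle(shuffled)
    return shuffled

def document_rotation(tokens, rng):
    """Rotate the document so that it begins at a randomly chosen token."""
    start = rng.randrange(len(tokens))
    return tokens[start:] + tokens[:start]

rng = random.Random(0)
tokens = ["a", "b", "c", "d", "e"]
infilled = text_infilling(tokens, 1, 3)
rotated = document_rotation(tokens, rng)
```

Because a whole span collapses into one mask in text infilling, the corrupted input no longer reveals how many tokens are missing, unlike token masking.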

TNF(Taking Notes on the Fly)

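low-frequency words get few chances to be learned, so their embeddings are of low quality
Builds a note dictionary for rare words and uses it during prediction (it supplies more accurate information about the rare words)
Select the rare words -> build a contextual vector mapped to each of them
The note dictionary information is added to the input embedding (note embedding)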
construct -> leverage -> update note dictionary

PMI-Masking(Pointwise Mutual Information)

Tokens that form a collocation (correlated consecutive words) are masked together
Collocations(correlated word n-grams): Multi-word expressions, phrases
The smaller the vocabulary, the more often sub-word tokenization splits a word (since more words are missing from the vocab)
Masking sub-word tokens that come from a poorly split word degrades model performance
PMI index for bi-gram:
Token sequences with a high PMI ranking are grouped and masked together
PMI index for n-grams:
When extending to n-grams, a single high-PMI sub-span can raise the PMI of the whole n-gram, so the minimum over the PMI of the sub-spans is taken
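The formulas themselves are not preserved in these notes. One way to write them, consistent with the description above, is the standard bigram PMI and an n-gram version that takes the minimum over segmentations. The notation seg(w_1 ... w_n), meaning the set of ways to split the n-gram into two or more contiguous segments, is an assumption introduced here.

```latex
\mathrm{PMI}(w_1 w_2) = \log \frac{p(w_1 w_2)}{p(w_1)\, p(w_2)}

\mathrm{PMI}_n(w_1 \dots w_n) =
  \min_{\sigma \in \mathrm{seg}(w_1 \dots w_n)}
  \log \frac{p(w_1 \dots w_n)}{\prod_{s \in \sigma} p(s)}
```

Taking the minimum means an n-gram ranks high only if it is strongly correlated under every way of splitting it, not just because one of its sub-spans is a frequent collocation.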

Big Bird: Transformers for Longer Sequences

Self-attention in the Transformer is computationally expensive, O(#token^2). Big Bird replaces full self-attention with a Random + Locality(Window) + Global attention graph, which brings the cost down to O(#token)
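A rough sketch of how such a sparse attention pattern can be built as a boolean mask. The parameter names and default sizes are illustrative assumptions, and the real model works on blocks of tokens rather than single positions.

```python
import random

def sparse_attention_mask(n, window=1, num_global=1, num_random=1, seed=0):
    """mask[i][j] is True when query position i may attend to key position j."""
    rng = random.Random(seed)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(max(0, i - window), min(n, i + window + 1)):
            mask[i][j] = True                  # locality (window)
        for j in rng.sample(range(n), num_random):
            mask[i][j] = True                  # random keys
    for g in range(num_global):
        for i in range(n):
            mask[g][i] = True                  # global token attends to everything
            mask[i][g] = True                  # everything attends to the global token
    return mask

mask = sparse_attention_mask(n=16)
allowed = sum(sum(row) for row in mask)
```

Each non-global row allows at most 2*window + 1 + num_random + num_global keys, a constant that does not depend on n, so the number of allowed pairs grows linearly with sequence length instead of quadratically.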

Augmented SBERT(Sentence BERT)

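Cross-Encoder: the two sentences are concatenated and fed into a single BERT as one input
Bi-Encoder: each sentence is fed into BERT separately and the cosine similarity of the outputs is computed (less computation, but lower accuracy)

-> data augmentation is introduced to close the accuracy gap

Pairs are formed by re-combining sentences from the existing labeled gold training set, labeled by the cross-encoder (BERT), sampled (several strategies exist), and then used to train the Bi-Encoder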

ALBERT(A Lite BERT)

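Mapping word input -> dense embedding takes #word * embedding dimension parameters

Parameter reduction

Reduces the number of parameters to cut computation while keeping the loss in accuracy to a minimum

A layer with a small dimension is added between the input and the embedding to reduce the number of parameters
Parameters are shared across layers (feed-forward parameters; the default ALBERT configuration also shares the attention parameters)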
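The saving from the small intermediate layer is simple arithmetic: V*H parameters become V*E + E*H. The sizes below (V = 30000, H = 768, E = 128) are illustrative values chosen for this sketch, and the function name is hypothetical.

```python
def embedding_params(vocab_size, hidden_size, embedding_size=None):
    """Parameter count of the embedding, optionally factorized through a small layer."""
    if embedding_size is None:
        return vocab_size * hidden_size                      # V * H
    return vocab_size * embedding_size + embedding_size * hidden_size  # V*E + E*H

full = embedding_params(30000, 768)
factorized = embedding_params(30000, 768, embedding_size=128)
ratio = full / factorized
```

With these sizes the embedding shrinks from about 23.0M to about 3.9M parameters, roughly a 5.9x reduction.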

Performance Boosting

Sentence Order Prediction(SOP) is a harder problem than Next Sentence Prediction(NSP)
A model trained on the SOP task is used so that a wide range of tasks is covered
add data + remove dropout

RoBERTa(Robustly optimized BERT approach)

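Increases training time, batch size, and data to improve on BERT
Removes the NSP task, since it hurts performance in some cases
Trains on longer sequences
Uses dynamic masking: a new masking pattern is generated every time a sequence is fed to the model (the paper's static baseline instead duplicated the data with 10 different masking patterns)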

Unified Language Model

A single unified model that covers several tasks (left-to-right, right-to-left, bidirectional, seq-to-seq)
The unification is done by using a different self-attention mask for each kind of model
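The four mask patterns can be sketched as boolean matrices. The function name, the mode strings, and the `source_len` parameter are assumptions made for this illustration.

```python
def attention_mask(n, mode, source_len=0):
    """mask[i][j] is True when position i may attend to position j."""
    mask = []
    for i in range(n):
        row = []
        for j in range(n):
            if mode == "bidirectional":
                ok = True
            elif mode == "left-to-right":
                ok = j <= i
            elif mode == "right-to-left":
                ok = j >= i
            elif mode == "seq-to-seq":
                # Source tokens see the whole source; target tokens see the
                # source plus the target tokens up to and including themselves.
                ok = j < source_len or (i >= source_len and j <= i)
            else:
                raise ValueError(f"unknown mode: {mode}")
            row.append(ok)
        mask.append(row)
    return mask

s2s = attention_mask(4, "seq-to-seq", source_len=2)
l2r = attention_mask(4, "left-to-right")
```

The same Transformer weights can then be trained under all four objectives simply by swapping the mask.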

T5(Text-to-Text Transfer Transformer)

Uses a single model, with the task specified in the input
The model, loss function, and hyperparameters are all shared

Recent NLP Techniques

Transformer

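Pros: easier to parallelize than an RNN (faster computation); captures the relations between words better

Cons: the input size has to be fixed in order to compute attention (in practice it is set to a maximum input size, and anything longer is discarded)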

Positional Embedding

Encoder

Multi-head attention

Feed Forward Network

Decoder

BERT(Bidirectional Encoder Representations from Transformers)

pre-training

Input representation

Fine-tuning

GPT(Generative Pre-Training)

GLUE(General Language Understanding Evaluation) Benchmark
