[Text Mining][Week 3, Part 2] Language Model - mingoori0512/minggori GitHub Wiki

Language Model

  • A vocabulary V is a finite set of discrete symbols (e.g., words, characters); V = |V| denotes the vocabulary size

  • V+ is the infinite set of sequences of symbols from V; each sequence ends with STOP

  • x is an element of V+

  • P(w) = P(w1,...,wn)

    P("Call me Ishmael") = P(w1="Call", w2="me", w3="Ishmael") × P(STOP)

  • Language models give us a way to quantify the likelihood of a sequence, i.e., how plausible a sentence is (whether it makes sense or not)

  • Applications: OCR (Optical Character Recognition); Machine Translation (judged on 1. fidelity to the source text and 2. fluency of the translation, i.e., how natural it reads to speakers of the target language); Query Auto-Completion; Speech Recognition; Dialogue Generation

Information-theoretic view

Language Model

  • Language modeling is the task of estimating P(w)

  • Why is this hard? The set of possible sequences is unbounded, so most sentences will never appear in any training corpus.

Markov Assumption

The Markov assumption: the next state depends only on a fixed number of preceding states, not on the entire history.

How many previous words should we look at?

  1. bigram model (first-order Markov): predict the next word from the previous one word

  2. trigram model (second-order Markov): predict the next word from the previous two words
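As a sketch of the bigram (first-order Markov) case, the conditional probabilities P(word | previous word) can be estimated from corpus counts. The toy corpus, the `<s>`/`STOP` markers, and the function name below are illustrative assumptions, not part of the lecture notes:

```python
from collections import Counter

# Toy corpus; <s> and STOP boundary markers are an assumption for illustration.
corpus = [["call", "me", "ishmael"], ["call", "me", "maybe"]]

bigrams = Counter()
context_counts = Counter()
for sent in corpus:
    tokens = ["<s>"] + sent + ["STOP"]
    for prev, cur in zip(tokens, tokens[1:]):
        bigrams[(prev, cur)] += 1
        context_counts[prev] += 1

def p_bigram(cur, prev):
    """MLE estimate: P(cur | prev) = count(prev, cur) / count(prev)."""
    return bigrams[(prev, cur)] / context_counts[prev] if context_counts[prev] else 0.0

print(p_bigram("me", "call"))      # 1.0: "call" is always followed by "me"
print(p_bigram("ishmael", "me"))   # 0.5: "me" is followed by "ishmael" half the time
```

A trigram model would be the same construction with pairs of previous words as the context key.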

Generating

  • What we learn when estimating a language model is P(word | context), where the context (at least here) is the previous n-1 words (for an n-gram model of order n)

  • We have one multinomial distribution over the vocabulary (including STOP) for each context

    (Successive die rolls of 1-6 are independent, but the words in a language model are not treated as independent of one another.)

  • As we sample, the words we generate form the new context we condition on (assigning a probability to each word is exactly what the language model does)
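The sampling loop above can be sketched as follows; the bigram probability table and its numbers are made up purely for illustration:

```python
import random

# Hypothetical bigram distributions P(word | context); the numbers are invented.
bigram_probs = {
    "<s>":     {"call": 0.6, "see": 0.4},
    "call":    {"me": 1.0},
    "see":     {"me": 1.0},
    "me":      {"ishmael": 0.5, "STOP": 0.5},
    "ishmael": {"STOP": 1.0},
}

def generate(max_len=10, seed=None):
    """Sample a sentence: each sampled word becomes the next context."""
    rng = random.Random(seed)
    context, out = "<s>", []
    for _ in range(max_len):
        dist = bigram_probs[context]
        word = rng.choices(list(dist), weights=list(dist.values()))[0]
        if word == "STOP":
            break
        out.append(word)
        context = word  # the generated word is the new context we condition on
    return out

print(generate(seed=0))
```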

Evaluation

  • The best evaluation metrics are external: how does a better language model influence the application you care about?

  • Speech recognition(word error rate), machine translation(BLEU score), topic models(sensemaking)

  • A good language model should judge unseen real language to have high probability

  • Perplexity = the inverse probability of the test data, averaged per word: PP(w) = P(w1, ..., wN)^(-1/N)

  • To be reliable, the test data must be truly unseen(including knowledge of its vocabulary).

Experiment design

Training (train the models)

Development (model selection: hyperparameter tuning)

Testing (evaluation; never look at it until the very end)

Perplexity

One way to evaluate a language model.

Lower perplexity (roughly, how confused the model is by the data) means better model performance.

Typical ordering of perplexity: Unigram > Bigram > Trigram (higher-order models usually achieve lower perplexity).

Smoothing

  • When estimating a language model, we're relying on the data we've observed in a training corpus.

  • Training data is a small(and biased) sample of the creativity of language.

Data Sparsity (many word combinations will never co-occur in any training corpus)

  • As in Naive Bayes, P(wi) = 0 causes P(w) = 0 (and makes perplexity infinite).

Smoothing in Naive Bayes

One solution: add a little probability mass to every element.

Additive smoothing

  • Laplace smoothing(alpha=1)

  • Maximum Likelihood Estimation vs. smoothing with alpha = 1

  • Smoothing is the re-allocation of probability mass

  • How can we best re-allocate probability mass?
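Additive (add-alpha) smoothing, sketched for unigram counts; the toy counts and vocabulary are assumptions for illustration:

```python
from collections import Counter

def additive_smoothed(counts, vocab, alpha=1.0):
    """Add-alpha smoothing (Laplace when alpha=1):
    P(w) = (count(w) + alpha) / (N + alpha * |V|)."""
    total = sum(counts.values())
    V = len(vocab)
    return {w: (counts[w] + alpha) / (total + alpha * V) for w in vocab}

counts = Counter({"the": 3, "cat": 1})  # "sat" never observed in training
probs = additive_smoothed(counts, {"the", "cat", "sat"}, alpha=1.0)
print(probs["sat"])  # the unseen word now gets (0 + 1) / (4 + 3) = 1/7, not 0
```

Mass is taken from frequent words and re-allocated to rare and unseen ones, which is exactly the re-allocation described above.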

Interpolation (constructing new estimates within the range of known data points)

  • As ngram order rises, we have the potential for higher precision but also higher variability in our estimates.

  • A linear interpolation of any two language models p and q is also a valid language model.

  • We can use this fact to make higher-order language models more robust.

  • How do we pick the best values of lambda?

    • Grid search over development corpus

    • Expectation-Maximization algorithm (treat the lambdas as missing parameters, estimated to maximize the probability of the data we see)
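The points above can be sketched as a linear interpolation of a unigram and a bigram estimate, with a grid search for lambda on a development set. All the probability values, the toy dev set, and the function names here are invented for illustration:

```python
import math

# Hypothetical toy estimates; the numbers are invented.
def p_uni(w):
    return {"me": 0.2, "ishmael": 0.1}.get(w, 0.05)

def p_bi(w, prev):
    return {("call", "me"): 0.9, ("me", "ishmael"): 0.4}.get((prev, w), 0.0)

def p_interp(w, prev, lam):
    """lam * bigram + (1 - lam) * unigram: a convex combination of two
    valid distributions is itself a valid distribution."""
    return lam * p_bi(w, prev) + (1 - lam) * p_uni(w)

# Grid search: pick the lambda that maximizes dev-set log-likelihood.
dev = [("call", "me"), ("me", "ishmael")]
best = max((l / 10 for l in range(10)),
           key=lambda lam: sum(math.log(p_interp(w, prev, lam))
                               for prev, w in dev))
print(best)
```

Note that the unigram term keeps the mixture nonzero even for unseen bigrams, which is what makes the higher-order model more robust.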