[Text Mining][Week 3, Part 2] Language Model - mingoori0512/minggori GitHub Wiki

Language Model

  • A vocabulary V is a finite set of discrete symbols (e.g., words, characters); V = |V| denotes the vocabulary size

  • V+ is the infinite set of sequences of symbols from V; each sequence ends with STOP

  • x is an element of V+

  • P(w) = P(w1,...,wn)

    P("Call me Ishmael") = P(w1="Call", w2="me", w3="Ishmael") × P(STOP)

  • Language models give us a way to quantify the likelihood of a sequence, i.e., how plausible a sentence is (whether it makes sense or not)

  • Applications: OCR (Optical Character Recognition); Machine Translation (judged on 1. fidelity to the source text and 2. fluency of the translation, i.e., how natural it reads to speakers of the target language); Query Auto-Completion; Speech Recognition; Dialogue Generation

Information-theoretic view

Language Model

  • Language modeling is the task of estimating P(w)

  • Why is this hard? The set of possible sequences is unbounded, so most sentences will never appear in any training corpus.

Markov Assumption

The Markov assumption: the next state depends only on a fixed number of preceding states, not on the entire history.

How many previous words should we look at?

  1. bigram model (first-order Markov): predict the next word from the previous one word

  2. trigram model (second-order Markov): predict the next word from the previous two words
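As a sketch of the bigram (first-order Markov) case, the conditional probabilities P(word | previous word) can be estimated from corpus counts. The toy corpus, the `<s>`/`STOP` markers, and the function name below are illustrative assumptions, not part of the lecture notes:

```python
from collections import Counter

# Toy corpus; <s> and STOP boundary markers are an assumption for illustration.
corpus = [["call", "me", "ishmael"], ["call", "me", "maybe"]]

bigrams = Counter()
context_counts = Counter()
for sent in corpus:
    tokens = ["<s>"] + sent + ["STOP"]
    for prev, cur in zip(tokens, tokens[1:]):
        bigrams[(prev, cur)] += 1
        context_counts[prev] += 1

def p_bigram(cur, prev):
    """MLE estimate: P(cur | prev) = count(prev, cur) / count(prev)."""
    return bigrams[(prev, cur)] / context_counts[prev] if context_counts[prev] else 0.0

print(p_bigram("me", "call"))      # 1.0: "call" is always followed by "me"
print(p_bigram("ishmael", "me"))   # 0.5: "me" is followed by "ishmael" half the time
```

A trigram model would be the same construction with pairs of previous words as the context key.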

Generating

  • What we learn when estimating a language model is P(word | context), where the context (at least here) is the previous n-1 words (for an n-gram model of order n)

  • We have one multinomial distribution over the vocabulary (including STOP) for each context

    (Successive die rolls of 1-6 are independent, but the words in a language model are not treated as independent of one another.)

  • As we sample, the words we generate form the new context we condition on (assigning a probability to each word is exactly what the language model does)
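The sampling loop above can be sketched as follows; the bigram probability table and its numbers are made up purely for illustration:

```python
import random

# Hypothetical bigram distributions P(word | context); the numbers are invented.
bigram_probs = {
    "<s>":     {"call": 0.6, "see": 0.4},
    "call":    {"me": 1.0},
    "see":     {"me": 1.0},
    "me":      {"ishmael": 0.5, "STOP": 0.5},
    "ishmael": {"STOP": 1.0},
}

def generate(max_len=10, seed=None):
    """Sample a sentence: each sampled word becomes the next context."""
    rng = random.Random(seed)
    context, out = "<s>", []
    for _ in range(max_len):
        dist = bigram_probs[context]
        word = rng.choices(list(dist), weights=list(dist.values()))[0]
        if word == "STOP":
            break
        out.append(word)
        context = word  # the generated word is the new context we condition on
    return out

print(generate(seed=0))
```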

Evaluation

  • The best evaluation metrics are external: how does a better language model influence the application you care about?

  • Speech recognition(word error rate), machine translation(BLEU score), topic models(sensemaking)

  • A good language model should judge unseen real language to have high probability

  • Perplexity = the inverse probability of the test data, averaged per word: PP(w) = P(w1, ..., wN)^(-1/N)

  • To be reliable, the test data must be truly unseen(including knowledge of its vocabulary).

Experiment design

Training (train the models)

Development (model selection: hyperparameter tuning)

Testing (evaluation; never look at it until the very end)

Perplexity

One way to evaluate a language model.

Lower perplexity (roughly, how confused the model is by the data) means better model performance.

Typical ordering of perplexity: Unigram > Bigram > Trigram (higher-order models usually achieve lower perplexity).

Smoothing

  • When estimating a language model, we're relying on the data we've observed in a training corpus.

  • Training data is a small(and biased) sample of the creativity of language.

Data Sparsity (many word combinations will never co-occur in any training corpus)

  • As in Naive Bayes, P(wi) = 0 causes P(w) = 0 (and makes perplexity infinite).

Smoothing in Naive Bayes

One solution: add a little probability mass to every element.

Additive smoothing

  • Laplace smoothing(alpha=1)

  • Maximum Likelihood Estimation vs. smoothing with alpha = 1

  • Smoothing is the re-allocation of probability mass

  • How can we best re-allocate probability mass?
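Additive (add-alpha) smoothing, sketched for unigram counts; the toy counts and vocabulary are assumptions for illustration:

```python
from collections import Counter

def additive_smoothed(counts, vocab, alpha=1.0):
    """Add-alpha smoothing (Laplace when alpha=1):
    P(w) = (count(w) + alpha) / (N + alpha * |V|)."""
    total = sum(counts.values())
    V = len(vocab)
    return {w: (counts[w] + alpha) / (total + alpha * V) for w in vocab}

counts = Counter({"the": 3, "cat": 1})  # "sat" never observed in training
probs = additive_smoothed(counts, {"the", "cat", "sat"}, alpha=1.0)
print(probs["sat"])  # the unseen word now gets (0 + 1) / (4 + 3) = 1/7, not 0
```

Mass is taken from frequent words and re-allocated to rare and unseen ones, which is exactly the re-allocation described above.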

Interpolation (constructing new estimates within the range of known data points)

  • As ngram order rises, we have the potential for higher precision but also higher variability in our estimates.

  • A linear interpolation of any two language models p and q is also a valid language model.

  • We can use this fact to make higher-order language models more robust.

  • How do we pick the best values of lambda?

    • Grid search over development corpus

    • Expectation-Maximization algorithm (treat the lambdas as missing parameters, estimated to maximize the probability of the data we see)
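The points above can be sketched as a linear interpolation of a unigram and a bigram estimate, with a grid search for lambda on a development set. All the probability values, the toy dev set, and the function names here are invented for illustration:

```python
import math

# Hypothetical toy estimates; the numbers are invented.
def p_uni(w):
    return {"me": 0.2, "ishmael": 0.1}.get(w, 0.05)

def p_bi(w, prev):
    return {("call", "me"): 0.9, ("me", "ishmael"): 0.4}.get((prev, w), 0.0)

def p_interp(w, prev, lam):
    """lam * bigram + (1 - lam) * unigram: a convex combination of two
    valid distributions is itself a valid distribution."""
    return lam * p_bi(w, prev) + (1 - lam) * p_uni(w)

# Grid search: pick the lambda that maximizes dev-set log-likelihood.
dev = [("call", "me"), ("me", "ishmael")]
best = max((l / 10 for l in range(10)),
           key=lambda lam: sum(math.log(p_interp(w, prev, lam))
                               for prev, w in dev))
print(best)
```

Note that the unigram term keeps the mixture nonzero even for unseen bigrams, which is what makes the higher-order model more robust.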