[Text Mining][Week 4-1] Language Model

Language Model

  • We can use multiclass logistic regression for language modeling by treating the vocabulary as the output space

Y = V (the output space is the vocabulary)

P(Y=y|X=x, β): the probability of the next word y given the previous words x
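A minimal sketch of this setup, assuming a toy vocabulary, a made-up feature count, and random (untrained) parameters: each word in V gets a score from the context features x, and a softmax turns the scores into P(Y=y|X=x, β).

```python
import numpy as np

# Multiclass logistic regression over the vocabulary: score every candidate
# next word from the context features x, then normalize with softmax.
vocab = ["the", "senate", "trial", "president", "</s>"]  # toy vocabulary (made up)
num_features = 4                                          # made-up feature count

rng = np.random.default_rng(0)
W = rng.normal(size=(num_features, len(vocab)))           # one weight column per word y in V
b = np.zeros(len(vocab))                                  # bias terms

def next_word_distribution(x):
    """P(Y = y | X = x, beta) for every word y in the vocabulary."""
    scores = x @ W + b
    z = np.exp(scores - scores.max())                     # numerically stable softmax
    return z / z.sum()

x = rng.normal(size=num_features)                         # features of the preceding words
print(dict(zip(vocab, next_word_distribution(x).round(3))))
```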

Unigram LM

  • A unigram language model here would have just one feature: a bias term
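In that degenerate case the fitted softmax over the bias alone reproduces each word's relative frequency, so a unigram LM can be read directly off corpus counts. A tiny sketch with a made-up corpus:

```python
from collections import Counter

# With only a bias term (no context features), the probabilities reduce to
# relative frequencies: P(w) = count(w) / total tokens.
corpus = "the senate opens the trial of the former president".split()  # toy corpus
counts = Counter(corpus)
total = sum(counts.values())
unigram = {w: c / total for w, c in counts.items()}
print(unigram["the"])  # 3 / 9 = 0.333...
```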

Richer representations (how to represent x: one-hot encoding or binary encoding)

  • Log-linear models give us the flexibility of encoding richer representations of the context we are conditioning on.

  • We can reason about any observations from the entire history and not just the local context.

Examples

The United States Senate opens its second impeachment trial of former President Donald J. __________

Feature classes & examples (sketched in code after the list below)

ngrams

gappy ngrams (frequently co-occurring words, with gaps allowed in between)

spelling, capitalization

class/gazetteer membership
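A sketch of how features over the entire history might be encoded for a log-linear LM; the feature names, the gazetteer, and the helper function are hypothetical illustrations, not a fixed feature set.

```python
# Hypothetical feature function: binary features of the (history, candidate
# next word) pair, one or two per feature class listed above.
PRESIDENT_GAZETTEER = {"washington", "lincoln", "obama", "trump", "biden"}  # made up

def features(history, candidate):
    feats = {}
    # ngrams: local context immediately before the candidate
    feats[f"bigram={history[-1]}_{candidate}"] = 1
    feats[f"trigram={history[-2]}_{history[-1]}_{candidate}"] = 1
    # gappy ngrams: a word seen anywhere earlier, paired with the candidate
    if "impeachment" in history:
        feats[f"gappy=impeachment...{candidate}"] = 1
    # spelling / capitalization
    feats["candidate_capitalized"] = int(candidate[:1].isupper())
    # class / gazetteer membership
    feats["candidate_in_president_gazetteer"] = int(candidate.lower() in PRESIDENT_GAZETTEER)
    return feats

history = ("The United States Senate opens its second impeachment trial "
           "of former President Donald J.").split()
print(features(history, "Trump"))
```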

Neural LM

Simple feed-forward multilayer perceptron (e.g. one hidden layer)

input x = vector concatenation of a conditioning context of fixed size k

x = [v(w1); ...; v(wk)]

one-hot encoding -> distributed representation

h = g(x*W1 + b1), where g is a nonlinearity (e.g. tanh)

y = softmax(h*W2 + b2)

softmax: a function that produces a probability distribution over all output classes, summing to 1
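A minimal sketch of this feed-forward LM, assuming a PyTorch implementation; the vocabulary size, context size k, and layer dimensions are made-up hyperparameters.

```python
import torch
import torch.nn as nn

V, k, emb_dim, hidden_dim = 10_000, 3, 64, 128            # made-up hyperparameters

class FeedForwardLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(V, emb_dim)             # one-hot -> distributed representation
        self.W1 = nn.Linear(k * emb_dim, hidden_dim)      # h = g(x*W1 + b1)
        self.W2 = nn.Linear(hidden_dim, V)                # y = softmax(h*W2 + b2)

    def forward(self, context_ids):                       # context_ids: (batch, k)
        x = self.embed(context_ids).flatten(1)            # x = [v(w1); ...; v(wk)]
        h = torch.tanh(self.W1(x))
        return torch.log_softmax(self.W2(h), dim=-1)      # log P(next word | context)

model = FeedForwardLM()
context = torch.randint(0, V, (2, k))                     # a batch of two contexts
print(model(context).shape)                               # torch.Size([2, 10000])
```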

Recurrent neural network

(for sequential data)

  • RNNs allow arbitrarily-sized conditioning contexts; they condition on the entire sequence history.

  • Captures the sequential structure well; the previous history is applied (learned) according to its importance.

  • An output can be predicted directly at each time step.

  • Each time step has two inputs:

    1. x(i) (the observation at time step i): a one-hot vector, a feature vector, or a distributed representation (a vector of real values, e.g. Word2Vec).

    2. s(i-1) (the state output by the previous time step); base case: s(0) = the zero vector

    s(i): a vector that carries all of the information learned from the history up to time i (and is passed on to step i+1)

  • s(i) = R(x(i), s(i-1)) : R computes the output state as a function of the current input and previous state

y(i) = O(s(i)) : O computes the output as a function of the current state

"Simple" RNN

Different weight matrices W transform the previous state and the current input before combining them: s(i) = g(s(i-1)*Ws + x(i)*Wx + b)

g = tanh or ReLU

RNN LM

  • The output state s(i) is an H-dimensional real vector; we can turn it into a probability distribution over the vocabulary by passing it through an additional linear transformation followed by a softmax (sketched below)
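A from-scratch sketch of these equations, with made-up dimensions and random, untrained parameters: the simple RNN recurrence followed by the linear-plus-softmax output at each step.

```python
import numpy as np

V, emb_dim, H = 1000, 32, 64                     # made-up sizes
rng = np.random.default_rng(0)
E  = rng.normal(scale=0.1, size=(V, emb_dim))    # word embeddings
Wx = rng.normal(scale=0.1, size=(emb_dim, H))    # transforms the current input x(i)
Ws = rng.normal(scale=0.1, size=(H, H))          # transforms the previous state s(i-1)
b  = np.zeros(H)
W2 = rng.normal(scale=0.1, size=(H, V))          # additional linear transformation
b2 = np.zeros(V)

def rnn_lm(word_ids):
    s = np.zeros(H)                              # base case: s(0) = the zero vector
    dists = []
    for i in word_ids:
        x = E[i]                                 # x(i): distributed representation
        s = np.tanh(x @ Wx + s @ Ws + b)         # s(i) = R(x(i), s(i-1)), g = tanh
        scores = s @ W2 + b2
        z = np.exp(scores - scores.max())
        dists.append(z / z.sum())                # softmax -> P(next word | history so far)
    return dists

probs = rnn_lm([5, 42, 7])
print(len(probs), probs[0].shape)                # 3 (1000,)
```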

Training RNNs

  • At each time step, we make a prediction and incur a loss; we know the true y (the word we see in that position)

  • Training here is standard backpropagation, taking the derivative of the loss we incur at step t with respect to the parameters we want to update
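A hedged sketch of that training step, assuming PyTorch: the model predicts at every position, the cross-entropy loss against the true next word is accumulated over the steps, and backpropagation (through time) supplies the derivatives. All sizes and the toy sequence are made up.

```python
import torch
import torch.nn as nn

V, emb_dim, H = 1000, 32, 64                     # made-up sizes
embed = nn.Embedding(V, emb_dim)
rnn = nn.RNN(emb_dim, H, batch_first=True)       # the "simple" tanh RNN
out = nn.Linear(H, V)
params = list(embed.parameters()) + list(rnn.parameters()) + list(out.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, V, (1, 21))            # a toy training sequence
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # the true y is the next word

states, _ = rnn(embed(inputs))                   # states: (1, 20, H)
logits = out(states)                             # a prediction at every time step
loss = loss_fn(logits.reshape(-1, V), targets.reshape(-1))
loss.backward()                                  # derivatives of the per-step losses
optimizer.step()
print(float(loss))
```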

Generation

  • As we sample, the words we generate form the new context we condition on

(Chatbots can be implemented with RNNs.)
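A sketch of such a sampling loop with an untrained toy RNN LM (all sizes are placeholders): each sampled word is appended to the context and conditioned on at the next step.

```python
import torch
import torch.nn as nn

V, emb_dim, H = 1000, 32, 64                          # made-up sizes
embed = nn.Embedding(V, emb_dim)
rnn = nn.RNN(emb_dim, H, batch_first=True)
out = nn.Linear(H, V)

def generate(start_id, max_len=20):
    context = [start_id]
    for _ in range(max_len):
        states, _ = rnn(embed(torch.tensor([context])))
        probs = out(states[0, -1]).softmax(-1)        # P(next word | generated context)
        next_id = torch.multinomial(probs, 1).item()  # sample the next word...
        context.append(next_id)                       # ...and condition on it next step
    return context

print(generate(start_id=1))
```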

Conditioned generation

  • In a basic RNN, the input at each timestep is a representation of the word at that position

  • But we can also condition on any arbitrary context (topic, author, date, metadata, dialect, etc.): the current input x(i) can be modified to incorporate this information

  • What information could you condition on in order to predict the next word?

  • Each state i encodes the information seen up to time i, and its structure is optimized to predict the next word
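One common way to implement this, sketched here with hypothetical sizes and a made-up topic label: embed the metadata and concatenate it with the word input x(i) at every time step.

```python
import torch
import torch.nn as nn

V, n_topics, emb_dim, topic_dim, H = 1000, 10, 32, 8, 64  # made-up sizes
word_embed  = nn.Embedding(V, emb_dim)
topic_embed = nn.Embedding(n_topics, topic_dim)
rnn = nn.RNN(emb_dim + topic_dim, H, batch_first=True)    # input is [x(i); topic vector]
out = nn.Linear(H, V)

words = torch.randint(0, V, (1, 20))
topic = torch.tensor([3])                                 # one metadata label per sequence
x = word_embed(words)                                     # (1, 20, emb_dim)
c = topic_embed(topic).unsqueeze(1).expand(-1, 20, -1)    # repeat the topic at every step
states, _ = rnn(torch.cat([x, c], dim=-1))
logits = out(states)                                      # P(next word | history, topic)
print(logits.shape)                                       # torch.Size([1, 20, 1000])
```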