[Text Mining][Week 4-1] Language Model

Language Model

  • We can use multiclass logistic regression for language modeling by treating the vocabulary as the output space

Y = V (the output space is the vocabulary)

P(Y=y|X=x, β): the probability of the next word y given the previous words x
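A minimal sketch of this setup, assuming a toy vocabulary, a made-up feature count, and random (untrained) parameters: each word in V gets a score from the context features x, and a softmax turns the scores into P(Y=y|X=x, β).

```python
import numpy as np

# Multiclass logistic regression over the vocabulary: score every candidate
# next word from the context features x, then normalize with softmax.
vocab = ["the", "senate", "trial", "president", "</s>"]  # toy vocabulary (made up)
num_features = 4                                          # made-up feature count

rng = np.random.default_rng(0)
W = rng.normal(size=(num_features, len(vocab)))           # one weight column per word y in V
b = np.zeros(len(vocab))                                  # bias terms

def next_word_distribution(x):
    """P(Y = y | X = x, beta) for every word y in the vocabulary."""
    scores = x @ W + b
    z = np.exp(scores - scores.max())                     # numerically stable softmax
    return z / z.sum()

x = rng.normal(size=num_features)                         # features of the preceding words
print(dict(zip(vocab, next_word_distribution(x).round(3))))
```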

Unigram LM

  • A unigram language model here would have just one feature: a bias term
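In that degenerate case the fitted softmax over the bias alone reproduces each word's relative frequency, so a unigram LM can be read directly off corpus counts. A tiny sketch with a made-up corpus:

```python
from collections import Counter

# With only a bias term (no context features), the probabilities reduce to
# relative frequencies: P(w) = count(w) / total tokens.
corpus = "the senate opens the trial of the former president".split()  # toy corpus
counts = Counter(corpus)
total = sum(counts.values())
unigram = {w: c / total for w, c in counts.items()}
print(unigram["the"])  # 3 / 9 = 0.333...
```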

Richer representations (how to represent x: one-hot encoding or binary encoding)

  • Log-linear models give us the flexibility of encoding richer representations of the context we are conditioning on.

  • We can reason about any observations from the entire history and not just the local context.

Examples

The United States Senate opens its second impeachment trial of former President Donald J. __________

Feature classes & examples (sketched in code after the list below)

ngrams

gappy ngrams (frequently co-occurring words, with gaps allowed in between)

spelling, capitalization

class/gazetteer membership
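A sketch of how features over the entire history might be encoded for a log-linear LM; the feature names, the gazetteer, and the helper function are hypothetical illustrations, not a fixed feature set.

```python
# Hypothetical feature function: binary features of the (history, candidate
# next word) pair, one or two per feature class listed above.
PRESIDENT_GAZETTEER = {"washington", "lincoln", "obama", "trump", "biden"}  # made up

def features(history, candidate):
    feats = {}
    # ngrams: local context immediately before the candidate
    feats[f"bigram={history[-1]}_{candidate}"] = 1
    feats[f"trigram={history[-2]}_{history[-1]}_{candidate}"] = 1
    # gappy ngrams: a word seen anywhere earlier, paired with the candidate
    if "impeachment" in history:
        feats[f"gappy=impeachment...{candidate}"] = 1
    # spelling / capitalization
    feats["candidate_capitalized"] = int(candidate[:1].isupper())
    # class / gazetteer membership
    feats["candidate_in_president_gazetteer"] = int(candidate.lower() in PRESIDENT_GAZETTEER)
    return feats

history = ("The United States Senate opens its second impeachment trial "
           "of former President Donald J.").split()
print(features(history, "Trump"))
```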

Neural LM

Simple feed-forward multilayer perceptron (e.g. one hidden layer)

input x = vector concatenation of a conditioning context of fixed size k

x = [v(w1); ...; v(wk)]

one-hot encoding -> distributed representation

h = g(x*W1 + b1), where g is a nonlinearity (e.g. tanh)

y = softmax(h*W2 + b2)

softmax: a function that produces a probability distribution over all output classes, summing to 1
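A minimal sketch of this feed-forward LM, assuming a PyTorch implementation; the vocabulary size, context size k, and layer dimensions are made-up hyperparameters.

```python
import torch
import torch.nn as nn

V, k, emb_dim, hidden_dim = 10_000, 3, 64, 128            # made-up hyperparameters

class FeedForwardLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(V, emb_dim)             # one-hot -> distributed representation
        self.W1 = nn.Linear(k * emb_dim, hidden_dim)      # h = g(x*W1 + b1)
        self.W2 = nn.Linear(hidden_dim, V)                # y = softmax(h*W2 + b2)

    def forward(self, context_ids):                       # context_ids: (batch, k)
        x = self.embed(context_ids).flatten(1)            # x = [v(w1); ...; v(wk)]
        h = torch.tanh(self.W1(x))
        return torch.log_softmax(self.W2(h), dim=-1)      # log P(next word | context)

model = FeedForwardLM()
context = torch.randint(0, V, (2, k))                     # a batch of two contexts
print(model(context).shape)                               # torch.Size([2, 10000])
```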

Recurrent neural network

(for sequential data)

  • RNNs allow arbitrarily-sized conditioning contexts; they condition on the entire sequence history.

  • Captures the sequential structure well; the previous history is applied (learned) according to its importance.

  • An output can be predicted directly at each time step.

  • Each time step has two inputs:

    1. x(i) (the observation at time step i): a one-hot vector, a feature vector, or a distributed representation (a vector of real values, e.g. Word2Vec).

    2. s(i-1) (the state output by the previous time step); base case: s(0) = the zero vector

    s(i): a vector that carries all of the information learned from the history up to time i (and is passed on to step i+1)

  • s(i) = R(x(i), s(i-1)) : R computes the output state as a function of the current input and previous state

y(i) = O(s(i)) : O computes the output as a function of the current state

"Simple" RNN

Different weight matrices W transform the previous state and the current input before combining them: s(i) = g(s(i-1)*Ws + x(i)*Wx + b)

g = tanh or ReLU

RNN LM

  • The output state s(i) is an H-dimensional real vector; we can turn it into a probability distribution over the vocabulary by passing it through an additional linear transformation followed by a softmax (sketched below)
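A from-scratch sketch of these equations, with made-up dimensions and random, untrained parameters: the simple RNN recurrence followed by the linear-plus-softmax output at each step.

```python
import numpy as np

V, emb_dim, H = 1000, 32, 64                     # made-up sizes
rng = np.random.default_rng(0)
E  = rng.normal(scale=0.1, size=(V, emb_dim))    # word embeddings
Wx = rng.normal(scale=0.1, size=(emb_dim, H))    # transforms the current input x(i)
Ws = rng.normal(scale=0.1, size=(H, H))          # transforms the previous state s(i-1)
b  = np.zeros(H)
W2 = rng.normal(scale=0.1, size=(H, V))          # additional linear transformation
b2 = np.zeros(V)

def rnn_lm(word_ids):
    s = np.zeros(H)                              # base case: s(0) = the zero vector
    dists = []
    for i in word_ids:
        x = E[i]                                 # x(i): distributed representation
        s = np.tanh(x @ Wx + s @ Ws + b)         # s(i) = R(x(i), s(i-1)), g = tanh
        scores = s @ W2 + b2
        z = np.exp(scores - scores.max())
        dists.append(z / z.sum())                # softmax -> P(next word | history so far)
    return dists

probs = rnn_lm([5, 42, 7])
print(len(probs), probs[0].shape)                # 3 (1000,)
```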

Training RNNs

  • At each time step, we make a prediction and incur a loss; we know the true y (the word we see in that position)

  • Training here is standard backpropagation, taking the derivative of the loss we incur at step t with respect to the parameters we want to update
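A hedged sketch of that training step, assuming PyTorch: the model predicts at every position, the cross-entropy loss against the true next word is accumulated over the steps, and backpropagation (through time) supplies the derivatives. All sizes and the toy sequence are made up.

```python
import torch
import torch.nn as nn

V, emb_dim, H = 1000, 32, 64                     # made-up sizes
embed = nn.Embedding(V, emb_dim)
rnn = nn.RNN(emb_dim, H, batch_first=True)       # the "simple" tanh RNN
out = nn.Linear(H, V)
params = list(embed.parameters()) + list(rnn.parameters()) + list(out.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, V, (1, 21))            # a toy training sequence
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # the true y is the next word

states, _ = rnn(embed(inputs))                   # states: (1, 20, H)
logits = out(states)                             # a prediction at every time step
loss = loss_fn(logits.reshape(-1, V), targets.reshape(-1))
loss.backward()                                  # derivatives of the per-step losses
optimizer.step()
print(float(loss))
```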

Generation

  • As we sample, the words we generate form the new context we condition on

(Chatbots can be implemented with RNNs.)
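A sketch of such a sampling loop with an untrained toy RNN LM (all sizes are placeholders): each sampled word is appended to the context and conditioned on at the next step.

```python
import torch
import torch.nn as nn

V, emb_dim, H = 1000, 32, 64                          # made-up sizes
embed = nn.Embedding(V, emb_dim)
rnn = nn.RNN(emb_dim, H, batch_first=True)
out = nn.Linear(H, V)

def generate(start_id, max_len=20):
    context = [start_id]
    for _ in range(max_len):
        states, _ = rnn(embed(torch.tensor([context])))
        probs = out(states[0, -1]).softmax(-1)        # P(next word | generated context)
        next_id = torch.multinomial(probs, 1).item()  # sample the next word...
        context.append(next_id)                       # ...and condition on it next step
    return context

print(generate(start_id=1))
```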

Conditioned generation

  • In a basic RNN, the input at each timestep is a representation of the word at that position

  • But we can also condition on any arbitrary context (topic, author, date, metadata, dialect, etc.): the current input x(i) can be modified to incorporate this information

  • What information could you condition on in order to predict the next word?

  • Each state i encodes the information seen up to time i, and its structure is optimized to predict the next word
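One common way to implement this, sketched here with hypothetical sizes and a made-up topic label: embed the metadata and concatenate it with the word input x(i) at every time step.

```python
import torch
import torch.nn as nn

V, n_topics, emb_dim, topic_dim, H = 1000, 10, 32, 8, 64  # made-up sizes
word_embed  = nn.Embedding(V, emb_dim)
topic_embed = nn.Embedding(n_topics, topic_dim)
rnn = nn.RNN(emb_dim + topic_dim, H, batch_first=True)    # input is [x(i); topic vector]
out = nn.Linear(H, V)

words = torch.randint(0, V, (1, 20))
topic = torch.tensor([3])                                 # one metadata label per sequence
x = word_embed(words)                                     # (1, 20, emb_dim)
c = topic_embed(topic).unsqueeze(1).expand(-1, 20, -1)    # repeat the topic at every step
states, _ = rnn(torch.cat([x, c], dim=-1))
logits = out(states)                                      # P(next word | history, topic)
print(logits.shape)                                       # torch.Size([1, 20, 1000])
```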