[Text Mining][Week 4, Part 1] Language Model - mingoori0512/minggori GitHub Wiki
Language Model
- We can use multiclass logistic regression for language modeling by treating the vocabulary as the output space
Y = V (the output space is the vocabulary)
P(Y=y | X=x, β): the probability that word y appears next, given the preceding words x
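For reference, the standard multiclass logistic regression (softmax) form behind this setup, written here with per-class weights β_y over features of the context x (this notation is an assumption, not taken from the original notes):

```math
P(Y = y \mid X = x; \beta) = \frac{\exp(x \cdot \beta_y)}{\sum_{y' \in \mathcal{V}} \exp(x \cdot \beta_{y'})}
```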
Unigram LM
- A unigram language model here would have just one feature: a bias term
Richer representations (how should we represent x? e.g. one-hot encoding or binary encoding)
- Log-linear models give us the flexibility of encoding richer representations of the context we are conditioning on.
- We can reason about any observations from the entire history, not just the local context.
Examples
The United States Senate opens its second impeachment trial of former President Donald J. __________
Feature classes & examples
- ngrams
- gappy ngrams (frequently co-occurring words, allowing gaps between them)
- spelling, capitalization
- class/gazetteer membership
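A minimal sketch of a feature-based (log-linear) language model over hand-crafted features like these, assuming scikit-learn; the feature templates, toy corpus, and the function name `extract_features` are illustrative, not from the notes:

```python
# Minimal sketch of a log-linear LM: multiclass logistic regression over
# hand-crafted features of the conditioning context.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def extract_features(history):
    """Map the conditioning context (all previous words) to a feature dict."""
    feats = {"bias": 1.0}
    if len(history) >= 1:
        feats["prev_word=" + history[-1]] = 1.0
        feats["prev_is_capitalized"] = float(history[-1][:1].isupper())
    if len(history) >= 2:
        feats["prev_bigram=" + history[-2] + "_" + history[-1]] = 1.0
    return feats

# Toy training data: predict each word from its full history.
corpus = [["the", "united", "states", "senate", "opens", "its", "trial"]]
X_dicts, y = [], []
for sent in corpus:
    for i, word in enumerate(sent):
        X_dicts.append(extract_features(sent[:i]))
        y.append(word)

vec = DictVectorizer()
X = vec.fit_transform(X_dicts)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# P(next word | history) as a distribution over the (toy) vocabulary.
probs = clf.predict_proba(vec.transform([extract_features(["the", "united"])]))
print(dict(zip(clf.classes_, probs[0].round(3))))
```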
Neural LM
Simple feed-forward multilayer perceptron (e.g. one hidden layer)
input x = vector concatenation of a conditioning context of fixed size k
x = [V(w1); ...; V(wk)]
one-hot encoding -> distributed representation
h = g(xW1 + b1) (the hidden layer)
y = softmax(hW2 + b2)
softmax: a function that turns the scores for all output classes into a probability distribution that sums to 1
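A minimal sketch of such a feed-forward neural LM in PyTorch, assuming a fixed context of k words; the class name `FeedForwardLM` and all hyperparameters are illustrative choices, not from the notes:

```python
# Minimal sketch of a feed-forward neural LM (one hidden layer) in PyTorch.
import torch
import torch.nn as nn

class FeedForwardLM(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128, k=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # one-hot -> distributed representation
        self.W1 = nn.Linear(k * embed_dim, hidden_dim)     # h = g(x W1 + b1)
        self.W2 = nn.Linear(hidden_dim, vocab_size)        # logits = h W2 + b2

    def forward(self, context_ids):
        # context_ids: (batch, k) integer word ids for the fixed-size context
        x = self.embed(context_ids).flatten(1)             # x = [V(w1); ...; V(wk)]
        h = torch.tanh(self.W1(x))
        return torch.log_softmax(self.W2(h), dim=-1)       # log P(next word | context)

model = FeedForwardLM(vocab_size=10000)
log_probs = model(torch.randint(0, 10000, (2, 3)))         # batch of 2 contexts, k=3
print(log_probs.shape)                                     # torch.Size([2, 10000])
```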
Recurrent neural network (sequential data)
- RNNs allow arbitrarily-sized conditioning contexts; they condition on the entire sequence history.
- They capture the ordering of the sequence well, applying (learning) the previous history according to its importance.
- Each output can be predicted directly at its time step.
- Each time step has two inputs:
  - x(i) (the observation at time step i): a one-hot vector, a feature vector, or a distributed representation (a vector of real numbers, e.g. from Word2Vec)
  - s(i-1) (the output of the previous state); base case: s(0) = the 0 vector
  - the state vector carries all of the information learned from the sequence up to that point
- s(i) = R(x(i), s(i-1)): R computes the output state as a function of the current input and previous state
- y(i) = O(s(i)): O computes the output as a function of the current state
"Simple" RNN
Different weight matrices W transform the previous state and current input before combining:
s(i) = g(s(i-1) W_s + x(i) W_x + b), where g = tanh or ReLU
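A minimal numpy sketch of this recurrence; the dimensions, random initialization, and the names `W_s`, `W_x`, `rnn_step` are illustrative assumptions:

```python
# Minimal numpy sketch of the "simple" (Elman) RNN recurrence:
# s(i) = g(s(i-1) W_s + x(i) W_x + b), with g = tanh.
import numpy as np

H, D = 16, 8                       # state size and input size, chosen arbitrarily
rng = np.random.default_rng(0)
W_s = rng.normal(scale=0.1, size=(H, H))   # transforms the previous state
W_x = rng.normal(scale=0.1, size=(D, H))   # transforms the current input
b = np.zeros(H)

def rnn_step(x_i, s_prev):
    """One time step: combine previous state and current input, then apply g = tanh."""
    return np.tanh(s_prev @ W_s + x_i @ W_x + b)

s = np.zeros(H)                      # base case: s(0) = 0 vector
for x_i in rng.normal(size=(5, D)):  # a toy sequence of 5 input vectors
    s = rnn_step(x_i, s)
print(s.shape)                       # (16,)
```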
RNN LM
- The output state s(i) is an H-dimensional real vector; we can transform that into a probability by passing it through an additional linear transformation followed by a softmax
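A small numpy sketch of that final step, projecting an H-dimensional state to a distribution over the vocabulary; the matrix `W_o`, the sizes, and the function name are assumptions:

```python
# Project an H-dimensional RNN state to a probability distribution over the vocabulary.
import numpy as np

H, V = 16, 1000                    # state size and vocabulary size (illustrative)
rng = np.random.default_rng(0)
W_o = rng.normal(scale=0.1, size=(H, V))   # additional linear transformation
b_o = np.zeros(V)

def next_word_distribution(s_i):
    """softmax(s(i) W_o + b_o): probabilities over the vocabulary that sum to 1."""
    logits = s_i @ W_o + b_o
    logits -= logits.max()                 # subtract the max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

p = next_word_distribution(rng.normal(size=H))
print(p.shape, p.sum())                    # (1000,) 1.0
```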
Training RNNs
- At each time step, we make a prediction and incur a loss; we know the true y (the word we see in that position).
- Training here is standard backpropagation, taking the derivative of the loss we incur at step t with respect to the parameters we want to update.
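A hedged training sketch using PyTorch's built-in nn.RNN: a prediction and a cross-entropy loss at every time step, then backpropagation. The sizes, optimizer choice, and toy data are assumptions, not from the notes:

```python
# Sketch of training an RNN LM: predict the next word at every time step and
# sum the cross-entropy losses, then backpropagate. Sizes and data are toy values.
import torch
import torch.nn as nn

V, E, H = 1000, 32, 64                       # vocab, embedding, hidden sizes (illustrative)
embed = nn.Embedding(V, E)
rnn = nn.RNN(E, H, batch_first=True)         # simple (tanh) RNN
out = nn.Linear(H, V)                        # state -> vocabulary logits
params = list(embed.parameters()) + list(rnn.parameters()) + list(out.parameters())
opt = torch.optim.SGD(params, lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# Toy batch: predict tokens[:, 1:] (the true y) from tokens[:, :-1] (the context).
tokens = torch.randint(0, V, (4, 20))
inputs, targets = tokens[:, :-1], tokens[:, 1:]

states, _ = rnn(embed(inputs))               # states: (batch, time, H)
logits = out(states)                         # a prediction at every time step
loss = loss_fn(logits.reshape(-1, V), targets.reshape(-1))
loss.backward()                              # standard backpropagation (through time)
opt.step()
print(loss.item())
```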
Generation
- As we sample, the words we generate form the new context we condition on
(an RNN can be used to build a chatbot this way)
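A minimal sampling sketch under the same assumed PyTorch components; the start-token id, sequence length, and sizes are illustrative:

```python
# Sketch of sampling from an RNN LM one word at a time: each sampled word becomes
# part of the context (via the state) for the next step.
import torch
import torch.nn as nn

V, E, H = 1000, 32, 64
embed, rnn, out = nn.Embedding(V, E), nn.RNN(E, H, batch_first=True), nn.Linear(H, V)

word = torch.tensor([[0]])                   # assume id 0 is a start-of-sentence token
state = torch.zeros(1, 1, H)                 # base case: s(0) = 0 vector
generated = []
for _ in range(10):
    s, state = rnn(embed(word), state)       # condition on the full history via the state
    probs = torch.softmax(out(s[:, -1]), dim=-1)
    word = torch.multinomial(probs, num_samples=1)   # sample the next word
    generated.append(word.item())
print(generated)
```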
Conditioned generation
- In a basic RNN, the input at each timestep is a representation of the word at that position.
- But we can also condition on any arbitrary context (topic, author, date, metadata, dialect, etc.): the current input x(i) can be modified to carry this extra information, as sketched below.
- What information could you condition on in order to predict the next word?
- Each state i encodes information seen until time i, and its structure is optimized to predict the next word.
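One common way to realize this, sketched below, is to concatenate a learned context embedding to the word vector at every time step; the context vocabulary, sizes, and names here are assumptions, not from the notes:

```python
# Sketch of conditioned generation: concatenate a fixed context vector (e.g. an
# embedding of topic, author, or dialect) to the word representation at every step.
import torch
import torch.nn as nn

V, E, C, H = 1000, 32, 8, 64                 # vocab, word-embed, context, hidden sizes
embed = nn.Embedding(V, E)
context_embed = nn.Embedding(5, C)           # e.g. 5 possible topics/authors
rnn = nn.RNN(E + C, H, batch_first=True)     # input is [word vector ; context vector]
out = nn.Linear(H, V)

tokens = torch.randint(0, V, (1, 12))        # a toy word-id sequence
ctx = context_embed(torch.tensor([2]))       # condition on, say, topic id 2
ctx = ctx.unsqueeze(1).expand(-1, tokens.size(1), -1)   # repeat the context at each step

x = torch.cat([embed(tokens), ctx], dim=-1)  # the modified input x(i) at every position
states, _ = rnn(x)
logits = out(states)                         # next-word predictions, now context-aware
print(logits.shape)                          # torch.Size([1, 12, 1000])
```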