word2vec: Efficient Estimation of Word Representations in Vector Space

Resources

word2vec: Efficient Estimation of Word Representations in Vector Space arXiv

A more detailed description of the representations used in word2vec:
Distributed Representations of Words and Phrases and their Compositionality link

์ •๋ฆฌํ•ด ๋†“์€ ๊ธ€ + ์˜์ƒ shuuki4's wordpress
YouTube

Background

Softmax with a very large number of labels

Hierarchical Softmax

Softmax์—์„œ V๊ฐœ์˜ ๋ถ„๋ฅ˜๋ฅผ ํ•œ๋‹ค๊ณ  ํ–ˆ์„ ๋•Œ, ์ „์ฒด ํ•ฉ์„ ๊ตฌํ•ด ๋‚˜๋ˆ ์ฃผ๋Š” ๊ณผ์ •์ด ๋งค์šฐ Costlyํ•จ(V๊ฐ€ ๋งค์šฐ ํด ๋•Œ).

Hierarchical Softmax์€ ์ „์ฒด ํ•ฉ์„ ๊ตฌํ•ด์„œ ๋‚˜๋ˆ„๋Š” ๊ณผ์ • ์—†์ด๋„ ์ „์ฒด ํ•ฉ์ด ์ฒ˜์Œ๋ถ€ํ„ฐ 1์ด ๋˜๊ฒŒ ๋งŒ๋“œ๋Š” ๋ฐฉ๋ฒ•.

Softmax์˜ Computational Complexity๊ฐ€ V ์—์„œ log2(V)๋กœ ๊ฐ์†Œํ•œ๋‹ค.

Negative Sampling

Softmax์—์„œ Cross Entrophy๋ฅผ ๊ตฌํ•  ๋•Œ, 1์ด ๊ณฑํ•ด์ง€๋Š” "Positive Sample"์€ ์ค‘์š”ํ•˜์ง€๋งŒ, 0์ด ๊ณฑํ•ด์ง€๋Š” "Negative Sample"์€ ๊ทธ๋‹ค์ง€ ์ค‘์š”ํ•˜์ง€ ์•Š๋‹ค.

So instead of computing the loss over every negative sample, only K sampled negatives are used.

word2vec์—์„œ๋Š” ๊ธฐ์กด์˜ ๋ฐฉ๋ฒ•๊ณผ๋Š” ์กฐ๊ธˆ ๋‹ค๋ฅธ Loss Function์„ ์ •์˜ํ•œ๋‹ค.
์›๋ž˜ Loss Function์—๋Š” ์•„๋ž˜์—์„œ ๋งํ•˜๋Š” $\mathbb{E}_{w_i \sim P_n(w)}[\sim]$๊ฐ€ ์—†์ด, ๊ทธ๋ƒฅ Sum์„ ํ•œ๋‹ค.

The $\mathbb{E}_{w_i \sim P_n(w)}[\cdot]$ term means that each negative sample is weighted by how frequently it occurs. For example, if a very frequent word such as "taste" never appears near a given word, that is fairly meaningful information. On the other hand, if a very rare word such as "pneumoconiosis" never appears near that word, that is far less meaningful than the former. Therefore, the negative samples are summed with weights based on their overall frequency.
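
A minimal sketch of the resulting objective, assuming the skip-gram-with-negative-sampling formulation from the follow-up paper (noise distribution $P_n(w) \propto U(w)^{3/4}$, K negatives per positive pair); the variable names and toy counts are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
V, D, K = 10000, 100, 5                                 # vocab size, vector dim, negatives per positive

# Noise distribution P_n(w): unigram frequencies raised to the 3/4 power.
counts = rng.integers(1, 1000, size=V).astype(float)    # stand-in word frequencies
P_n = counts ** 0.75
P_n /= P_n.sum()

W_in  = rng.normal(scale=0.1, size=(V, D))              # "input" vectors v_w
W_out = rng.normal(scale=0.1, size=(V, D))              # "output" vectors v'_w

def neg_sampling_loss(center, context):
    """-log sigma(v'_ctx . v_center) - sum_k log sigma(-v'_neg_k . v_center),
    with the K negatives drawn from P_n instead of summing over all V-1 words."""
    h = W_in[center]
    pos = -np.log(sigmoid(W_out[context] @ h))
    negs = rng.choice(V, size=K, p=P_n)                  # frequency-weighted negative samples
    neg = -np.log(sigmoid(-(W_out[negs] @ h))).sum()
    return pos + neg

print(neg_sampling_loss(center=42, context=7))
```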

Introduction

Word Vector Representation

Cons of one-hot encoding:

  • Uses too much memory.
  • It cannot represent relationships between words.

A vector representation instead encodes words by their "meaning".
For example,

  • vector("Seoul") - vector("Korea") + vector("Japan") = vector("Tokyo")
  • average(vector("Man"), vector("Woman"), vector("Kid"), vector("Adult")) = vector("Person")

Previous Models on Word Vector Representation

  • Feedforward Neural Net Language Model(NNLM)
  • Recurrent Neural Net Language Model(RNNLM)

NNLM

In the image, the dimension of the projection layer is written as P, but in the paper it is written as D, so D is used for that dimension below.

V : size of the dictionary (vocabulary)
D : dimension of a word vector

  1. The N words preceding the target word are one-hot encoded (so the input layer is an (N × V) matrix).
  2. The input layer is projected onto a projection layer of shape (N × D), which is then flattened.
  3. Using the projection layer as input, the output layer of shape (V,) is computed through the hidden layer.
  4. Softmax, cross-entropy, back-propagation!
  5. The row vectors of the projection matrix are used as the "word vectors" (a minimal numpy sketch of this forward pass follows the list).
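
To make steps 1-5 concrete, here is a minimal forward pass in numpy using the same notation (N context words, vocabulary size V, projection dimension D, hidden size H). This is an illustrative sketch, not the original implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
V, N, D, H = 1000, 4, 50, 64                   # vocab size, context length, projection dim, hidden dim

C  = rng.normal(scale=0.1, size=(V, D))        # projection matrix: row i is the vector of word i
W1 = rng.normal(scale=0.1, size=(N * D, H))    # projection -> hidden
W2 = rng.normal(scale=0.1, size=(H, V))        # hidden -> output

def nnlm_forward(context_ids):
    """context_ids: the N previous word indices. Returns P(next word | context)."""
    # Steps 1-2: multiplying one-hot inputs by C is just a row lookup; flatten the (N, D) projection.
    x = C[context_ids].reshape(-1)             # shape (N*D,)
    # Step 3: hidden layer, then output layer of shape (V,).
    hidden = np.tanh(x @ W1)
    logits = hidden @ W2
    # Step 4: softmax (cross-entropy and backprop would follow during training).
    p = np.exp(logits - logits.max())
    return p / p.sum()

probs = nnlm_forward(np.array([3, 17, 256, 911]))
print(probs.shape, probs.sum())                # (1000,) ~1.0
```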

Computational Complexity

$Q = N \times D + N \times D \times H + H \times V $

The $H \times V$ term can be reduced to $H \times \log_2(V)$ (e.g., with hierarchical softmax), so the bottleneck term is $N \times D \times H$.

Cons

  • Too slow. Typically N=10, D=500, H=500, so the bottleneck $N \times D \times H = 10 \times 500 \times 500$ gives $Q \approx 2.5M$.

RNNLM

Computational Complexity

$Q = H \times H + H \times V $

Cons

  • Still... slow. The bottleneck term is $H \times H$; with H=500, $H \times H = 250{,}000$, so $Q \approx 250K$.

word2vec Models

  • Continuous Bag-of-Words Model(CBOW)
  • Continuous Skip-gram Model

CBOW

๋‚˜๋Š” ๋กฏ๋ฐ๋ฆฌ์•„์—์„œ ๊ฐ์žํŠ€๊น€์„ ___ ์— ์ฐ์–ด ๋จน์—ˆ๋‹ค.
___ ์— ๋“ค์–ด์˜ฌ ๋‹จ์–ด๋ฅผ ๋งž์ถ”๊ธฐ!

Target Word์˜ ์•ž๋’ค ํ•ฉ์ณ์„œ C๊ฐœ์˜ ๋‹จ์–ด๋ฅผ ํ™•์ธํ•œ๋‹ค.

$W = \begin{bmatrix} \bm{v_{w_1}} \\ \vdots \\ \bm{v_{w_V}} \end{bmatrix}$
(V, N) shape

$W' = \begin{bmatrix} \bm{v'_{w_1}} & \cdots & \bm{v'_{w_V}} \end{bmatrix}$
(N, V) shape

์—ฌ๊ธฐ์„œ, W์˜ Row vector๋“ค์ด word vector๋กœ ์‚ฌ์šฉ๋œ๋‹ค.

Computational Complexity

$Q = C \times D + D \times \log_2(V)$

C=10, D=500, V=1,000,000 -> $Q \approx 10K$

Skip-gram

A model that flips CBOW around.

Input: "ketchup"
Do "french fries", "Lotteria", "dipped" appear around it?

$W = \begin{bmatrix} \bm{v_{w_1}} \\ \vdots \\ \bm{v_{w_V}} \end{bmatrix}$
(V, N) shape

$W' = \begin{bmatrix} \bm{v'_{w_1}} & \cdots & \bm{v'_{w_V}} \end{bmatrix}$
(N, V) shape

์—ฌ๊ธฐ์„œ, W์˜ Row vector๋“ค์ด word vector๋กœ ์‚ฌ์šฉ๋œ๋‹ค.

Computational Complexity

$Q = C \times (D + D \times \log_2(V))$

C=10, D=500, V=1,000,000 -> $Q \approx 100K$

CBOW ๋ชจ๋ธ๊ณผ Skip-gram ๋ชจ๋ธ์„ ๋น„๊ตํ•˜๋ฉด CBOW์˜ ๋ชจ๋ธ์ด ์กฐ๊ธˆ ๋” ๋…ผ๋ฆฌ์ ์ด๋ผ๊ณ  ๊ฐœ์ธ์ ์œผ๋กœ ์ƒ๊ฐํ•˜์ง€๋งŒ, ์‹ค์ œ ์‹คํ—˜ ๊ฒฐ๊ณผ์—์„œ๋Š” Skip-gram์ด CBOW์— ๋น„ํ•ด ์ „์ฒด์ ์œผ๋กœ ๋‹ค์†Œ ์ข‹์€ ๊ฒฐ๊ณผ๋ฅผ ๋‚ด๋Š” ์ถ”์„ธ๋ฅผ ๋ณด์ธ๋‹ค. ํ˜„์žฌ๋Š” ๋Œ€๋ถ€๋ถ„์˜ ์‚ฌ๋žŒ๋“ค์ด Skip-gram ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ ๊ฐ™๋‹ค [3].

์ด๋ ‡๊ฒŒ ์ƒ์„ฑ๋œ Vector๋“ค์€, cosine similarity๋ฅผ ํ†ตํ•ด ์—ฐ์‚ฐ๋œ๋‹ค. ์ฆ‰์Šจ, v = vector("Seoul") - vector("Korea") + vector("Japan")์ด๋ฉด,
vector space์—์„œ v์™€ cosine distance๊ฐ€ ๊ฐ€์žฅ ์ž‘์€ vector(์•„๋งˆ๋„ vector("Tokyo")๊ฐ€ ๋  ๊ฒƒ์ด๋‹ค.)๋ฅผ ์ฑ„ํƒํ•œ๋‹ค.

Training

Training word set: split into two categories, Semantic and Syntactic.

Corpus: Google News (about 6B words)

During training, 300M words for 1 epoch gave better performance than 100M words for 3 epochs.

Performance

Examples of the Learned Relationships

  • Incorrect answers such as uranium: plutonium can be seen.
  • USA:pizza, i.e., the representative food of the USA is pizza; as this shows, word vectors make it possible to infer facts that would not be easy to come up with otherwise.

Discussion

  • Emoji์˜ vector representation์€ ์–ด๋–ป๊ฒŒ ๋‚˜ํƒ€๋‚  ๊ฒƒ์ธ๊ฐ€?
  • ์—ฌ๋Ÿฌ ๊ฐ€์ง€ ์–ธ์–ด์˜ vector space ๊ฐ„์˜ ๊ด€๊ณ„๋ฅผ ์–ด๋–ป๊ฒŒ ์—ฐ๊ตฌํ•  ์ˆ˜ ์žˆ์„๊นŒ?