Word2Vec

1. Word2vec์˜ ํ•„์š”์„ฑ

NLP (Natural Language Processing): the field of making computers understand and analyze the languages humans use

e.g., a spam mail classifier using Naive Bayes

  • ์„ฑ๋Šฅ ์ž์ฒด๋Š” ์ข‹์ง€๋งŒ, ๋‹จ์–ด๊ฐ€ ๋‹ค๋ฅธ ๋‹จ์–ด์™€ ์–ด๋–ค ์ฐจ์ด์ ์„ ๊ฐ€์ง€๋Š”์ง€๋Š” ์ดํ•ดํ•  ์ˆ˜ ์—†๋‹ค -> ๋ฒกํ„ฐํ™” ๊ณ ์•ˆ

2. One-hot Encoding

  • One-hot vector: a vector in which exactly one element is 1 and all the rest are 0
  • One-hot encoding
    • Converts each word into a fixed-length one-hot vector.
    • Each word takes one dimension of its own, so the vector dimensionality grows in proportion to the vocabulary size.
  • Fixed-length vectors can be used to train a neural network.
  • Word2Vec also uses one-hot vectors for its input layer and output layer during training.
  • An encoding that ignores any relationship between words.
    • Every pair of word vectors is orthogonal.
    • The cosine similarity between any two words is 0, so relationships between words cannot be expressed in the vectors.

example: "You say goodbye and I say hello."๋ผ๋Š” ๋ฌธ์žฅ์ด ์žˆ์„ ๋•Œ one-hot vector์˜ ์˜ˆ์ œ

  • you = [1, 0, 0, 0, 0, 0, 0]
  • goodbye = [0, 0, 1, 0, 0, 0, 0]
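
A minimal sketch of this encoding in Python (ordering the vocabulary by first appearance is an assumption; any fixed ordering works):

```python
import numpy as np

tokens = "you say goodbye and i say hello .".split()
vocab = {w: i for i, w in enumerate(dict.fromkeys(tokens))}  # 7 unique tokens

def one_hot(word):
    """Return the fixed-length one-hot vector for a word."""
    v = np.zeros(len(vocab))
    v[vocab[word]] = 1.0
    return v

print(one_hot("you"))      # [1. 0. 0. 0. 0. 0. 0.]
print(one_hot("goodbye"))  # [0. 0. 1. 0. 0. 0. 0.]
```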

3. Training Word2vec

Distributional Hypothesis: words with similar distributions (contexts) have similar meanings

Word2vec (2013)

  • Not a radical departure from existing neural-network-based training methods, but it cut the amount of computation dramatically.

Word2vec์„ ํ•™์Šต์‹œํ‚ค๊ธฐ ์œ„ํ•œ ๋„คํŠธ์›Œํฌ ๋ชจ๋ธ 2๊ฐ€์ง€

  • CBOW (Continuous Bag-of-words)
  • Skip-gram

1) CBOW

(figure: CBOW network architecture)

'์ง‘ ์•ž์—์„œ ์•„์ด์Šคํฌ๋ฆผ์„ ์‚ฌ ๋จน์—ˆ๋Š”๋ฐ, ___ ์‹œ๋ ค์„œ ๋„ˆ๋ฌด ๋จน๊ธฐ๊ฐ€ ํž˜๋“ค์—ˆ๋‹ค.'

Given a target word, build a network that takes the C surrounding words (C/2 on each side) as input and predicts the target word.

a. Structure

Input layer, projection layer (shown as the hidden layer in the figure), output layer

  1. Input layer์—์„œ projection layer๋กœ ๊ฐˆ ๋•Œ VxNํฌ๊ธฐ์˜ projection Matrix W๋ฅผ ๋งŒ๋“ ๋‹ค
    • N = projection layer์˜ ๊ธธ์ด / ์‚ฌ์šฉํ•  vector์˜ ํฌ๊ธฐ
  2. projection layer์—์„œ output layer๋กœ ๊ฐˆ ๋•Œ NxVํฌ๊ธฐ์˜ weight Matrix W'๋ฅผ ๋งŒ๋“ ๋‹ค.
    • W, W'๋Š” ๋ณ„๊ฐœ์˜ ํ–‰๋ ฌ
  3. Input์—์„œ ๋‹จ์–ด๋ฅผ one-hot encoding์œผ๋กœ ๋„ฃ์–ด์ค€๋‹ค.
  4. ์—ฌ๋Ÿฌ๊ฐœ์˜ ๋‹จ์–ด๋ฅผ proejctionํ•ด ๊ทธ ๋ฒกํ„ฐ์˜ ํ‰๊ท ์„ ๊ตฌํ•ด์„œ projection layer์— ๋ณด๋‚ธ๋‹ค
  5. W'๋ฅผ ๊ณฑํ•ด์„œ output layer๋กœ ๋ณด๋‚ด softmax๊ณ„์‚ฐ์„ ํ•œ๋‹ค.
  6. 5.์˜ ๊ฒฐ๊ณผ๋ฅผ ์ •๋‹ต์˜ one-hot encoding๊ณผ ๋น„๊ตํ•˜์—ฌ error๋ฅผ ๊ณ„์‚ฐํ•œ๋‹ค.
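
A minimal numpy sketch of the forward pass in steps 1-6 above; the sizes, random initialization, and word indices are toy assumptions, not values from the original:

```python
import numpy as np

V, N, C = 7, 5, 2                        # vocab size, projection size, context size
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(V, N))   # step 1: V x N projection matrix
W_prime = rng.normal(scale=0.1, size=(N, V))  # step 2: N x V weight matrix

context_ids = [0, 2]                     # step 3: one-hot inputs, given as indices
target_id = 1

h = W[context_ids].mean(axis=0)          # step 4: project each word and average
scores = h @ W_prime                     # step 5: multiply by W'
probs = np.exp(scores) / np.exp(scores).sum()  # step 5: softmax
error = -np.log(probs[target_id])        # step 6: compare with the one-hot answer
print(error)
```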

b. CBOW์˜ ์—ฐ์‚ฐ๋Ÿ‰

  • C๊ฐœ์˜ ๋‹จ์–ด๋ฅผ Projection : CxN
  • projection layer์—์„œ output layer๋กœ ๊ฐ€๋Š” ๊ฒƒ : NxV

Total cost: CxN + NxV

If V is improved to ln(V), the total cost becomes CxN + Nxln(V)

c. Why CBOW is fast

C is usually set to around 10.

Total cost: proportional to (the projection layer size N) * (ln(V))

Even with C = 10, N = 500, and V = 1,000,000, this comes out to only on the order of 10,000 operations.
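
Plugging the numbers in: CxN + Nxln(V) = 10x500 + 500xln(1,000,000) ≈ 5,000 + 500x13.8 ≈ 11,900 operations, around 10^4, versus CxN + NxV ≈ 500,005,000 without the ln(V) improvement.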

2) Skip-gram

(figure: Skip-gram network architecture)

From a single given word, infer whether each of several surrounding words appears near it.

  • ๊ฐ€๊นŒ์ด ์œ„์น˜ํ•ด ์žˆ๋Š” ๋‹จ์–ด์ผ ์ˆ˜๋ก ํ˜„์žฌ ๋‹จ์–ด์™€ ๊ด€๋ จ์ด ๋” ๋งŽ์€ ๋‹จ์–ด์ผ ๊ฒƒ
  • ๋ฉ€๋ฆฌ ๋–จ์–ด์ ธ์žˆ๋Š” ๋‹จ์–ด์ผ์ˆ˜๋ก ๋‚ฎ์€ ํ™•๋ฅ ๋กœ ํƒํ•œ๋‹ค

Only the direction is reversed relative to CBOW; the way it operates is similar.

a. Skip-gram์˜ ์—ฐ์‚ฐ๋Ÿ‰

C๊ฐœ์˜ ๋‹จ์–ด๋ฅผ ์ƒ˜ํ”Œ๋งํ•  ๋•Œ

  • Projecting the current word: N
  • Computing the output: NxV (Nxln(V) after the improvement)
  • This is done for each of the C words, so multiply by C

Cost: C(N + Nxln(V))

  • ๋ช‡ ๊ฐœ์˜ ๋‹จ์–ด๋ฅผ ์ƒ˜ํ”Œ๋งํ•˜๋ƒ์— ๋”ฐ๋ผ ๊ณ„์‚ฐ๋Ÿ‰์ด ๋น„๋ก€ํ•˜๋ฏ€๋กœ CBOW๋ณด๋‹ค๋Š” ํ•™์Šต์ด ๋Š๋ฆฌ๋‹ค
  • ์‹คํ—˜์—์„œ๋Š” Skip-gram์ด ๋” ์ข‹์€ ๊ฒฐ๊ณผ๋ฅผ ๋ณด์ธ๋‹ค

3) V to ln(V)

CBOW, Skip-gram์„ ๊ทธ๋Œ€๋กœ ๋Œ๋ฆฌ๋ฉด ํ•™์Šต์ด ์˜ค๋ž˜ ๊ฑธ๋ฆฐ๋‹ค : V ๋•Œ๋ฌธ.

๋„คํŠธ์›Œํฌ์˜ output layer์—์„œ softmax๋ฅผ ํ•˜๊ธฐ ์œ„ํ•ด์„ 

  • ๊ฐ ๋‹จ์–ด์— ๋Œ€ํ•ด ์ „๋ถ€ ๊ณ„์‚ฐ์„ ํ•ด์„œ normalization์„ ํ•ด์•ผ ํ•œ๋‹ค
  • ์ถ”๊ฐ€์ ์ธ ์—ฐ์‚ฐ์ด ์—„์ฒญ๋‚˜๊ฒŒ ๋Š˜์–ด๋‚œ๋‹ค

To avoid this, techniques emerged that reduce the cost of this part.

a. Hierarchical Softmax

a-1. Structure

Uses a multinomial distribution function in place of the softmax

(figure: binary tree with one word at each leaf)

  • ๋‹จ์–ด ํ•˜๋‚˜์”ฉ์„ leave๋กœ ๊ฐ€์ง€๋Š” binary tree๋ฅผ ๋งŒ๋“ ๋‹ค
    • completeํ•  ํ•„์š”๋Š” ์—†์ง€๋งŒ, full์ด๋ฉด ์ข‹๋‹ค
      • complete : ๋งˆ์ง€๋ง‰ ๋ ˆ๋ฒจ ์ œ์™ธ ๋ชจ๋‘ ๊ฝ‰์ฑ„์›Œ์žˆ๊ณ , ์™ผ์ชฝ๋ถ€ํ„ฐ ์ฑ„์›Œ์ง„๋‹ค
      • full : ๋ชจ๋“  ๋…ธ๋“œ๊ฐ€ 0๊ฐœ/2๊ฐœ์˜ ์ž์‹ ๋…ธ๋“œ๋ฅผ ๊ฐ€์ง
  • ํ•ด๋‹นํ•˜๋Š” ๋‹จ์–ด w1์˜ ํ™•๋ฅ ์„ ๊ณ„์‚ฐํ•  ๋•Œ, root์—์„œ๋ถ€ํ„ฐ ๊ทธ leaf๊นŒ์ง€ ๊ฐ€๋Š” path์— ๋”ฐ๋ผ ํ™•๋ฅ ์„ ๊ณฑํ•œ๋‹ค

p(w | wi) = prod_{j=1}^{L(w)-1} sigma( [[ n(w, j+1) = ch(n(w, j)) ]] * ( v'_{n(w,j)} · h ) )

  • w : a word. Every word appears at exactly one leaf

  • L(w) : the length of the path to reach the leaf w

  • n(w,i) : the i-th node encountered on the path from the root to the leaf w

  • n(w,1) : the root

  • n(w, L(w)) : w itself

  • Each of these nodes has a vector v_n(w,i) attached to it.

  • ch(node) : an arbitrary but fixed child of node; here, the left child

  • [[x]] : 1 if x is true, -1 otherwise

  • h : the value vector of the hidden (projection) layer

  • sigma(x) : the sigmoid function 1/(1+exp(-x))

When hierarchical softmax is used, the W' matrix from CBOW/Skip-gram is not used.

๋Œ€์‹  V-1๊ฐœ์˜ internal node๊ฐ€ ๊ฐ๊ฐ ๊ธธ์ด N์งœ๋ฆฌ weight vector๋ฅผ ๊ฐ€์ง€๊ฒŒ ๋œ๋‹ค. : v'_i ๋ผ ํ•˜๊ณ  ํ•™์Šต์—์„œ update

  • P(w|wi) : the conditional probability that the output word is w when the input word is wi

Once this is computed, take -log of P(w|wi) (so that minimizing it maximizes the probability), completing the objective function.
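
A minimal sketch of evaluating p(w|wi) along a root-to-leaf path; the tree, the path encoding, and the node vectors are toy assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hs_probability(h, path_nodes, path_signs, node_vectors):
    """Multiply sigmoid(sign * v'_node . h) along the root-to-leaf path.
    sign is +1 when the path steps to ch(node) (here: the left child), else -1,
    i.e. the [[n(w, j+1) = ch(n(w, j))]] term in the formula above."""
    p = 1.0
    for node, sign in zip(path_nodes, path_signs):
        p *= sigmoid(sign * (node_vectors[node] @ h))
    return p

N = 5
rng = np.random.default_rng(0)
node_vectors = rng.normal(size=(3, N))  # V-1 = 3 internal nodes for V = 4 words
h = rng.normal(size=N)

# e.g. the word whose path is root -> left -> right: nodes (0, 1), signs (+1, -1)
print(hs_probability(h, [0, 1], [+1, -1], node_vectors))
```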

a-2. Computational cost of p(w)

  • Each step takes a dot product of two length-N vectors: N

  • With a well-built binary tree, the average distance from the root to a leaf is O(ln(V))

Total cost: Nxln(V)

error function์„ categorical cross-entropy๋กœ ์“ธ ๊ฒฝ์šฐ ์ตœ์ข… error๋ฅผ ๊ณ„์‚ฐํ•˜๊ธฐ ์œ„ํ•ด ํ•ด๋‹นํ•˜๋Š” ๋‹จ์–ด์— ๋Œ€ํ•ด์„œ๋งŒ ํ™•๋ฅ ์„ ๊ณ„์‚ฐํ•˜๋ฉด ๋˜๋ฏ€๋กœ, ๋‹ค๋ฅธ ์ถ”๊ฐ€์ ์ธ ์—ฐ์‚ฐ ์—†์ด O(Nxln(V))๋งŒ์œผ๋กœ ๋๋‚œ๋‹ค

a-3. Checking the math

For a given hidden layer value, consider the sum of the probabilities of all words, sum{ p(wi | hidden layer) }.

full binary tree๋ผ ๊ฐ€์ •ํ•˜๊ณ , v_node์™€ h์˜ ๋‚ด์ ๊ฐ’์„ x๋ผ ํ•  ๋•Œ,

  • ํŠน์ • node์—์„œ ์™ผ์ชฝ child๋กœ ๊ฐˆ ํ™•๋ฅ  : sigmoid(x)
  • ํŠน์ • node์—์„œ ์˜ค๋ฅธ์ชฝ child๋กœ ๊ฐˆ ํ™•๋ฅ  : sigmoid(-x) = 1-sigmoid(x)
  • ์œ„ ๋‘๊ฐœ๋ฅผ ๋”ํ•˜๋ฉด 1

That is, sum{ p(wi | hidden layer) } = 1.
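
A quick numeric check of this claim on a toy depth-2 full binary tree with V = 4 leaves (the vectors are random assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

N = 5
rng = np.random.default_rng(0)
v = rng.normal(size=(3, N))          # 3 internal nodes: root=0, left=1, right=2
h = rng.normal(size=N)

x0, x1, x2 = v[0] @ h, v[1] @ h, v[2] @ h
leaves = [
    sigmoid(x0) * sigmoid(x1),       # path: left, left
    sigmoid(x0) * sigmoid(-x1),      # path: left, right
    sigmoid(-x0) * sigmoid(x2),      # path: right, left
    sigmoid(-x0) * sigmoid(-x2),     # path: right, right
]
print(sum(leaves))                   # 1.0 up to floating-point error
```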

softmax ๊ณ„์‚ฐ์ด ์˜ค๋ž˜ ๊ฑธ๋ฆฌ๋Š” ์ด์œ  : ํ™•๋ฅ  ๊ณ„์‚ฐ์„ ์œ„ํ•ด ๋ชจ๋“  ๊ฒฐ๊ณผ ํ•ฉ์„ 1๋กœ ๋งŒ๋“ค๊ธฐ ์œ„ํ•จ

  • output์— ๋Œ€ํ•ด ์ผ์ผ์ด ๊ณ„์‚ฐ์„ ํ•ด ์ „์ฒด ํ•ฉ์œผ๋กœ normalizeํ–ˆ๊ธฐ ๋•Œ๋ฌธ์— V๋งŒํผ์˜ ๊ณ„์‚ฐ์ด ๋” ํ•„์š”ํ–ˆ๋˜ ๊ฒƒ
  • Hierarchical softmax๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด ์ „์ฒด ํ™•๋ฅ ์— ๋Œ€ํ•œ ๊ณ„์‚ฐ ์—†์ด ์ „์ฒด ํ•ฉ์„ 1๋กœ ๋งŒ๋“ค ์ˆ˜ ์žˆ๋‹ค.

a-4. Wrap-up

  • word2vec์—์„œ๋Š” Binary Tree๋กœ Binary Huffman Tree ์‚ฌ์šฉ
    • ์ž์ฃผ ๋“ฑ์žฅํ•˜๋Š” ๋‹จ์–ด๋ฅผ ์งง์€ path๋กœ ๋„๋‹ฌํ•  ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ
    • full binary tree๋‹ค

b. Negative Sampling

Hierarchical softmax์˜ ๋Œ€์ฒด์žฌ๋กœ ์‚ฌ์šฉ

Idea: the softmax computes over far too many words, so just sample a few of them.

  • Instead of computing over the whole vocabulary, draw only a subset of words, compute the softmax over them, and normalize
  • Cost: NxV -> NxK (K = number of drawn samples)
    • The target word must always be included in the computation: the positive sample
    • The rest: negative samples
      • Performance varies with how the negative samples are chosen.
      • Usually decided empirically, based on what performs well

b-1. error function

The new error function (the negative-sampling objective):

log sigma( v'_w · h ) + sum_{wj in negative samples} log sigma( -v'_wj · h )

The weights are adjusted to maximize this expression.

  • positive sample term
    • log sigma( v'_w · h ) (pushes the target word's score up)
  • negative sample term
    • log sigma( -v'_wj · h ) (pushes the sampled words' scores down)

See the paper:

  • Pick the word currently being looked at, w, and the target (context) word c -> (w,c)
  • positive sample : the probability that the pair (w,c) appears in the corpus
  • negative sample : the probability that the pair (w,c) does not appear in the corpus
  • Take the log of each and add them together.

Negative samples are usually drawn by defining a noise distribution and drawing a fixed number of words from it.

  • ๊ทผ๋ฐ ๋…ผ๋ฌธ์—์„œ๋Š” ์—ฌ๋Ÿฌ ๋ถ„ํฌ๋ฅผ ์‹คํ—˜์ ์œผ๋กœ ์‚ฌ์šฉํ•ด๋ณธ ๊ฒฐ๊ณผ unigram distribution์˜ 3/4์Šน์„ ์ด์šฉํ•˜๋ฉด unigram, uniform๋ณด๋‹ค ์ข‹์€ ๊ฒฐ๊ณผ๋ฅผ ๋ณด์ธ๋‹ค
    • unigram distribution : ๋‹จ์–ด๊ฐ€ ๋“ฑ์žฅํ•˜๋Š” ๋น„์œจ์— ๋น„๋ก€ํ•˜๊ฒŒ ํ™•๋ฅ ์„ ์„ค์ •ํ•˜๋Š” ๋ถ„ํฌ
    • ๊ฐ ํ™•๋ฅ ์„ 3/4์Šน ํ•˜๊ณ  normalization factor๋กœ ๋‚˜๋ˆ„์–ด ๋‹จ์–ด๊ฐ€ ๋“ฑ์žฅํ•  ํ™•๋ฅ ๋กœ ์‚ฌ์šฉ

c. Subsampling frequent words

Hierarchical softmax and negative sampling both aim to reduce the computational complexity of the model itself.

์ถ”๊ฐ€์ ์œผ๋กœ 'the', 'a', 'in'๊ณผ ๊ฐ™์ด ์ž์ฃผ ๋“ฑ์žฅํ•˜๋Š” ๋‹จ์–ด๋“ค์„ ํ™•๋ฅ ์ ์œผ๋กœ ์ œ์™ธ

  • ํ•™์Šต ์†๋„, ์„ฑ๋Šฅ ํ–ฅ์ƒ

c-1. how to

P(wi) = 1 - sqrt( t / f(wi) )

  • f(w) : the frequency with which word w appears
  • t : a threshold; only words whose frequency exceeds it are candidates for exclusion
    • in the paper, t = 10^-5 gave the best results

During training, each occurrence of a word is excluded with probability P(wi).
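
A sketch of this subsampling rule, assuming f(w) is the relative frequency of w in the token stream (t is set artificially high here so the effect shows on a toy input):

```python
import random
from collections import Counter

def subsample(tokens, t=1e-5):
    """Drop each occurrence of w with probability P(w) = 1 - sqrt(t / f(w))."""
    counts = Counter(tokens)
    total = len(tokens)
    kept = []
    for w in tokens:
        f = counts[w] / total
        p_discard = max(0.0, 1.0 - (t / f) ** 0.5)  # only discards when f > t
        if random.random() >= p_discard:
            kept.append(w)
    return kept

print(subsample("the cat sat on the the mat".split(), t=0.1))
```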

