Text Embedding

The input to an artificial neural network is numeric. To use text in deep learning, it must first be converted into a numeric vector. This conversion of a word/sentence/document into a vector is called embedding.

1. Statistics-Based Word Embedding

1-1. ๋ถ„ํฌ ๊ฐ€์„ค(distributional hypothesis)

  • ๋‹จ์–ด์˜ ์˜๋ฏธ๋Š” ์ฃผ๋ณ€ ๋‹จ์–ด์— ์˜ํ•ด ํ˜•์„ฑ๋œ๋‹ค
  • ๋‹จ์–ด๋ฅผ ๋ฒกํ„ฐ๋กœ ํ‘œํ˜„ํ•˜๋Š” ์ตœ๊ทผ ์—ฐ๊ตฌ๋„ ๋Œ€๋ถ€๋ถ„ ์ด ๊ฐ€์„ค์„ ๋”ฐ๋ผ ํ˜•์„ฑ๋จ

1-2. Distributed Representation

  • ๋‹จ์–ด์˜ ์˜๋ฏธ๋ฅผ ์ •ํ™•ํ•˜๊ฒŒ ํŒŒ์•…ํ•  ์ˆ˜ ์žˆ๋Š” ๋ฒกํ„ฐ ํ‘œํ˜„

1-3. Co-occurrence Matrix

  • For each word of interest, count how many times each other word appears in its surrounding window.
  • For the sentence "you say goodbye and I say hello.", with window size 1, the co-occurrence matrix can be written as the table below.
  • Similarity between vectors can then be computed with cosine similarity.
| word | you | say | goodbye | and | i | hello | . |
|------|-----|-----|---------|-----|---|-------|---|
| you  | 0   | 1   | 0       | 0   | 0 | 0     | 0 |
| say  | 1   | 0   | 1       | 0   | 1 | 1     | 0 |
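As a concrete illustration, here is a minimal Python sketch (the function and variable names are my own) that builds this co-occurrence matrix and compares two of its row vectors with cosine similarity:

```python
import numpy as np

def build_cooccurrence(tokens, window_size=1):
    """Build a co-occurrence matrix over the unique words of `tokens`."""
    vocab = sorted(set(tokens), key=tokens.index)  # preserve first-seen order
    word_to_id = {w: i for i, w in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(vocab)), dtype=np.int32)
    for i, word in enumerate(tokens):
        # count every word within `window_size` positions of `word`
        for j in range(max(0, i - window_size), min(len(tokens), i + window_size + 1)):
            if i != j:
                matrix[word_to_id[word], word_to_id[tokens[j]]] += 1
    return matrix, word_to_id

def cos_similarity(x, y, eps=1e-8):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y) + eps)

tokens = "you say goodbye and i say hello .".split()
C, word_to_id = build_cooccurrence(tokens, window_size=1)
print(C[word_to_id["you"]])   # [0 1 0 0 0 0 0], matching the table above
print(cos_similarity(C[word_to_id["you"]], C[word_to_id["i"]]))
```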

1-4. Pointwise Mutual Information (PMI)

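For reference, the standard definitions: PMI measures how much more often two words co-occur than they would by chance, and PPMI (positive PMI) clips negative values to zero, since PMI diverges to negative infinity when a co-occurrence count is zero.

```math
\mathrm{PMI}(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)} = \log_2 \frac{C(x, y) \cdot N}{C(x)\,C(y)},
\qquad
\mathrm{PPMI}(x, y) = \max\bigl(0,\ \mathrm{PMI}(x, y)\bigr)
```

Here C(x, y) is the co-occurrence count from the matrix above, C(x) is the count of word x, and N is the total number of words in the corpus.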

1-5. Dimensionality Reduction

  • A technique that reduces the dimensionality of vectors while preserving as much of the important information as possible.

1) ํŠน์ด๊ฐ’๋ถ„ํ•ด(Singular Value Decomposition, SVD)

  • X = USV^T
  • U and V are orthogonal matrices, and their column vectors are mutually orthogonal.
  • S is a diagonal matrix; its entries, the singular values, are sorted in descending order and can be regarded as the importance of the corresponding axes.
  • The vector dimensionality can be reduced by discarding the elements of S with small singular values (see the sketch below).
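A minimal NumPy sketch of this truncation, assuming the matrix to compress is a word-by-word co-occurrence (or PPMI) matrix like the one above:

```python
import numpy as np

# C: (V, V) co-occurrence (or PPMI) matrix; a random placeholder here.
C = np.random.rand(7, 7)

U, S, Vt = np.linalg.svd(C)      # S is returned already sorted in descending order

k = 2                             # target dimensionality
word_vectors = U[:, :k] * S[:k]   # keep only the k most important axes

print(word_vectors.shape)         # (7, 2): each word is now a 2-dimensional vector
```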

2. The Emergence of Word2vec

  • NLP (Natural Language Processing): the field concerned with making computers understand and analyze human language
    • e.g., a spam mail classifier using Naive Bayes
  • Such classifiers perform well in themselves, but they cannot capture how one word differs from another -> hence the idea of vectorizing words

NNLM -> RNNLM -> CBOW, Skip-gram (2013) -> word2vec

2-1. Models before word2vec

1) NNLM (Feed-Forward Neural Net Language Model)

(figure: NNLM architecture)

a. Structure

Input layer, projection layer, hidden layer, output layer

  1. One-hot encode the N words that precede the word currently being predicted.
  2. With vocabulary size V and projection layer size P, the vectors from step 1 are passed to the next layer through a V×P projection matrix.
  3. Treating the values from step 2 as input, pass them through a hidden layer of size H and compute, at the output layer, the probability of each word appearing next.
  4. Back-propagate the error to optimize the network's weights.
  • The word vectors produced by steps 1-4 are the values of the projection layer, so each word becomes a vector of size P (the forward pass is sketched below).
  • Almost all current neural-network-based word representation models descend from this model.

b. Drawbacks of NNLM

  • The parameter N, the number of context words, is fixed and must be chosen separately.
  • Only the preceding words are considered; the following words are ignored.
  • Slow (the fatal flaw).

c. Computational Cost

  • Projecting the words (step 2): N×P
  • Going from the projection layer to the hidden layer (step 3): N×P×H
  • Going from the hidden layer to the output layer, which requires a probability for every word (step 3): H×V

=> Number of parameters = N×P + N×P×H + H×V

  • Typically the vocabulary size V is around 10 million; N=10, P=500, H=500.

=> O(H×V) = O(5 billion)

d. Improvement

H×V -> H×ln(V) (e.g., via hierarchical softmax), leaving the dominant term: O(N×P×H) = O(2.5 million)

2) RNNLM (Recurrent Neural Net Language Model)

(figure: RNNLM architecture)

  • NNLM transformed into an RNN form (there is no projection layer).

a. Structure

Input, hidden, output layer

There is a recurrent connection in the hidden layer, so the previous time step's hidden state is fed back in as input.

  • U: used as the word embedding matrix
  • H: the size of the hidden layer

๊ฐ ๋‹จ์–ด๋Š” ๊ธธ์ด H์˜ vector๋กœ ํ‘œํ˜„๋œ๋‹ค.

b. Characteristics of RNNLM

  • Unlike NNLM, there is no need to decide in advance how many words of context to use.
  • Training proceeds by feeding in the words of the training text sequentially.
    • The recurrent part acts as a short-term memory, giving the effect of seeing the preceding words.

c. Computational Cost of RNNLM

  • Going from the input layer to the hidden layer: H
  • Computing the hidden(t-1) -> hidden(t) vector: H×H
  • Computing a probability for every word to produce the output: H×V
  • Complexity: O(H×H + H×V)

d. Improvement

  • V -> ln(V).
  • Complexity: O(H×H)
  • With H = 500 -> O(250,000)
  • Considering that NNLM was O(2.5 million), this is a reduction by roughly a factor of N.

2-2. Performance Comparison with word2vec

1) Analogy Reasoning Task

(figure: analogy reasoning task examples)

  • An experiment for comparing model performance.
  • Given an example pair such as (Athens, Greece) and a query word such as Oslo, the model must produce the corresponding answer (Norway).
  • semantic: questions about capitals, currencies, states, and cities
  • syntactic: questions about grammar

์œ„์˜ ํ‘œ์—์„œ๋Š” v(Greece)-v(Athens)+v(Oslo) = v(Norway)

2) Model Comparison

(figure: model comparison table)

  • Word vector dimensionality: 640

  • CBOW and Skip-gram produce better semantic and syntactic results than RNNLM and NNLM.

    • Skip-gram's accuracy on the syntactic questions is somewhat lower,
    • but on the semantic questions it scores far higher.

3) Results with the model fixed to a 300-dimensional Skip-gram

(figure: results for the 300-dimensional Skip-gram variants)

  • NEG-k: negative sampling with k sampled words
  • HS-Huffman: hierarchical softmax using a Huffman tree
  • NCE-k: Noise Contrastive Estimation, a method similar to negative sampling
    • Negative sampling was derived from this method by slightly changing the objective function.

Overall, negative sampling gives better results than hierarchical softmax.

Subsampling frequent words shortens training time and also improves performance.

However, hierarchical softmax scoring poorly here does not mean it is a bad method in practice!

  • In experiments on phrases, hierarchical softmax >> negative sampling.
  • Performance depends on the problem the method is applied to.

  • NNLM and RNNLM are good ways to vectorize words, but their training is far too slow.
  • word2vec appeared precisely to solve this problem.

3. Word Embedding model

  • A way to represent words as vectors of limited dimensionality.

    • Advantages
      • Similarity between word vectors can be computed with cosine similarity. (This fixes a drawback of one-hot encoding.)
      • Far fewer vector dimensions are needed.
  • Examples of what word embeddings capture

    • Words that appear near each other, or that are similar, have higher vector similarity.
      For example, if the phrase '아이폰 성능' ("iPhone performance") occurs frequently, the '아이폰' vector and the '성능' vector will end up in similar positions (the similarity between the two vectors will be high).
    • Relationships are preserved under vector addition and subtraction; the positions and distances of the vectors reflect actual relationships between words.
      For example, computing 의사 - 여자 + 남자 (doctor - woman + man) yields a result close to the vector for '남의사' (male doctor).
  • Recent research trends in deep-learning-based NLP, as of August 2017.

  • Cosine Similarity
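For reference, the cosine similarity between two vectors a and b is:

```math
\mathrm{sim}(a, b) = \cos\theta = \frac{a \cdot b}{\lVert a \rVert\,\lVert b \rVert}
```

It ranges from -1 (opposite direction) to 1 (same direction) and ignores vector magnitude, which is why it is the standard choice for comparing embeddings.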

3-1. Word2Vec

One of the methods for embedding a word as a single vector. A point in a vector space of some chosen dimensionality (128 dimensions, for example) represents one word. Given a word, training narrows the distance to the words that appear around it and widens the distance to the words that do not.
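"Narrow the distance to neighbors, widen it to non-neighbors" is made precise in word2vec's skip-gram with negative sampling objective (Mikolov et al., 2013), which for a center word w_I and an observed context word w_O maximizes

```math
\log \sigma\!\left(v'_{w_O} \cdot v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\left[\log \sigma\!\left(-v'_{w_i} \cdot v_{w_I}\right)\right]
```

where σ is the sigmoid. The first term rewards a high dot product with the true context word; the second samples k noise words from a distribution P_n(w) and rewards a low dot product with them.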

1) Korean Word2Vec

Word2vec model files come in bin or tsv format, and can be loaded easily with the gensim library.

Pre-trained word vectors of 30+ languages is a collection of word vectors built from Wikipedia database backup dumps, covering some 30 languages. It is practically the only publicly available set of Korean word vectors. You can also build new word vectors with the code provided there. The model provided by default contains at most 20,000 words, and this limit can be raised.

```python
>>> import gensim
>>> model = gensim.models.Word2Vec.load('ko.bin')  # load the word vectors

>>> model['인공지능']  # print the word vector for '인공지능' (AI) as a numpy array
array([-0.06324194, -0.6008734 , -0.26231512,  1.3287116 , -0.32701576,
        0.5631857 , -0.7717267 ,  0.31624222, -0.02853541, -0.39608407,
(...omitted...)
       -0.2817213 ,  0.3327664 ,  0.15192133,  0.14019588, -0.8833335 ],
      dtype=float32)

>>> model.most_similar(['한국어', '미국'], ['한국'])
# the words closest to "한국어 + 미국 - 한국" (Korean + USA - Korea),
# i.e. asking for the language of the USA
[('영어', 0.6886879801750183), ('일본어', 0.549891471862793), ('영문', 0.5408982038497925),
 ('언어', 0.5347272157669067), ('독일어', 0.5142326951026917), ('프랑스어', 0.5100058317184448),
 ('모국어', 0.47583508491516113), ('스페인어', 0.46559274196624756),
 ('중국어', 0.4549825191497803), ('영어권', 0.4537474513053894)]

>>> model.n_similarity('한국', '한국어')
# cosine similarity between the '한국' (Korea) and '한국어' (Korean language) vectors
0.76380414

>>> model.n_similarity('한국', '영어')
# note that the 한국-한국어 similarity is larger than the 한국-영어 (Korea-English) one
0.08317596

# The similarity between '미국' (USA) and '영어' (English) is low, while that between
# '영국' (UK) and '영어' is high. English is the official language of the USA, yet the
# similarity came out low: there can be cases like this where word2vec is not well trained.
>>> model.n_similarity('미국', '영어')
0.11053801
>>> model.n_similarity('영국', '영어')
0.42289335
```

For more detailed usage, see the gensim model examples.

  • 한국어 Word2Vec: a post explaining how to build a Korean word2vec model.
  • 한국어 Word2Vec 데모: a demo that adds and subtracts word vectors and returns the closest words. Its quality is quite usable.

3-2. Doc2Vec

A method for embedding a sentence, paragraph, or document as a single vector. It extends Word2Vec: the document vector is trained to move closer to the vectors of the words that appear in that sentence/paragraph/document. Like word2vec, doc2vec is implemented in gensim, so the library can be used directly.

Given two sentences, the more similar the word vectors of their words, the more similar the resulting document vectors. For example, between "고양이가 길을 건넌다." ("A cat crosses the road.") and "강아지가 길을 건넌다." ("A puppy crosses the road."), the word vectors for 고양이 (cat) and 강아지 (puppy) are similar, so the document vectors will come out similar as well. Between "강아지가 길을 건넌다" and "로봇이 길을 건넌다" ("A robot crosses the road"), however, the word vectors for 강아지 and 로봇 (robot) are less similar, so the document vectors will also be less similar than in the previous example.
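A minimal gensim sketch of this idea, assuming gensim 4's API (`model.dv`) and a toy pre-tokenized corpus; the corpus, tags, and parameter values here are illustrative only:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus: each document is a list of tokens plus a unique tag.
corpus = [
    TaggedDocument(['고양이가', '길을', '건넌다'], tags=['doc0']),  # cat sentence
    TaggedDocument(['강아지가', '길을', '건넌다'], tags=['doc1']),  # puppy sentence
    TaggedDocument(['로봇이', '길을', '건넌다'], tags=['doc2']),    # robot sentence
]

# Train document vectors alongside word vectors.
model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=100)

# Cosine similarity between learned document vectors.
print(model.dv.similarity('doc0', 'doc1'))  # cat vs. puppy sentence
print(model.dv.similarity('doc1', 'doc2'))  # puppy vs. robot sentence

# Infer a vector for an unseen, pre-tokenized sentence.
vec = model.infer_vector(['강아지가', '공원을', '걷는다'])
```

On a corpus this small the similarities are essentially noise; the point is only the API shape: tagged documents in, per-document vectors out.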

3-3. Sent2Vec

  • A method for embedding one sentence as one vector.
  • Training method
    • A modified CBOW is used to learn the context.
      • The window size is fixed to the whole sentence: to preserve the meaning of the sentence as a whole, all n-grams of the sentence are combined during training.
      • Subsampling is not used, because it could prevent important n-gram pairs from being generated.
    • The modified CBOW is trained with character n-gram pairs and word n-gram pairs as the context. Here, n-gram does not mean the usual contiguous n-gram; rather, n is the maximum distance allowed within a bi-gram. For Korean, however, character n-grams are not used, owing to the sheer number of possible character combinations.
    • The Sent2Vec embedding itself is obtained by averaging all available context vectors (see the sketch after this list).
  • Converting sentences to Sent2Vec embeddings and taking the cosine similarity between them yields a sentence-to-sentence similarity.


โš ๏ธ **GitHub.com Fallback** โš ๏ธ