Text Classification - penny4860/study-note GitHub Wiki

1. Data analysis

  • Histogram
    • x-axis : label
    • y-axis : number of samples / sentence length
  • Box Plot
    • number of samples per label
  • Word cloud
    • visualizes the most frequently used words
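
The quantities behind these three plots can be computed before any plotting library is involved. The sketch below uses a made-up toy dataset (the `samples` list is an assumption, not data from these notes) and counts samples per label, sentence lengths, and word frequencies.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical toy dataset of (text, label) pairs; not from the original notes.
samples = [
    ("the movie was great", "positive"),
    ("terrible plot and bad acting", "negative"),
    ("great cast", "positive"),
    ("not worth watching at all", "negative"),
    ("loved it", "positive"),
]

# Histogram data: number of samples per label (x-axis: label, y-axis: count).
label_counts = Counter(label for _, label in samples)

# Sentence length (in whitespace-separated tokens) for each sample.
lengths = [len(text.split()) for text, _ in samples]
length_summary = {
    "min": min(lengths),
    "median": median(lengths),
    "mean": mean(lengths),
    "max": max(lengths),
}

# Word-cloud data: how often each word is used.
word_counts = Counter(word for text, _ in samples for word in text.split())

print(label_counts)
print(length_summary)
print(word_counts.most_common(3))
```

The resulting counters can be handed to whatever plotting tool is preferred; grouping `lengths` by label gives the data for a per-label box plot.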

2. Data preprocessing

  • Tokenizing with a morphological analyzer
    • stemming option : stem=True
    • sentence -> word list
  • Remove stopwords
  • vectorize
    • index conversion : tf.keras.preprocessing.text.Tokenizer
      • converts a word list into an index list
    • padding : tf.keras.preprocessing.sequence.pad_sequences
      • brings every index list to the same fixed length.
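
To make the vectorize step concrete, here is a pure-Python stand-in for what the Keras Tokenizer and pad_sequences utilities do. It is a sketch under assumptions: the word lists, `MAX_LEN`, and the helper names are invented for illustration, and padding is applied at the end of each list (Keras pads at the front unless `padding="post"` is passed).

```python
# Hypothetical tokenized sentences (word lists) after stopword removal.
word_lists = [
    ["movie", "great"],
    ["plot", "bad", "acting", "bad"],
    ["great", "cast"],
]
MAX_LEN = 5  # assumed fixed length


def build_word_index(word_lists):
    """Assign indices 1..N by descending frequency; 0 stays free for padding."""
    counts = {}
    for words in word_lists:
        for w in words:
            counts[w] = counts.get(w, 0) + 1
    # sorted() is stable, so words with equal counts keep first-seen order.
    ordered = sorted(counts, key=lambda w: -counts[w])
    return {w: i + 1 for i, w in enumerate(ordered)}


def to_indices(words, word_index):
    """word list -> index list; words missing from the index are skipped."""
    return [word_index[w] for w in words if w in word_index]


def pad(indices, max_len, value=0):
    """Truncate or right-pad an index list to exactly max_len entries."""
    return indices[:max_len] + [value] * max(0, max_len - len(indices))


word_index = build_word_index(word_lists)
padded = [pad(to_indices(words, word_index), MAX_LEN) for words in word_lists]
print(word_index)
print(padded)
```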

3. RNN text classifier

  • index list
  • Embedding Layer
  • LSTM -> LSTM
    • each step takes the previous step's output and the current step's Embedding Layer output as input.
  • fc + softmax
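
The forward pass of this stack can be sketched in NumPy to show how each LSTM step combines the previous step's output with the current input. This is only an illustration: the sizes, the random weights, and the choice to classify from the last step of the second LSTM are assumptions, and in practice the layers would come from a framework such as Keras.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, EMBED, HIDDEN, CLASSES = 20, 8, 16, 3  # assumed sizes


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_lstm(input_dim, hidden_dim):
    """Random weights for one LSTM layer; the 4 gates are stacked on the last axis."""
    return {
        "W": rng.normal(0, 0.1, (input_dim, 4 * hidden_dim)),   # input weights
        "U": rng.normal(0, 0.1, (hidden_dim, 4 * hidden_dim)),  # recurrent weights
        "b": np.zeros(4 * hidden_dim),
    }


def lstm_layer(xs, p):
    """Run one LSTM layer over a sequence xs of shape (steps, input_dim)."""
    hidden = p["U"].shape[0]
    h, c = np.zeros(hidden), np.zeros(hidden)
    outputs = []
    for x in xs:
        # Each step sees the current input x and the previous step's output h.
        i, f, g, o = np.split(x @ p["W"] + h @ p["U"] + p["b"], 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        outputs.append(h)
    return np.stack(outputs)


embedding = rng.normal(0, 0.1, (VOCAB, EMBED))
lstm1, lstm2 = init_lstm(EMBED, HIDDEN), init_lstm(HIDDEN, HIDDEN)
W_fc, b_fc = rng.normal(0, 0.1, (HIDDEN, CLASSES)), np.zeros(CLASSES)


def classify(index_list):
    x = embedding[np.asarray(index_list)]  # Embedding Layer: (steps, EMBED)
    h1 = lstm_layer(x, lstm1)              # first LSTM: (steps, HIDDEN)
    h2 = lstm_layer(h1, lstm2)             # second LSTM: (steps, HIDDEN)
    logits = h2[-1] @ W_fc + b_fc          # fc on the last step only
    exp = np.exp(logits - logits.max())    # numerically stable softmax
    return exp / exp.sum()


probs = classify([1, 10, 0, 0, 0])
print(probs.shape)
```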

4. CNN text classifier

  • index list : [sequence,]
    • [30]
  • embed list : [sequence, embed-size, 1]
    • [30, 64, 1]
  • cnn :
    • input tensor : [sequence, embed-size, 1]
    • kernel :
      • [kh-size, kw-size, input-depth, output-depth]
      • [1/2/3/4, embed-size, 1, n-filters]
        • filter with several kh-size values and concat the results
        • Naver category matching reportedly used four sizes: 1/2/3/4.
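    • output tensor : [sequence, 1, n-filters*4]
      • holds only if the sequence axis is padded so that every kh-size yields sequence rows; unpadded, a kernel of height kh-size yields sequence - kh-size + 1 rows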
  • max pooling
    • [sequence, 1, n-filters*4] ==> [1, 1, n-filters*4]
  • flatten
    • [1, 1, n-filters*4] ==> [n-filters*4,]
  • fc + softmax
    • [n-filters*4,] ==> [n-categories,]
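
The shape walk-through above can be checked with a small NumPy forward pass. This is a sketch under assumptions: `N_FILTERS`, `N_CATEGORIES`, the random weights, the ReLU, and the end-of-sequence zero padding (used so every kernel height keeps `SEQ` rows) are choices made for illustration, and the trailing channel axis of size 1 is dropped for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
SEQ, EMBED, N_FILTERS, N_CATEGORIES = 30, 64, 8, 5  # N_FILTERS, N_CATEGORIES assumed
KERNEL_HEIGHTS = (1, 2, 3, 4)


def conv_branch(embedded, kernel):
    """Convolve [SEQ, EMBED] with a [kh, EMBED, n_filters] kernel.

    The sequence axis is zero-padded at the end so the output keeps SEQ rows;
    the kernel spans the full embedding width, so the width collapses to 1.
    """
    kh = kernel.shape[0]
    padded = np.vstack([embedded, np.zeros((kh - 1, embedded.shape[1]))])
    out = np.empty((embedded.shape[0], kernel.shape[2]))
    for t in range(embedded.shape[0]):
        window = padded[t:t + kh]  # [kh, EMBED]
        out[t] = np.tensordot(window, kernel, axes=([0, 1], [0, 1]))
    return np.maximum(out, 0.0)    # ReLU (an assumption)


embedded = rng.normal(size=(SEQ, EMBED))  # embed list
kernels = [rng.normal(0, 0.1, (kh, EMBED, N_FILTERS)) for kh in KERNEL_HEIGHTS]

# One branch per kernel height, concatenated along the filter axis.
features = np.concatenate([conv_branch(embedded, k) for k in kernels], axis=1)
pooled = features.max(axis=0)             # max pooling over the sequence, then flatten
W_fc = rng.normal(0, 0.1, (pooled.size, N_CATEGORIES))
logits = pooled @ W_fc                    # fc
probs = np.exp(logits - logits.max())
probs /= probs.sum()                      # softmax

print(features.shape, pooled.shape, probs.shape)  # (30, 32) (32,) (5,)
```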

4.1. Text CNN implementation process

1) Organizing the training data

  • Save text / label pairs in a single text file
  • Read it in one go with pandas, or line by line with readlines().
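
A minimal sketch of the line-by-line route follows. The file layout (one `text<TAB>label` pair per line) and the file name are assumptions, since these notes do not specify a format.

```python
import os
import tempfile

# Assumed layout: one sample per line, "text<TAB>label".
rows = [("the movie was great", "positive"), ("bad acting", "negative")]

path = os.path.join(tempfile.mkdtemp(), "train.tsv")  # hypothetical file name
with open(path, "w", encoding="utf-8") as f:
    for text, label in rows:
        f.write(f"{text}\t{label}\n")

# Read the file back one line at a time.
texts, labels = [], []
with open(path, encoding="utf-8") as f:
    for line in f.readlines():
        text, label = line.rstrip("\n").split("\t")
        texts.append(text)
        labels.append(label)

print(texts)
print(labels)
```

With the same assumed layout, the pandas route would be a single call such as `pd.read_csv(path, sep="\t", names=["text", "label"])`.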

2) Text cleaning

  • For Korean, the text has to be cleaned with a morphological analyzer.
  • Remove particles / remove stopwords
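
The filtering step can be sketched on already-tagged tokens. The sample below is hand-written in the (word, tag) shape that a tagger such as KoNLPy's `Okt.pos(text, stem=True)` returns; the choice of tagger, the tags shown, and the stopword list are all assumptions, since these notes do not name a specific analyzer.

```python
# Hand-written (word, tag) pairs; not actual analyzer output.
tagged = [
    ("영화", "Noun"),
    ("가", "Josa"),
    ("정말", "Noun"),
    ("재미있다", "Adjective"),
    ("!", "Punctuation"),
]
STOPWORDS = {"정말"}                  # hypothetical stopword list
DROP_TAGS = {"Josa", "Punctuation"}  # particles and punctuation


def clean(tagged_tokens, stopwords=STOPWORDS, drop_tags=DROP_TAGS):
    """Keep words that are neither particles/punctuation nor stopwords."""
    return [
        word
        for word, tag in tagged_tokens
        if tag not in drop_tags and word not in stopwords
    ]


cleaned = clean(tagged)
print(cleaned)
```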

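3) Building the word2vec model

  • Use the gensim library to build a word2vec model that turns words into vectors.
from gensim.models import Word2Vec
# Keyword names for gensim >= 4.0 (3.x uses size instead of vector_size); window and min_count are left to the user.
model = Word2Vec(sentences=cleaned_text_list, vector_size=64, window=window, min_count=min_count)
# model.wv.key_to_index : the vocabulary (model.wv.vocab in gensim 3.x)
# Any word in the vocabulary can be looked up as a 64-d vector with model.wv[word].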

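4) Building the keras Tokenizer

  • Build a Tokenizer that maps each word to a single number (index).
from tensorflow.keras.preprocessing.text import Tokenizer
# The Tokenizer never assigns index 0 to a word; these notes reserve it for unknown words (and padding).
# Note: without oov_token, texts_to_sequences drops out-of-vocabulary words instead of emitting 0.
t = Tokenizer(num_words=voca_size + 1)
t.fit_on_texts(...)
# save to a json file, e.g. with t.to_json()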

5) Building the embedding matrix

  • Convert the word2vec embeddings built in step 3) into a matrix that keras can use.

  • Inputs and outputs

    1. Input
      • word2vec model : built in step 3)
      • keras Tokenizer : built in step 4)
    2. Output
      • embedding matrix : [voca_size + 1, 64]
        • row 0 : embedding vector for unknown
        • row 1 : embedding vector for the word at index 1 of the keras Tokenizer
  • word ---> index ---> embedding vector

    1. word ---> index
      • Tokenizer
    2. index ---> embedding vector
      • word2vec
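
A minimal sketch of this conversion, using plain dictionaries as stand-ins: `word_vectors` plays the role of the word2vec model's word-to-vector lookup and `word_index` the role of the Tokenizer's word-to-index map. Both, along with the small `EMBED` size and the zero vector for unknown words, are assumptions made for illustration.

```python
import numpy as np

EMBED = 4  # 64 in the notes; kept small here

# Stand-ins (assumptions) for the word2vec model and the Tokenizer vocabulary.
word_vectors = {"great": np.full(EMBED, 0.1), "bad": np.full(EMBED, -0.1)}
word_index = {"great": 1, "bad": 2, "cast": 3}


def build_embedding_matrix(word_index, word_vectors, embed_size):
    """Row i holds the vector of the word whose Tokenizer index is i.

    Row 0 (unknown) and words missing from word2vec stay all-zero.
    """
    matrix = np.zeros((len(word_index) + 1, embed_size))
    for word, idx in word_index.items():
        if word in word_vectors:
            matrix[idx] = word_vectors[word]
    return matrix


embedding_matrix = build_embedding_matrix(word_index, word_vectors, EMBED)
print(embedding_matrix.shape)
```

In Keras, a matrix like this could then seed an Embedding layer, for example through `embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix)`.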

6) Implementing the Keras Model

Embedding Layer -> cnn -> pooling -> fc + softmax

  • Embedding Layer
    1. ์ž…๋ ฅ
      • index list : [1, 10, 0, 0, 0]
    2. ์ถœ๋ ฅ
      • embedding vectors : (5, 64)
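
The Embedding Layer is, in effect, a row lookup into the embedding matrix. The NumPy sketch below reproduces the shapes from the example above; the random matrix and `VOCA_SIZE = 10` are stand-ins chosen only so that index 10 is valid.

```python
import numpy as np

VOCA_SIZE, EMBED = 10, 64
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(VOCA_SIZE + 1, EMBED))  # random stand-in

index_list = np.array([1, 10, 0, 0, 0])
embedded = embedding_matrix[index_list]  # one row per index
print(embedded.shape)  # (5, 64)
```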

7) Implementing the Batch Generator

  • (index_list, ys) must be passed to the model being trained.
  • For each batch
    • xs : raw text ---> cleaned text ---> index
      • cleaning function : use the morphological analyzer from step 2)
      • Tokenizer : use the keras tokenizer built in step 4)
    • ys : category text ---> category index ---> 1-hot
      • a category text : index table has to be built.
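
A generator along these lines could look like the sketch below. Everything named here is a hypothetical stand-in: `clean()` replaces the real cleaning function, `word_index` replaces the Keras Tokenizer, and the categories and `MAX_LEN` are invented for the example.

```python
# Hypothetical stand-ins for the cleaning function and the Tokenizer vocabulary.
def clean(text):
    return text.lower().split()


word_index = {"great": 1, "bad": 2, "movie": 3}
category_to_index = {"negative": 0, "positive": 1}  # category text : index table
MAX_LEN = 4


def encode_text(text):
    """raw text -> cleaned text -> fixed-length index list."""
    ids = [word_index[w] for w in clean(text) if w in word_index][:MAX_LEN]
    return ids + [0] * (MAX_LEN - len(ids))


def one_hot(category):
    """category text -> category index -> 1-hot vector."""
    vec = [0] * len(category_to_index)
    vec[category_to_index[category]] = 1
    return vec


def batch_generator(samples, batch_size):
    """Yield (xs, ys) batches from (raw text, category text) pairs."""
    for start in range(0, len(samples), batch_size):
        batch = samples[start:start + batch_size]
        xs = [encode_text(text) for text, _ in batch]
        ys = [one_hot(category) for _, category in batch]
        yield xs, ys


samples = [("Great movie", "positive"), ("Bad movie", "negative"), ("bad bad", "negative")]
batches = list(batch_generator(samples, batch_size=2))
print(batches)
```

For training over several epochs the generator would typically loop over the data repeatedly and shuffle it; that is left out here to keep the sketch short.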

8) Prediction

  • Input and output
    • input : raw text
    • output : category text
  • Prerequisites
    • cleaning function : the morphological analyzer used in step 2)
    • Keras Tokenizer : the keras tokenizer built in step 4)
    • trained model
    • category text : index table
  • Steps
    1. raw text ---> cleaned text
      • uses the cleaning function
    2. cleaned text ---> index
      • uses the Keras Tokenizer
    3. index ---> pred category index
      • model
    4. category index ---> category text
      • uses the category text : index table in reverse
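
The four steps can be chained as in the sketch below. All components are hypothetical stand-ins: `clean()` for the cleaning function, `word_index` for the Keras Tokenizer, and `fake_model()` for the trained model (it merely counts keywords so the example runs without a framework).

```python
# Hypothetical stand-ins for the prerequisites.
def clean(text):  # cleaning function
    return text.lower().split()


word_index = {"great": 1, "bad": 2, "movie": 3}  # Tokenizer vocabulary
category_to_index = {"negative": 0, "positive": 1}
# The category text : index table, used in reverse.
index_to_category = {i: c for c, i in category_to_index.items()}
MAX_LEN = 4


def fake_model(index_list):
    """Toy scorer standing in for the trained model: one score per category."""
    positive = index_list.count(word_index["great"])
    negative = index_list.count(word_index["bad"])
    return [negative, positive]


def predict(raw_text, model=fake_model):
    words = clean(raw_text)                                 # 1. raw text -> cleaned text
    ids = [word_index[w] for w in words if w in word_index][:MAX_LEN]
    ids += [0] * (MAX_LEN - len(ids))                       # 2. cleaned text -> index
    scores = model(ids)                                     # 3. index -> category scores
    best = max(range(len(scores)), key=scores.__getitem__)  #    argmax -> category index
    return index_to_category[best]                          # 4. category index -> text


print(predict("Great great movie"))
print(predict("bad movie"))
```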