TfidfVectorizer

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

tfidf = TfidfVectorizer(tokenizer=tokenizer, ngram_range=(2, 2), min_df=2, stop_words=stops,
                        max_features=20000, lowercase=True, sublinear_tf=True, use_idf=True)
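The Pipeline import above is otherwise unused, so here is a minimal sketch of the usual pattern for combining such a vectorizer with a classifier. The LogisticRegression step and the toy texts/labels are illustrative assumptions, not part of the original snippet (which relies on an external tokenizer and stops):

```python
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data, purely for illustration
texts = ['good movie', 'bad movie', 'great film', 'terrible film']
labels = [1, 0, 1, 0]

pipe = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)),
    ('clf', LogisticRegression()),
])
pipe.fit(texts, labels)
print(pipe.predict(['a truly great movie']))
```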

Can these tfidf_matrix objects be saved and loaded back later for reuse?

If the preprocessing is not done in advance, building the matrix will probably take a long time.
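Yes, they can: a fitted vectorizer is an ordinary Python object, so it can be persisted with joblib, and the sparse matrix can be stored with scipy. A minimal sketch, assuming a placeholder corpus and file names:

```python
import joblib
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ['first document', 'second document', 'and a third one']  # placeholder corpus

tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(texts)

# Persist both the fitted vectorizer and the precomputed matrix
joblib.dump(tfidf, 'tfidf_vectorizer.joblib')
sparse.save_npz('tfidf_matrix.npz', tfidf_matrix)

# Later: reload instead of re-fitting
tfidf = joblib.load('tfidf_vectorizer.joblib')
tfidf_matrix = sparse.load_npz('tfidf_matrix.npz')
new_vectors = tfidf.transform(['a new document'])  # reuses the saved vocabulary and idf weights
```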

https://datascienceschool.net/view-notebook/3e7aadbf88ed4f0d87a76f9ddc925d69/

Scikit-Learn's document preprocessing features

BOW (Bag of Words)

The most basic way to convert a document into a numeric vector is BOW (Bag of Words). In the BOW approach, a fixed vocabulary {t_1, t_2, ..., t_m} is built from the whole document collection {d_1, d_2, ..., d_n}, and each individual document d_i is represented by recording which vocabulary words it contains:

x_{i,j} = frequency of word t_j in document d_i

or, in the binary variant,

x_{i,j} = 0 if word t_j does not appear in document d_i, and 1 if it does.
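This presence-only (0/1) encoding can be reproduced with CountVectorizer's binary=True option; the two toy sentences below are just for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['the cat sat', 'the cat sat on the cat']
vect = CountVectorizer(binary=True).fit(docs)
print(vect.vocabulary_)                 # indices assigned alphabetically: cat=0, on=1, sat=2, the=3
print(vect.transform(docs).toarray())   # [[1 0 1 1]
                                        #  [1 1 1 1]]  counts are capped at 1
```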

Scikit-Learn's feature_extraction subpackage and feature_extraction.text subpackage provide the following document preprocessing classes.

DictVectorizer:

builds a BOW vector from a dictionary in which the count of each word has already been recorded.

CountVectorizer:

tokenizes a collection of documents, counts each word, and produces BOW-encoded vectors.

TfidfVectorizer:

similar to CountVectorizer, but produces BOW vectors whose word weights are adjusted with the TF-IDF scheme.

HashingVectorizer:

uses a hash function to build BOW vectors with little memory and high speed.

DictVectorizer

DictVectorizer is provided by the feature_extraction subpackage. It takes dictionaries that record how often each word occurs in a document and converts them into BOW-encoded numeric vectors.

In [1]:
from sklearn.feature_extraction import DictVectorizer
v = DictVectorizer(sparse=False)
D = [{'A': 1, 'B': 2}, {'B': 3, 'C': 1}]
X = v.fit_transform(D)
X
array([[1., 2., 0.],
       [0., 3., 1.]])

In [2]:
v.feature_names_
['A', 'B', 'C']

In [3]:
v.transform({'C': 4, 'D': 3})
array([[0., 0., 4.]])

Note that the feature 'D', which was not seen during fit, is silently ignored by transform.

CountVectorizer

CountVectorizer performs the following three steps.

1. Convert each document into a list of tokens.
2. Count how many times each token appears in each document.
3. Convert each document into a BOW-encoded vector.

In [4]:
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
    'The last document?',
]
vect = CountVectorizer()
vect.fit(corpus)
vect.vocabulary_
{'this': 9, 'is': 3, 'the': 7, 'first': 2, 'document': 1, 'second': 6, 'and': 0, 'third': 8, 'one': 5, 'last': 4}

In [5]:
vect.transform(['This is the second document.']).toarray()
array([[0, 1, 0, 1, 0, 0, 1, 1, 0, 1]])

In [6]:
vect.transform(['Something completely new.']).toarray()
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

In [7]:
vect.transform(corpus).toarray()
array([[0, 1, 1, 1, 0, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 0, 2, 1, 0, 1],
       [1, 0, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 1, 1, 1, 0, 0, 0, 1, 0, 1],
       [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]])

CountVectorizer accepts the following arguments to control these steps.

stop_words : string {'english'}, list, or None (default). The stop word list; 'english' uses the built-in English stop words.
analyzer : string {'word', 'char', 'char_wb'} or callable. Build word n-grams, character n-grams, or character n-grams only inside word boundaries.
token_pattern : string. Regular expression that defines a token.
tokenizer : callable or None (default). Function used to generate tokens.
ngram_range : (min_n, max_n) tuple. Range of n-gram sizes.
max_df : int or float in [0.0, 1.0], default 1.0. Maximum document frequency for a word to be included in the vocabulary.
min_df : int or float in [0.0, 1.0], default 1. Minimum document frequency for a word to be included in the vocabulary.

Stop Words

Stop words are words that can be ignored when building the vocabulary from the documents. Typical examples are English articles and conjunctions or Korean particles. They are controlled with the stop_words argument.

In [8]:
vect = CountVectorizer(stop_words=["and", "is", "the", "this"]).fit(corpus)
vect.vocabulary_
{'first': 1, 'document': 0, 'second': 4, 'third': 5, 'one': 3, 'last': 2}

In [9]:
vect = CountVectorizer(stop_words="english").fit(corpus)
vect.vocabulary_
{'document': 0, 'second': 1}

Tokens

The token generator can be selected with the analyzer, tokenizer, and token_pattern arguments.

In [10]:
vect = CountVectorizer(analyzer="char").fit(corpus)
vect.vocabulary_
{'t': 16, 'h': 8, 'i': 9, 's': 15, ' ': 0, 'e': 6, 'f': 7, 'r': 14, 'd': 5, 'o': 13, 'c': 4, 'u': 17, 'm': 11, 'n': 12, '.': 1, 'a': 3, '?': 2, 'l': 10}

In [11]:
vect = CountVectorizer(token_pattern="t\w+").fit(corpus)
vect.vocabulary_
{'this': 2, 'the': 0, 'third': 1}

In [12]:
import nltk

vect = CountVectorizer(tokenizer=nltk.word_tokenize).fit(corpus)
vect.vocabulary_
{'this': 11, 'is': 5, 'the': 9, 'first': 4, 'document': 3, '.': 0, 'second': 8, 'and': 2, 'third': 10, 'one': 7, '?': 1, 'last': 6}

n-grams

The n-gram setting determines the size of the tokens used to build the vocabulary. A unigram (1-gram) uses a single token as a word, while a bigram (2-gram) uses two consecutive tokens as one word.

In [13]:
vect = CountVectorizer(ngram_range=(2, 2)).fit(corpus)
vect.vocabulary_
{'this is': 12, 'is the': 2, 'the first': 7, 'first document': 1, 'the second': 9, 'second second': 6, 'second document': 5, 'and the': 0, 'the third': 10, 'third one': 11, 'is this': 3, 'this the': 13, 'the last': 8, 'last document': 4}

In [14]:
vect = CountVectorizer(ngram_range=(1, 2), token_pattern="t\w+").fit(corpus)
vect.vocabulary_
{'this': 3, 'the': 0, 'this the': 4, 'third': 2, 'the third': 1}

Document frequency

The max_df and min_df arguments can also be used to build the vocabulary based on how many documents a token appears in. Tokens whose document frequency is higher than max_df or lower than min_df are ignored. An integer value is interpreted as an absolute count, a float value as a proportion of the documents.

In [15]:
vect = CountVectorizer(max_df=4, min_df=2).fit(corpus)
vect.vocabulary_, vect.stop_words_
({'this': 3, 'is': 2, 'first': 1, 'document': 0},
 {'and', 'last', 'one', 'second', 'the', 'third'})

In [16]:
vect.transform(corpus).toarray().sum(axis=0)
array([4, 2, 3, 3])

TF-IDF

TF-IDF (Term Frequency - Inverse Document Frequency) encoding does not use raw word counts directly. Instead, a word that occurs in most documents is assumed to have little power to distinguish documents, so its weight is reduced.

Concretely, for a document d and a word t, the score is computed as follows.

tf-idf(d,t)=tf(d,t)โ‹…idf(t)

where

tf(d, t) : term frequency. The number of times the word t appears in document d.
idf(t) : inverse document frequency. A value that is inversely related to the number of documents containing the word t.

idf(t) = log( n / (1 + df(t)) )

n : the total number of documents

df(t) : the number of documents that contain the word t

In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [18]:
tfidv = TfidfVectorizer().fit(corpus)
tfidv.transform(corpus).toarray()
array([[0.        , 0.38947624, 0.55775063, 0.4629834 , 0.        , 0.        , 0.        , 0.32941651, 0.        , 0.4629834 ],
       [0.        , 0.24151532, 0.        , 0.28709733, 0.        , 0.        , 0.85737594, 0.20427211, 0.        , 0.28709733],
       [0.55666851, 0.        , 0.        , 0.        , 0.        , 0.55666851, 0.        , 0.26525553, 0.55666851, 0.        ],
       [0.        , 0.38947624, 0.55775063, 0.4629834 , 0.        , 0.        , 0.        , 0.32941651, 0.        , 0.4629834 ],
       [0.        , 0.45333103, 0.        , 0.        , 0.80465933, 0.        , 0.        , 0.38342448, 0.        , 0.        ]])
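Note that what TfidfVectorizer actually computes by default differs slightly from the simplified formula above: with smooth_idf=True it uses idf(t) = log((1 + n) / (1 + df(t))) + 1, and each row is then L2-normalized. A short sketch reproducing the In [18] output from raw counts under those defaults:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [                       # same toy corpus as above
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
    'The last document?',
]

tf = CountVectorizer().fit_transform(corpus).toarray()   # raw term counts
df = (tf > 0).sum(axis=0)                                # document frequency per term
n = tf.shape[0]

idf = np.log((1 + n) / (1 + df)) + 1          # smooth_idf=True (the scikit-learn default)
tfidf = tf * idf
tfidf = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)   # norm='l2' (default)

print(np.allclose(tfidf, TfidfVectorizer().fit_transform(corpus).toarray()))  # True
```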

Hashing Trick

CountVectorizer does all of its work in memory, so when the document collection grows large it becomes slow or even impossible to run. In that case HashingVectorizer can be used: because it derives each word's index from a hash function, it reduces both memory usage and execution time.

In [19]:
from sklearn.datasets import fetch_20newsgroups
twenty = fetch_20newsgroups()
len(twenty.data)
11314

In [20]:
%time CountVectorizer().fit(twenty.data).transform(twenty.data);
CPU times: user 8.33 s, sys: 120 ms, total: 8.45 s
Wall time: 8.46 s
<11314x130107 sparse matrix of type '<class 'numpy.int64'>' with 1787565 stored elements in Compressed Sparse Row format>

In [21]:
from sklearn.feature_extraction.text import HashingVectorizer
hv = HashingVectorizer(n_features=300000)

In [22]:
%time hv.transform(twenty.data);
CPU times: user 3.97 s, sys: 80 ms, total: 4.05 s
Wall time: 2.49 s
<11314x300000 sparse matrix of type '<class 'numpy.float64'>' with 1786336 stored elements in Compressed Sparse Row format>
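A trade-off worth keeping in mind (not covered in the text above): HashingVectorizer keeps no vocabulary_, so feature names cannot be recovered, and different words can collide into the same hash bucket. A tiny sketch that makes a collision likely by using an unrealistically small n_features:

```python
from sklearn.feature_extraction.text import HashingVectorizer

# n_features=4 is deliberately tiny so that some of the four words
# are very likely to land in the same bucket.
hv = HashingVectorizer(n_features=4, alternate_sign=False)
print(hv.transform(['dog cat bird fish']).toarray())
# There is no hv.vocabulary_ to map the columns back to words.
```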

Example

The following code uses Scikit-Learn's string analyzers to find out how often particular words are used on a web page.

In [23]:
from urllib.request import urlopen
import json
import string
from konlpy.utils import pprint
from konlpy.tag import Hannanum
hannanum = Hannanum()

f = urlopen("https://www.datascienceschool.net/download-notebook/708e711429a646818b9dcbb581e0c10a/")
json = json.loads(f.read())
cell = ["\n".join(c["source"]) for c in json["cells"] if c["cell_type"] == "markdown"]
docs = [w for w in hannanum.nouns(" ".join(cell))
        if ((not w[0].isnumeric()) and (w[0] not in string.punctuation))]

Here each "document" consists of just a single word. When this document collection is processed with CountVectorizer, each document becomes a vector in which exactly one element is 1 and the rest are 0, so summing these vectors gives the word frequencies.

In [24]:
import numpy as np
import matplotlib.pyplot as plt

vect = CountVectorizer().fit(docs)
count = vect.transform(docs).toarray().sum(axis=0)
idx = np.argsort(-count)
count = count[idx]
feature_name = np.array(vect.get_feature_names())[idx]  # use get_feature_names_out() in scikit-learn >= 1.2
plt.bar(range(len(count)), count)
plt.show()

In [25]:
pprint(list(zip(feature_name, count))[:10])
[('컨테이너', 81), ('도커', 41), ('명령', 34), ('이미지', 33), ('사용', 26), ('가동', 14), ('중지', 13), ('mingw64', 13), ('삭제', 12), ('이름', 11)]