stopwords for konlpy - hexists/konlpy GitHub Wiki

stopwords for konlpy

The stopwords PR has not been merged yet.
This post will be more useful once the PR is merged, so I am holding off on sharing it widely for now.

stopwords

Stopwords (불용어 in Korean) are words that are left out of text analysis because they carry little meaning.
They are typically words that occur commonly across documents, such as articles, prepositions, and particles.
Examples are 'a' and 'the' in English, and '이' and '가' in Korean.

How well stopwords are handled affects the quality of an analysis. When word frequency is an important factor in the analysis, excluding high-frequency stopwords gives more meaningful results.
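As a rough illustration of the frequency point, the sketch below counts tokens with and without a stopword filter. The tiny stopword set, the sample sentence, and the `top_terms` helper are all made up for this example.

```python
from collections import Counter

# Illustrative stopword set; a real list would be much longer.
STOPWORDS = {"a", "the", "of", "and", "on"}

def top_terms(text, n=3, stopwords=frozenset()):
    """Return the n most frequent lowercase tokens that are not stopwords."""
    tokens = text.lower().split()
    counts = Counter(t for t in tokens if t not in stopwords)
    return [word for word, _ in counts.most_common(n)]

text = "the cat sat on the mat and the dog sat on the cat"
raw = top_terms(text)                            # dominated by "the"
filtered = top_terms(text, stopwords=STOPWORDS)  # content words only
```

Without the filter, the most frequent term is a function word. With it, the ranking reflects the words that describe the text.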

Stopwords are usually defined to suit the input documents, but a predefined list makes the analysis easier.

NLTK, a widely used natural language processing toolkit, provides 2,400 stopwords for 11 languages. These predefined stopwords are used to handle words that carry little meaning in an analysis. Unfortunately, NLTK does not provide Korean stopwords, which led me to think it would be good to provide stopwords for Korean.

Developing Korean stopwords

The Korean stopwords were developed in three main steps.

1) Survey how NLTK provides stopwords
2) Survey and collect Korean stopword resources
3) Develop the stopwords feature

First, I checked how NLTK provides stopwords. Next, I surveyed and collected resources for building Korean stopwords. Finally, I analyzed the konlpy code and developed the feature that provides stopwords in konlpy.

Survey of how NLTK provides stopwords

In NLTK, stopwords can be used as follows.

>>> from nltk.corpus import stopwords  
>>> stopwords.words('english')

This call returns the stopwords as a list. Users then add a little code of their own to use them in whatever form they need. Typical uses are to tokenize the input and drop the tokens that appear in the stopword list, or to penalize tokens that contain stopwords.
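The two uses mentioned above can be sketched roughly as follows. To keep the example self-contained, a small hard-coded set stands in for the list returned by `stopwords.words('english')`. The helper names and the penalty value of 0.1 are assumptions for illustration, not part of NLTK.

```python
# Small hard-coded set standing in for stopwords.words('english').
STOPWORDS = {"a", "an", "the", "is", "of"}

def remove_stopwords(tokens, stopwords):
    """Drop every token that appears in the stopword list."""
    return [t for t in tokens if t.lower() not in stopwords]

def weight_tokens(tokens, stopwords, penalty=0.1):
    """Keep every token, but give stopwords a reduced weight."""
    return [(t, penalty if t.lower() in stopwords else 1.0) for t in tokens]

tokens = "The quality of a stopword list is important".split()
kept = remove_stopwords(tokens, STOPWORDS)
weighted = weight_tokens(tokens, STOPWORDS)
```

Removal is the simpler option. Weighting keeps the original token sequence intact, which can matter when later steps depend on token positions.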

The stopwords live in NLTK's corpus package (nltk.corpus). Several kinds of corpora are available there, and the Stopwords Corpus is one of them. Its source is listed as Porter et al, so it appears to have been taken from the Porter Stemmer. (This still needs to be verified.)

Corpus	Compiler	Contents
Stopwords Corpus	Porter et al	2,400 stopwords for 11 languages

Survey and collection of Korean stopword resources

To provide Korean stopwords, I searched for and collected a variety of resources. The main ones are listed below. There were other resources as well, but they duplicated these.

source	unit	license
bab2min	morph	Permission granted by the author
ranks.nl	word	Presumed MIT
spikeekips gist	word	Permission granted by the author
6 github	word	Apache 2.0
stopwords-iso github	word	MIT
many-stop-words	word	MIT

bab2min's stopwords are provided at the morpheme level, while the other stopword lists were collected at the word level. They are stored in different formats and overlap heavily, so a preprocessing step is needed before they can be used as a single list.

  • bab2min stopwords

    이	VCP	0.018279601
    있	VA	0.011699048
    하	VV	0.009773658
    ...
    
  • ranks.nl stopwords

    아
    휴
    아이구
    ...
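A minimal sketch of such a preprocessing step is shown below. It assumes the morpheme-level source uses whitespace-separated `morpheme tag frequency` fields and the word-level sources have one word per line, as in the samples above. The function names and the second word-level sample are hypothetical, and this is not the actual preprocessing program.

```python
def parse_morph_lines(lines):
    """Turn 'morpheme tag frequency' lines into 'morpheme/tag' strings."""
    entries = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:  # skip blank or malformed lines
            entries.append(f"{fields[0]}/{fields[1]}")
    return entries

def parse_word_lines(lines):
    """Read one word per line, ignoring blank lines."""
    return [line.strip() for line in lines if line.strip()]

def merge_unique(*sources):
    """Concatenate sources, dropping duplicates and keeping first-seen order."""
    return list(dict.fromkeys(item for source in sources for item in source))

morph_source = ["이\tVCP\t0.018279601", "있\tVA\t0.011699048", "하\tVV\t0.009773658"]
word_source_a = ["아", "휴", "아이구"]
word_source_b = ["아이구", "", "아", "어"]  # hypothetical second source with overlap

morphs = parse_morph_lines(morph_source)
words = merge_unique(parse_word_lines(word_source_a), parse_word_lines(word_source_b))
```

Keeping the morpheme-level and word-level lists separate avoids mixing entries that carry a tag with entries that do not.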
    

To use the resources above, I checked their licenses or obtained the authors' permission, and confirmed that there is no problem using them in konlpy. I would like to take this opportunity to thank bab2min and spikeekips for readily allowing the use of their stopwords.

Developing the stopwords feature

Since it seemed best for konlpy to follow the form users are already familiar with, I developed the feature in the same form NLTK provides. I defined the function prototypes and implemented the few features that were needed.

For usability, it seemed best to provide both word-level and morpheme-level stopwords, so both forms are supported. Because each morphological analyzer uses a different set of part-of-speech tags, the morpheme-level stopwords are prepared per analyzer, and the caller specifies which analyzer to use.

>>> from konlpy.corpus import stopwords
>>> stopwords.words()
>>> stopwords.morphs(analyzer='kkma')

I also developed a way for users to add and remove stopwords as needed.

>>> stopwords.include('word', ['헐', '네'])
>>> stopwords.exclude('word', ['진짜'])

These features are implemented through a class named stopwords in konlpy.corpus.
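To make the shape of that interface concrete, here is a minimal stand-in written for this page. It is not the code in the PR: the class name `StopwordsSketch`, its constructor, and the restriction to `unit='word'` are assumptions. Only the method names and call forms (`words`, `morphs`, `include`, `exclude`) follow the calls shown on this page.

```python
class StopwordsSketch:
    """Hypothetical stand-in for the stopwords class; not konlpy's actual code."""

    def __init__(self, words, morphs_by_analyzer):
        self._words = list(words)
        # One morpheme-level list per analyzer, because tag sets differ.
        self._morphs = {name: list(items) for name, items in morphs_by_analyzer.items()}

    def words(self):
        """Return a copy of the word-level stopwords."""
        return list(self._words)

    def morphs(self, analyzer):
        """Return the morpheme-level stopwords prepared for one analyzer."""
        if analyzer not in self._morphs:
            raise ValueError(f"no stopwords prepared for analyzer {analyzer!r}")
        return list(self._morphs[analyzer])

    def include(self, unit, items):
        """Add items that are not already present."""
        self._check_unit(unit)
        for item in items:
            if item not in self._words:
                self._words.append(item)

    def exclude(self, unit, items):
        """Remove the given items if present."""
        self._check_unit(unit)
        drop = set(items)
        self._words = [w for w in self._words if w not in drop]

    @staticmethod
    def _check_unit(unit):
        # Assumption: the first argument names the unit; only 'word' is sketched.
        if unit != "word":
            raise ValueError("this sketch only handles unit='word'")


stopwords = StopwordsSketch(
    words=["!", '"', "$", "진짜"],  # sample entries only
    morphs_by_analyzer={"kkma": ["가/VV", "가지/VV", "같/VA"]},
)
stopwords.include("word", ["헐", "네"])
stopwords.exclude("word", ["진짜"])
```

Returning copies from `words()` and `morphs()` keeps callers from modifying the stored lists by accident, so changes go through `include` and `exclude`.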

Separately, I wrote a preprocessing program that merges the collected stopwords into a single file usable by konlpy, and another program that converts part-of-speech tags into those used by each analyzer so that morpheme-level stopwords can be provided. (I could not judge whether these programs belong in konlpy, so they are kept in a separate repo for now. I have asked the maintainer about this.)
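The tag conversion could look roughly like the sketch below. The mapping table is purely illustrative: the target tag names and the `convert_tags` helper are assumptions, and a real table would have to come from each analyzer's tag set documentation.

```python
# Illustrative mapping for a hypothetical analyzer; not a verified tag table.
EXAMPLE_TAG_MAP = {"VV": "Verb", "VA": "Adjective"}

def convert_tags(entries, tag_map):
    """Rewrite 'morpheme/tag' entries using tag_map.

    Returns (converted, unmapped) so entries without a known target tag
    can be reviewed by hand instead of being silently dropped.
    """
    converted, unmapped = [], []
    for entry in entries:
        morph, _, tag = entry.rpartition("/")
        if morph and tag in tag_map:
            converted.append(f"{morph}/{tag_map[tag]}")
        else:
            unmapped.append(entry)
    return converted, unmapped

entries = ["가/VV", "같/VA", "이/VCP"]
converted, unmapped = convert_tags(entries, EXAMPLE_TAG_MAP)
```

Reporting unmapped entries separately is a design choice in this sketch, since tag sets do not always correspond one to one.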

Using stopwords

Using stopwords in konlpy is very simple.

  • Loading stopwords

    >>> from konlpy.corpus import stopwords
    
  • Using stopwords (word level)

    >>> stopwords.words()
    ['!', '"', '$', ... ]
    
  • Using stopwords (morpheme level)

    >>> stopwords.morphs(analyzer='kkma')
    ['가/VV', '가지/VV', '같/VA', ... ]
    
  • Adding stopwords

    >>> stopwords.include('word', ['헐', '네'])
    
  • Excluding stopwords

    >>> stopwords.exclude('word', ['진짜'])
    

Further Work

  • Write examples that use the stopwords. I would like to increase adoption through posts that explain how to use them and what the benefits are.
  • Collect stopwords from a wider variety of documents to strengthen the list.

References