Tokenization : tokenize - BD-SEARCH/MLtutorial GitHub Wiki

Tokenize

1) Tokenization

a. Sentence tokenization

nltk.tokenize.sent_tokenize : splits the given text into individual sentences.

Example

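import nltk
from nltk.tokenize import sent_tokenize

# sent_tokenize needs the Punkt tokenizer data. If it is missing, NLTK raises
# a LookupError that names the resource to pass to nltk.download().
text = "a! bc. d. e? f~ g)"
text2 = "hi! my name is soyoung. and you? um~ ex)"
print(sent_tokenize(text))
print(sent_tokenize(text2))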

Result

['a!', 'bc.', 'd. e?', 'f~ g)']
['hi!', 'my name is soyoung.', 'and you?', 'um~ ex)']

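A single letter followed by a period, such as "a." or "d.", is treated as a list label rather than the end of a sentence, which is why "d. e?" stays together in the first result.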
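For contrast, here is a minimal sketch of a purely punctuation-based splitter. naive_sent_split is a hypothetical helper written for this comparison, not part of NLTK. It breaks after every ".", "!" or "?", so it separates "d." from "e?", which sent_tokenize kept together in the result above.

```python
import re

def naive_sent_split(text):
    # Hypothetical helper: split on whitespace that follows '.', '!' or '?'.
    return re.split(r"(?<=[.!?])\s+", text)

print(naive_sent_split("a! bc. d. e? f~ g)"))
# ['a!', 'bc.', 'd.', 'e?', 'f~ g)']
```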

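2) Regular expressions

a. What is a regular expression?

A technique for processing complex strings.

Why is it needed?

Suppose the last 7 digits of each resident registration number in "James 990505-1012345\nTony 940105-1922111" must be changed to *******.
The brute-force way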

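text = "James 990505-1012345\nTony 940105-1922111"
res = []
for t in text.split("\n"):
    # Replace the last 7 characters of each line with asterisks
    res.append(t[:-7] + "*******")
# Print once, after every line has been processed
print("\n".join(res))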

Using a regular expression

import re

text = "James 990505-1012345\nTony 940105-1922111"
# Keep the first 6 digits (group 1) and mask the 7 digits after the hyphen
pat = re.compile(r"(\d{6})[-]\d{7}")
print(pat.sub(r"\g<1>-*******", text))
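The two approaches give the same output on the sample text, but they stop agreeing as soon as a line does not end with the number. A small sketch, assuming a made-up input where "(Seoul)" follows the first number:

```python
import re

# Assumed input: the first line has trailing text after the number.
text = "James 990505-1012345 (Seoul)\nTony 940105-1922111"

# Slicing masks whatever the last 7 characters happen to be.
naive = "\n".join(line[:-7] + "*******" for line in text.split("\n"))

# The pattern finds the number wherever it sits in the line.
pattern = re.compile(r"(\d{6})-\d{7}")
masked = pattern.sub(r"\g<1>-*******", text)

print(naive)   # first line still shows 1012345
print(masked)  # both numbers are masked
```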

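Usage
- [ ] : matches any one of the characters listed inside the brackets
  - [0-9] : 0 through 9
  - [a-zA-Z] : any letter of the alphabet
  - [^0-9] : anything that is not a digit
- Frequently used shorthands (the bracket equivalents hold for ASCII text; Python 3 str patterns also match Unicode digits, letters, and whitespace unless re.ASCII is set)
  - \d - matches a digit; equivalent to [0-9].
  - \D - matches a non-digit; equivalent to [^0-9].
  - \s - matches a whitespace character; equivalent to [ \t\n\r\f\v]. The blank at the front is the space character.
  - \S - matches a non-whitespace character; equivalent to [^ \t\n\r\f\v].
  - \w - matches an alphanumeric character or underscore; equivalent to [a-zA-Z0-9_].
  - \W - matches anything that is not alphanumeric or underscore; equivalent to [^a-zA-Z0-9_].
- dot
  - a.b : a + any single character (exactly one, newline excluded by default) + b
  - a[.]b : a + a literal . + b
- *
  - a* : a repeated n times (n >= 0)
- +
  - a+ : a repeated n times (n > 0)
- {}
  - {n,m} : repeated between n and m times
  - {1,} : same as +
  - {0,} : same as *
  - a{2} : a repeated exactly 2 times
- ?
  - a? : a may appear zero times or once
- Details : https://wikidocs.net/4309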
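The rules above can be checked quickly with re.fullmatch, which succeeds only when the whole string fits the pattern. The pattern/string pairs below are examples chosen for illustration, not taken from the original page.

```python
import re

# (pattern, string, expected full match?)
checks = [
    (r"a.b", "axb", True),           # dot: exactly one character between a and b
    (r"a.b", "ab", False),           # dot does not match "nothing"
    (r"a[.]b", "a.b", True),         # [.] is a literal dot
    (r"a[.]b", "axb", False),
    (r"ca*t", "ct", True),           # *: zero or more
    (r"ca+t", "ct", False),          # +: one or more
    (r"ca{2}t", "caat", True),       # {2}: exactly two
    (r"ca{2,3}t", "caaaat", False),  # {2,3}: four a's is too many
    (r"ab?c", "ac", True),           # ?: optional
    (r"[^0-9]+", "abc", True),       # negated class: no digits
    (r"\d{6}-\d{7}", "990505-1012345", True),
]

results = [re.fullmatch(p, s) is not None for p, s, _ in checks]

for (pattern, string, expected), matched in zip(checks, results):
    print(f"{pattern!r:>16} {string!r:>18} -> {matched}")
```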

b. Stopword handling

Stopwords contribute little to the overall meaning of a sentence.
Removing them is a good way to reduce the search space.
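A minimal sketch of stopword removal. STOPWORDS here is a tiny hand-written set used only for illustration; in practice a fuller list such as NLTK's stopwords corpus (nltk.corpus.stopwords.words("english")) would be the usual choice.

```python
# Hypothetical, deliberately small stopword set for illustration.
STOPWORDS = {"a", "an", "the", "is", "to", "of", "and", "in"}

def remove_stopwords(tokens):
    # Compare in lowercase so "The" is dropped as well as "the".
    return [t for t in tokens if t.lower() not in STOPWORDS]

tokens = "The goal of tokenization is to split a text".split()
print(remove_stopwords(tokens))
# ['goal', 'tokenization', 'split', 'text']
```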

3) Replacing and correcting tokens

Words need to be replaced to eliminate errors. ex) doesn't -> does not

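from nltk.tokenize import word_tokenize

text = "Don't hesitate to ask questions"
replacer = RegexpReplacer()  # assumed: an instance of the RegexpReplacer class defined below

print(word_tokenize(text))  # ['Do', "n't", 'hesitate', 'to', 'ask', 'questions']
print(word_tokenize(replacer.replace(text)))  # ['Do', 'not', 'hesitate', 'to', 'ask', 'questions']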

It is more effective to do the replacement first and tokenize afterwards.
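One way to see why the order matters: once "Don't" has been split into "Do" and "n't", a contraction pattern no longer has a full word to match. The sketch below uses assumed patterns and a hypothetical expand helper, with whitespace splitting standing in for word_tokenize so that it runs without NLTK.

```python
import re

# Assumed contraction patterns, for illustration only.
PATTERNS = [
    (re.compile(r"won't"), "will not"),
    (re.compile(r"can't"), "cannot"),
    (re.compile(r"(\w+)n't"), r"\g<1> not"),
]

def expand(text):
    # Hypothetical helper: apply every pattern in order.
    for pattern, repl in PATTERNS:
        text = pattern.sub(repl, text)
    return text

text = "Don't hesitate to ask questions"

# Replace first, then split into tokens.
replaced_first = expand(text).split()
print(replaced_first)  # ['Do', 'not', 'hesitate', 'to', 'ask', 'questions']

# Tokenize first (tokens as word_tokenize produced them), then replace per token.
tokens_first = [expand(tok) for tok in ["Do", "n't", "hesitate"]]
print(tokens_first)  # ['Do', "n't", 'hesitate'] - "n't" is left as is
```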

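import re
from nltk.corpus import wordnet

# Assumed example patterns; the original replacement_patterns list is not shown.
replacement_patterns = [
    (r"won't", "will not"),
    (r"can't", "cannot"),
    (r"(\w+)n't", r"\g<1> not"),
]

# Expand contracted words
class RegexpReplacer(object):
    def __init__(self, patterns=replacement_patterns):
        # Compile each (regex, replacement) pair once
        self.patterns = [(re.compile(regex), repl) for (regex, repl) in patterns]

    def replace(self, text):
        s = text
        for (pattern, repl) in self.patterns:
            s = pattern.sub(repl, s)
        return s

# Remove repeated characters ex) lottttt -> lot
class RepeatReplacer(object):
    def __init__(self):
        # (\w)\2 matches the same character twice in a row
        self.repeat_regexp = re.compile(r"(\w*)(\w)\2(\w*)")
        self.repl = r"\1\2\3"

    def replace(self, word):
        # Stop as soon as WordNet recognizes the word
        if wordnet.synsets(word):
            return word
        repl_word = self.repeat_regexp.sub(self.repl, word)
        if repl_word != word:
            # One repeated character was removed; try again
            return self.replace(repl_word)
        else:
            return repl_word

# Replace a word with its mapped substitute
class WordReplacer(object):
    def __init__(self, word_map):
        self.word_map = word_map

    def replace(self, word):
        # Fall back to the word itself when it has no mapping
        return self.word_map.get(word, word)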
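To see how the repeat-removal regex and the word map behave without installing the WordNet data, here is a rough standalone sketch. KNOWN_WORDS is a hypothetical stand-in for the wordnet.synsets(word) check, remove_repeats is an illustrative loop version of the recursive replace method, and the word map entries are made up.

```python
import re

REPEAT = re.compile(r"(\w*)(\w)\2(\w*)")

# Hypothetical stand-in for the wordnet.synsets(word) lookup.
KNOWN_WORDS = {"lot", "book", "happy"}

def remove_repeats(word):
    # Each pass drops one repeated character; stop at a known word
    # or when nothing changes any more.
    while word not in KNOWN_WORDS:
        shorter = REPEAT.sub(r"\1\2\3", word)
        if shorter == word:
            break
        word = shorter
    return word

print(remove_repeats("lottttt"))  # lot
print(remove_repeats("happpy"))   # happy
print(remove_repeats("book"))     # book (known word, so "oo" is kept)

# Made-up mapping showing the lookup that WordReplacer.replace performs.
word_map = {"bday": "birthday"}
print(word_map.get("bday", "bday"))    # birthday
print(word_map.get("happy", "happy"))  # happy (no mapping, returned unchanged)
```

Without the known-word check, "book" would be shortened to "bok", which is why the class consults WordNet before removing anything.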