# word_tokenize
## What is tokenization?
In computer science, lexical analysis, lexing or tokenization is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of tokens (strings with an assigned and thus identified meaning).
A program that performs lexical analysis may be termed a lexer, tokenizer, or scanner, though scanner is also a term for the first stage of a lexer. A lexer is generally combined with a parser, which together analyze the syntax of programming languages, web pages, and so forth.
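As a toy illustration of the idea (a hand-rolled sketch, not how NLTK tokenizes), a lexer can be a set of regular-expression rules that map character spans to tagged tokens:

```python
import re

# illustrative token rules for a tiny expression language
TOKEN_RULES = [
    ('NUMBER', r'\d+'),
    ('NAME',   r'[A-Za-z_]\w*'),
    ('OP',     r'[+\-*/=]'),
    ('SKIP',   r'\s+'),
]
MASTER_RE = re.compile('|'.join(f'(?P<{name}>{pattern})' for name, pattern in TOKEN_RULES))

def tokenize(text):
    """Yield (token_type, lexeme) pairs for *text*."""
    for match in MASTER_RE.finditer(text):
        if match.lastgroup != 'SKIP':  # drop whitespace between tokens
            yield (match.lastgroup, match.group())

print(list(tokenize('x = 40 + 2')))
# [('NAME', 'x'), ('OP', '='), ('NUMBER', '40'), ('OP', '+'), ('NUMBER', '2')]
```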
## nltk.word_tokenize's implementation
### API
```python
nltk.word_tokenize(text, language='english', preserve_line=False)  # -> list of token strings
```
`word_tokenize.__doc__`:
```
Return a tokenized copy of *text*,
using NLTK's recommended word tokenizer
(currently an improved :class:`.TreebankWordTokenizer`
along with :class:`.PunktSentenceTokenizer`
for the specified language).

:param text: text to split into words
:type text: str
:param language: the model name in the Punkt corpus
:type language: str
:param preserve_line: an option to keep the sentence as-is and not sentence-tokenize it
:type preserve_line: bool
```
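Internally, `word_tokenize` is essentially a composition of the Punkt sentence tokenizer and the Treebank word tokenizer. A paraphrased sketch of that composition (recent NLTK versions use an improved variant of the Treebank tokenizer, but the structure is the same):

```python
from nltk.tokenize import TreebankWordTokenizer, sent_tokenize

_treebank = TreebankWordTokenizer()

def word_tokenize_sketch(text, language='english', preserve_line=False):
    # preserve_line=True skips the sentence split and treats the
    # whole input as a single sentence
    sentences = [text] if preserve_line else sent_tokenize(text, language)
    return [token for sent in sentences
            for token in _treebank.tokenize(sent)]
```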
### Example
```python
>>> from nltk import word_tokenize
>>> word_tokenize('he argued that they needed more time to finish the project')
['he', 'argued', 'that', 'they', 'needed', 'more', 'time', 'to', 'finish', 'the', 'project']
```
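The `preserve_line` flag controls that first sentence-splitting step. A small demo (outputs are what the current NLTK models typically produce):

```python
>>> word_tokenize('He left. He was tired.')
['He', 'left', '.', 'He', 'was', 'tired', '.']
>>> word_tokenize('He left. He was tired.', preserve_line=True)  # no sentence split
['He', 'left.', 'He', 'was', 'tired', '.']
```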
Tokenizing a line read from a file (note the space in `'NULL '` so the sentinel stays a separate token rather than fusing with the first word):

```python
s1 = word_tokenize('NULL ' + line1.strip('\n'))
```
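For context, `line1` here presumably comes from iterating over a file; a hedged reconstruction of that pattern (the file name `corpus.txt` and the `NULL` sentinel are assumptions, not from NLTK):

```python
from nltk import word_tokenize

# hypothetical usage: tokenize each line of a corpus file,
# prefixing a 'NULL' sentinel token to every line
with open('corpus.txt', encoding='utf-8') as f:
    for line1 in f:
        tokens = word_tokenize('NULL ' + line1.strip('\n'))
        print(tokens)
```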
### Installing the punkt tokenizer
Calling `word_tokenize` without the Punkt sentence models installed raises a `LookupError`:

```
Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:

>>> import nltk
>>> nltk.download('punkt')
```
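To avoid the interactive step, a common pattern (a sketch, not required by NLTK) is to check for the resource at startup and download it only when missing:

```python
import nltk

try:
    nltk.data.find('tokenizers/punkt')  # raises LookupError if the models are absent
except LookupError:
    nltk.download('punkt')              # one-time fetch into the NLTK data directory
```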