word_tokenize

what is tokenization?

wikipedia:

In computer science, lexical analysis, lexing or tokenization is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of tokens (strings with an assigned and thus identified meaning).


A program that performs lexical analysis may be termed a lexer, tokenizer,[1] or scanner, though scanner is also a term for the first stage of a lexer. A lexer is generally combined with a parser, which together analyze the syntax of programming languages, web pages, and so forth.

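to make the definition concrete, below is a minimal toy lexer for arithmetic expressions; the token names and regexes are illustrative only, not from any particular library:

import re

# a toy lexer: map each token class to a regex (names are illustrative)
TOKEN_SPEC = [
    ('NUMBER', r'\d+'),
    ('IDENT',  r'[A-Za-z_]\w*'),
    ('OP',     r'[+\-*/=]'),
    ('SKIP',   r'\s+'),
]
MASTER = re.compile('|'.join(f'(?P<{name}>{rx})' for name, rx in TOKEN_SPEC))

def tokenize(chars):
    # convert a character sequence into (kind, value) tokens
    for m in MASTER.finditer(chars):
        if m.lastgroup != 'SKIP':
            yield (m.lastgroup, m.group())

>>> list(tokenize('x1 = 42 + y'))
[('IDENT', 'x1'), ('OP', '='), ('NUMBER', '42'), ('OP', '+'), ('IDENT', 'y')]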

nltk.word_tokenize's implementation

api

def word_tokenize(text, language='english', preserve_line=False)  # returns a list of token strings

word_tokenize.__doc__

    Return a tokenized copy of *text*,
    using NLTK's recommended word tokenizer
    (currently an improved :class:`.TreebankWordTokenizer`
    along with :class:`.PunktSentenceTokenizer`
    for the specified language).

    :param text: text to split into words
    :type text: str
    :param language: the model name in the Punkt corpus
    :type language: str
    :param preserve_line: An option to preserve the line and not sentence-tokenize it.
    :type preserve_line: bool
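
a quick sketch contrasting the two modes (assuming punkt is installed; exact token boundaries can vary by NLTK version):

from nltk.tokenize import word_tokenize

text = 'Hello there. How are you?'

# default: punkt splits the text into sentences first, then each
# sentence is word-tokenized, so each sentence-final period becomes
# its own token
word_tokenize(text)
# e.g. ['Hello', 'there', '.', 'How', 'are', 'you', '?']

# preserve_line=True skips the sentence-splitting step; a
# sentence-internal period may then stay attached to its word
word_tokenize(text, preserve_line=True)
# e.g. ['Hello', 'there.', 'How', 'are', 'you', '?']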

example

>>> s1 = word_tokenize('he argued that they needed more time to finish the project')
>>> s1
['he',
 'argued',
 'that',
 'they',
 'needed',
 'more',
 'time',
 'to',
 'finish',
 'the',
 'project']

# 'line1' is assumed to be a line read earlier from a file; the space
# after 'NULL' keeps the sentinel as its own token
s1 = word_tokenize('NULL ' + line1.strip('\n'))

installing nltk's punkt tokenizer

  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')
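
in scripts it is common to guard the download so punkt is fetched only when missing; a minimal sketch:

import nltk

try:
    nltk.data.find('tokenizers/punkt')  # raises LookupError if absent
except LookupError:
    nltk.download('punkt')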

punkt reference

punkt - NLTK Python documentation