nltkFirstNote - juedaiyuer/researchNote GitHub Wiki

NLTK: First Notes

1. Installation

# Install with pip
$ sudo pip install nltk

# Install from source
$ sudo python setup.py install

2. Getting Started

You first need to download the "book" collection (for example with nltk.download('book')), then load everything from NLTK's book module:

from nltk.book import *

In an actual session:

>>> import nltk
>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

To look up a text, just type its name at the Python interpreter:

>>> text1
<Text: Moby Dick by Herman Melville 1851>

2.1 Searching Text

text_name.concordance('...') shows every occurrence of a word together with its surrounding context:

>>> text1.concordance('monstrous')

Output:

Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us , 
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But 
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u
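To make the idea of a concordance concrete, here is a minimal pure-Python sketch. It is not NLTK's implementation (which, judging by the output above, works with a fixed character width); the function name `concordance_lines` and the `width` parameter are hypothetical and exist only for this illustration. Matching ignores case, as in the output above where "Monstrous" is listed too.

```python
def concordance_lines(tokens, word, width=3):
    """Return each occurrence of `word` with up to `width` tokens on either side."""
    target = word.lower()
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target:  # case-insensitive match
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append(" ".join(left + [tok] + right))
    return lines

# A tiny hand-made token list standing in for a real text.
tokens = "a most monstrous size and a Monstrous fable of the whale".split()
for line in concordance_lines(tokens, "monstrous"):
    print(line)
```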

To find which words appear in similar contexts, use text_name.similar('...').

The common_contexts function lets us examine the contexts shared by two or more words:

text_name.common_contexts(['...','...'])
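As a rough illustration of what "shared context" means, the sketch below treats a context as the pair (previous token, next token) and intersects the context sets of the given words. This is an assumption made for the example, not a description of NLTK internals; `contexts` and `shared_contexts` are hypothetical helper names.

```python
def contexts(tokens, word):
    """Set of (previous, next) token pairs around each occurrence of `word`.

    Occurrences at the very start or end of the list are skipped because
    they lack a neighbour on one side.
    """
    found = set()
    for i in range(1, len(tokens) - 1):
        if tokens[i] == word:
            found.add((tokens[i - 1], tokens[i + 1]))
    return found

def shared_contexts(tokens, words):
    """Contexts in which every word in the non-empty list `words` occurs."""
    return set.intersection(*[contexts(tokens, w) for w in words])

tokens = "a very big dog and a very small dog saw a big cat".split()
print(shared_contexts(tokens, ["big", "small"]))
```

Here "big" and "small" both occur between "very" and "dog", so that pair is the only context they share.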

Dispersion plot

>>> text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])

Figure: nltk/figure_1.png (the dispersion plot produced by the command above)
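A dispersion plot marks where each word occurs, measured in tokens from the start of the text. The data behind such a plot can be sketched without any plotting library; `word_offsets` is a hypothetical helper for this illustration, not an NLTK function.

```python
def word_offsets(tokens, words):
    """Map each word in `words` to the token positions where it occurs."""
    offsets = {w: [] for w in words}
    for position, tok in enumerate(tokens):
        if tok in offsets:
            offsets[tok].append(position)
    return offsets

# A tiny hand-made token list standing in for a real text.
tokens = "freedom and duties ; citizens value freedom".split()
print(word_offsets(tokens, ["citizens", "freedom"]))
```

Each position would become one mark on the plot's horizontal axis, on the row for that word.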

Generating random text

text_name.generate()

2.2 Counting Vocabulary

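The length of a text from start to finish is counted in the words and punctuation symbols that appear in it. Using the Book of Genesis as an example: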

>>> len(text3)
44764

set(text_name) gives the vocabulary of the text: the set of distinct words that occur in it.

>>> sorted(set(text3))
>>> len(set(text3))
2789

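sorted() returns a sorted list of vocabulary items. The list begins with various punctuation symbols, followed by words starting with A. Capitalized words come before lowercase words. We get the size of the vocabulary indirectly, by counting the number of items in the set.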

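A word type is the unique form or spelling of a word in a text; that is, the word as a single entry in the vocabulary. The 2,789 items we counted include punctuation symbols, so we call them unique item types rather than word types.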
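The ordering described above can be checked on a small hand-made token list (the tokens below are made up for the example, not taken from text3). Python compares strings by character code, which is why "," and "." come first and capitalized words precede lowercase ones.

```python
tokens = ["the", "Lord", ",", "God", "the", "earth", ".", "And", "the", "light"]

vocab = sorted(set(tokens))   # distinct items, sorted by character code
print(vocab)                  # [',', '.', 'And', 'God', 'Lord', 'earth', 'light', 'the']
print(len(tokens))            # 10 tokens in total
print(len(vocab))             # 8 distinct item types, punctuation included
```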

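Count how many times a word occurs in a text, and compute what percentage of the text a specific word takes up. The percentage prints as 0 here because Python 2's / truncates the result when both operands are integers:

>>> text3.count("smote")
5
>>> 100*text3.count("smote")/len(text3)
0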
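The 0 above hides the real value. The sketch below redoes the arithmetic with the two numbers from this session (5 occurrences, 44764 tokens), assuming Python 3, where `/` is true division and `//` is floor division.

```python
count = 5        # text3.count("smote"), from the session above
total = 44764    # len(text3), from the session above

truncated = 100 * count // total   # floor division: what Python 2's / does for ints
percentage = 100 * count / total   # true division under Python 3

print(truncated)              # 0
print(round(percentage, 4))   # 0.0112
```

Under Python 2 the same effect can be had by putting `from __future__ import division` at the top of the session.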

Lexical diversity

>>> def lexical_diversity(text):
...     return len(text)/len(set(text))
...
>>> lexical_diversity(text1)
13
>>> lexical_diversity(text2)
20
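The values 13 and 20 are truncated by integer division in the same way. A variant that always returns a float might look like the sketch below; it keeps the name `lexical_diversity` from above but converts the divisor explicitly, so it behaves the same under Python 2 and Python 3. The sample tokens are made up for the example.

```python
def lexical_diversity(tokens):
    """Average number of times each distinct token is used, as a float."""
    return len(tokens) / float(len(set(tokens)))

sample = ["to", "be", "or", "not", "to", "be"]
print(lexical_diversity(sample))    # 6 tokens / 4 distinct items = 1.5

# With the Genesis figures from section 2.2: 44764 tokens, 2789 distinct items.
print(round(44764 / 2789.0, 2))     # 16.05
```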

2.3 Lists

Python lists are how a text is stored:

>>> sent1 = ['Call', 'me', 'Ishmael', '.']
>>> sent1
['Call', 'me', 'Ishmael', '.']
>>> len(sent1)
4
>>> lexical_diversity(sent1)
1

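Lists can be added together. This special use of addition is called concatenation; it combines several lists into a single list. We can concatenate sentences to build up a text: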

>>> ['Monty', 'Python'] + ['and', 'the', 'Holy', 'Grail']
['Monty', 'Python', 'and', 'the', 'Holy', 'Grail']

Adding a single element to a list is called appending:

>>> sent1.append("Some")
>>> sent1
['Call', 'me', 'Ishmael', '.', 'Some']
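One difference between the two operations is easy to miss: `+` builds a new list and leaves its operands unchanged, while `append` modifies the list in place and returns `None`. A short sketch (the variable names `sent` and `combined` are chosen for this example only):

```python
sent = ['Call', 'me', 'Ishmael', '.']

combined = sent + ['Some', 'years']   # concatenation returns a new list
print(len(sent), len(combined))       # 4 6

result = sent.append('Some')          # append changes sent itself
print(result)                         # None
print(sent)                           # ['Call', 'me', 'Ishmael', '.', 'Some']
```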
