nltkFirstNote - juedaiyuer/researchNote GitHub Wiki
NLTK Introductory Notes
1. Installation
# Install with pip
$ sudo pip install nltk
# Install from source
$ sudo python setup.py install
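2. Getting Started
The book collection must be downloaded first (it is listed under Collections in the NLTK downloader). After that, load everything from NLTK's book module: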
from nltk.book import *
Trying it out:
>>> import nltk
>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
To look up a text, just type its name at the Python interpreter:
>>> text1
<Text: Moby Dick by Herman Melville 1851>
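2.1 Searching Text
A concordance shows every occurrence of a word together with its surrounding context: text_name.concordance('...')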
>>> text1.concordance('monstrous')
Output:
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us ,
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u
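The idea behind a concordance view can be sketched in plain Python. This is only an illustration, not NLTK's implementation: the concordance function, its width parameter, and the toy token list are assumptions made for this example. Matching ignores case, as in the output above, where both "monstrous" and "Monstrous" are listed.

```python
def concordance(tokens, word, width=3):
    """Return (left context, match, right context) for each occurrence of `word`."""
    target = word.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target:  # case-insensitive match
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines

# Toy token list (hypothetical, not loaded from an NLTK corpus).
tokens = "a most monstrous size and a Monstrous fable indeed".split()
hits = concordance(tokens, "monstrous")
for left, token, right in hits:
    # Right-align the left context so the matches line up in one column.
    print(f"{left:>20} {token} {right}")
```

Aligning the matched word in a fixed column is what makes the surrounding contexts easy to compare at a glance.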
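To find which words appear in similar contexts, use text_name.similar('...')
The common_contexts function lets us examine the contexts shared by two or more words:
text_name.common_contexts(['...', '...'])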
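A rough sketch of what "shared context" means, assuming a context is simply the pair (previous token, next token). The helper names and the toy sentence are hypothetical, and NLTK's own method may normalize and rank results differently.

```python
def contexts(tokens, word):
    """Collect the (previous, next) token pairs that surround `word`."""
    found = set()
    for i in range(1, len(tokens) - 1):
        if tokens[i] == word:
            found.add((tokens[i - 1], tokens[i + 1]))
    return found

def common_contexts(tokens, words):
    """Return the contexts in which every word in `words` occurs."""
    shared = contexts(tokens, words[0])
    for word in words[1:]:
        shared &= contexts(tokens, word)  # keep only contexts seen for all words
    return shared

# Toy token list (hypothetical).
tokens = ("a monstrous pretty hat , a very pretty day , am monstrous glad , "
          "am very glad , a very big house").split()
shared = common_contexts(tokens, ["monstrous", "very"])
print(sorted(shared))
```

Here "a very big house" gives a context for "very" only, so the pair (a, big) is dropped by the intersection.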
Dispersion plot
>>> text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
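A dispersion plot marks where each word occurs, measured as a token offset from the start of the text. The data behind such a plot can be sketched without any plotting library. The word_offsets helper and the toy token list are assumptions made for this illustration.

```python
def word_offsets(tokens, targets):
    """Map each target word to the token positions where it occurs."""
    offsets = {target: [] for target in targets}
    for position, token in enumerate(tokens):
        if token in offsets:
            offsets[token].append(position)
    return offsets

# Toy token list (hypothetical, not the Inaugural Address Corpus).
tokens = "citizens value freedom and freedom needs citizens".split()
offsets = word_offsets(tokens, ["citizens", "freedom", "duties"])
print(offsets)
```

A word that never occurs, such as "duties" here, keeps an empty list, which corresponds to an empty row in the plot.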
Generating random text
text_name.generate()
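2.2 Counting Vocabulary
The length of a text from start to finish is counted in units of the words and punctuation symbols that appear in it. The example uses the Book of Genesis: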
>>> len(text3)
44764
set(text_name) gives the vocabulary of a text: the set of distinct words that occur in it.
>>> sorted(set(text3))
>>> len(set(text3))
2789
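sorted() returns a sorted list of vocabulary items. The list starts with punctuation symbols, followed by words beginning with A; capitalized words sort before lowercase words. We obtain the size of the vocabulary indirectly, by counting the items in the set.
A word type is the form or spelling of a word independently of its specific occurrences in a text; that is, the word considered as a unique item of vocabulary. The 2,789 items counted here include punctuation symbols, so we call them unique item types rather than word types.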
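The ordering and the difference between token count and type count can be checked on a small hand-tokenized sentence. The token list below is a toy example typed in by hand, not loaded from text3.

```python
# Hand-tokenized toy sentence; the final "." counts as a token too.
tokens = ["In", "the", "beginning", "God", "created",
          "the", "heaven", "and", "the", "earth", "."]

# sorted() orders strings by character code: punctuation first,
# then capitalized words, then lowercase words.
vocabulary = sorted(set(tokens))
print(vocabulary)

# Tokens count every occurrence; types count each distinct item once.
print(len(tokens), len(vocabulary))
```

The word "the" contributes three tokens but only one type, which is why the vocabulary is smaller than the text.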
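Count how many times a word occurs in a text, then compute the percentage of the text taken up by that word. The session below ran under Python 2, where / between integers performs integer division, so the percentage is truncated to 0: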
>>> text3.count("smote")
5
>>> 100*text3.count("smote")/len(text3)
0
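Under Python 3 the / operator performs true division, so the same expression yields a small fraction rather than 0. A minimal sketch, assuming Python 3 and reusing the two figures quoted above (5 occurrences in 44764 tokens); the percentage helper is a name introduced for this example.

```python
def percentage(count, total):
    """Share of `count` in `total`, as a floating-point percentage."""
    return 100 * count / total

# Figures from the session above: "smote" occurs 5 times among 44764 tokens.
share = percentage(5, 44764)
print(round(share, 4))

# Floor division (//) reproduces the truncated result shown above.
print(100 * 5 // 44764)
```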
Lexical diversity
>>> def lexical_diversity(text):
...     return len(text)/len(set(text))
...
>>> lexical_diversity(text1)
13
>>> lexical_diversity(text2)
20
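The whole-number results above come from Python 2 integer division. Under Python 3 the same definition returns a fractional value, which reads as the average number of times each distinct item is used. A small sketch on a toy token list (hypothetical, not one of the book texts):

```python
def lexical_diversity(tokens):
    """Average number of occurrences per distinct token (tokens / types)."""
    return len(tokens) / len(set(tokens))

# Toy token list: 8 tokens, 6 distinct types.
tokens = ["the", "cat", "saw", "the", "dog", "and", "the", "bird"]
diversity = lexical_diversity(tokens)
print(round(diversity, 2))
```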
2.3 Lists
Python lists are the way texts are stored:
>>> sent1 = ['Call', 'me', 'Ishmael', '.']
>>> sent1
['Call', 'me', 'Ishmael', '.']
>>> len(sent1)
4
>>> lexical_diversity(sent1)
1
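Lists support addition. This special use of addition is called concatenation: it combines several lists into a single list. We can concatenate sentences to build up a text: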
>>> ['Monty', 'Python'] + ['and', 'the', 'Holy', 'Grail']
['Monty', 'Python', 'and', 'the', 'Holy', 'Grail']
Adding a single element to a list is called appending:
>>> sent1.append("Some")
>>> sent1
['Call', 'me', 'Ishmael', '.', 'Some']
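One difference between the two operations is worth seeing side by side: concatenation builds a new list and leaves its operands alone, while append changes the list in place and returns None. A short sketch; the extra list is a made-up continuation used only for illustration.

```python
sent1 = ["Call", "me", "Ishmael", "."]
extra = ["Some", "years", "ago"]  # hypothetical continuation

# Concatenation: a new list is created, sent1 is unchanged.
combined = sent1 + extra
print(len(sent1), len(combined))

# Append: sent1 itself is modified, and the call returns None.
result = sent1.append("Some")
print(result, sent1)
```

Because append returns None, writing sent1 = sent1.append("Some") would discard the list.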
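Sources
- Explore Python, machine learning, and the NLTK library (探索 Python、机器学习和 NLTK 库)
- Natural Language Processing with Python, Chinese edition (用Python进行自然语言处理(中文).pdf), Chapter 1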