NLP Ch2 Accessing Text Corpora and Lexical Resources - PeppermintT/Learning-NLP GitHub Wiki
Text Corpus Structure
Text corpora can be structured in several different ways.
- A plain collection of texts is the simplest kind, e.g. the Gutenberg corpus, which is simply a set of texts that are out of copyright.
- Some corpora group their texts into categories such as genre, language, source or author (e.g. the Brown corpus, which is categorized by genre).
- Sometimes categories overlap, because an item can fall under two or more categories (e.g. Reuters news articles).
- Temporal structures show language use over time (e.g. the Inaugural Address corpus of US presidential speeches).
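To make the idea of overlapping categories concrete, here is a minimal pure-Python sketch. The file IDs and category labels are made up for illustration, and this is not how NLTK stores things internally; it only shows that one file can sit under several categories and that the mapping can be read in both directions.

```python
from collections import defaultdict

# Hypothetical file IDs and category labels, invented for this sketch.
file_to_categories = {
    "news/0001": ["grain", "trade"],
    "news/0002": ["trade"],
    "news/0003": ["grain", "wheat"],
}

# Invert the mapping so each category lists the files it contains.
category_to_files = defaultdict(list)
for fileid, cats in file_to_categories.items():
    for cat in cats:
        category_to_files[cat].append(fileid)

# Files carrying more than one label are where the categories overlap.
overlapping = sorted(f for f, cats in file_to_categories.items() if len(cats) > 1)

print(dict(category_to_files))
print(overlapping)
```

Looking up files by category and categories by file are the two directions that the fileids() and categories() functions below expose.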
Some key functions for accessing corpora (I worked with gutenberg)
Raw
- The raw() function gives us the contents of a file without any linguistic processing; this is termed "the raw content of the corpus". So, for example, len(gutenberg.raw('blake-poems.txt')) tells us how many characters occur in the text, counting not only letters but also punctuation and the spaces between words.
- raw(fileids=[f1,f2,f3]) the raw content of the specified files
- raw(categories=[c1,c2]) the raw content of the specified categories
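A small sketch of what raw() returns. To keep it runnable without downloading any NLTK data, it builds a one-file throwaway corpus with PlaintextCorpusReader; the file name toy.txt and its text are my own inventions. The same raw() call should behave the same way on gutenberg once that corpus has been downloaded.

```python
import os
import tempfile

from nltk.corpus.reader import PlaintextCorpusReader

# Throwaway one-file corpus; the file name and text are made up.
root = tempfile.mkdtemp()
text = "The cat sat. It purred.\n"
with open(os.path.join(root, "toy.txt"), "w", encoding="utf-8", newline="") as f:
    f.write(text)

corpus = PlaintextCorpusReader(root, r".*\.txt")

# raw() hands back the file as one plain string, with no tokenization.
raw = corpus.raw("toy.txt")

# len() counts every character: letters, spaces, punctuation and the newline.
n_chars = len(raw)

# Counting only alphabetic characters gives a smaller number.
n_letters = sum(ch.isalpha() for ch in raw)

print(n_chars, n_letters)
```

The gap between the two counts is why len(raw) is better described as a character count than a letter count.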
Fileids
- fileids() the files of the corpus
- fileids([categories]) the files of the corpus corresponding to these categories
Categories
- categories() the categories of the corpus
- categories([fileids]) the categories of the corpus corresponding to these files
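The category-based calls only make sense on a categorized corpus (Brown or Reuters rather than Gutenberg, as far as I can tell). The sketch below avoids any data download by building a tiny categorized corpus in a temporary directory; the file names, texts and the cat_map labels are all made up for illustration.

```python
import os
import tempfile

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# Throwaway corpus; file names, text and categories are invented.
root = tempfile.mkdtemp()
for name in ("a.txt", "b.txt", "c.txt"):
    with open(os.path.join(root, name), "w", encoding="utf-8") as f:
        f.write("Some text.\n")

# cat_map gives each file a list of categories; a.txt belongs to two.
corpus = CategorizedPlaintextCorpusReader(
    root,
    r".*\.txt",
    cat_map={"a.txt": ["news", "sport"], "b.txt": ["news"], "c.txt": ["sport"]},
)

all_files = list(corpus.fileids())                      # every file in the corpus
news_files = list(corpus.fileids(categories=["news"]))  # files in the given categories
all_cats = list(corpus.categories())                    # every category in the corpus
cats_of_a = list(corpus.categories(fileids=["a.txt"]))  # categories of the given files

print(all_files, news_files, all_cats, cats_of_a)
```

Because a.txt carries two labels, it is returned for both "news" and "sport", which is the overlapping-category situation described at the top of the page.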
Sentences
- sents() the sentences of the whole corpus
- sents(fileids=[f1,f2,f3]) the sentences of the specified fileids
- sents(categories=[c1,c2]) the sentences of the specified categories
Words
- words() the words of the whole corpus
- words(fileids=[f1,f2,f3]) the words of the specified fileids
- words(categories=[c1,c2]) the words of the specified categories
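A sketch of how words() and sents() differ in what they return, again on a made-up one-file corpus so nothing needs downloading. One loud assumption: NLTK's default sentence splitter is the Punkt model, which needs its own data download, so here a LineTokenizer stands in for it and the toy file simply puts one sentence per line. On the real gutenberg corpus the calls are the same but the sentence boundaries come from Punkt.

```python
import os
import tempfile

from nltk.corpus.reader import PlaintextCorpusReader
from nltk.tokenize import LineTokenizer

# Throwaway corpus with one sentence per line; name and text are made up.
root = tempfile.mkdtemp()
with open(os.path.join(root, "toy.txt"), "w", encoding="utf-8") as f:
    f.write("The cat sat.\nIt purred.\n")

# Assumption: a line splitter replaces the default Punkt sentence tokenizer.
corpus = PlaintextCorpusReader(root, r".*\.txt", sent_tokenizer=LineTokenizer())

# words() gives a flat sequence of tokens; punctuation is a token of its own.
words = list(corpus.words("toy.txt"))

# sents() gives a sequence of sentences, each one a list of tokens.
sents = [list(s) for s in corpus.sents("toy.txt")]

print(words)
print(sents)
```

Flattening sents() gives back the same tokens as words(), so the two views differ only in whether sentence boundaries are kept.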