NLP Ch2 Accessing Text Corpora and Lexical Resources - PeppermintT/Learning-NLP GitHub Wiki

Text Corpus Structure

Text corpora can be structured in several different ways.

A collection of isolated texts is the simplest kind, e.g. the Gutenberg corpus, which is simply a set of texts that are out of copyright.
Some corpora group their texts into categories. There are many types of categories, e.g. genre, language, source, or author (e.g. the Brown corpus, which is categorized by genre).
Sometimes categories overlap, because a text can fall under two or more categories (e.g. Reuters news articles).
Some corpora have a temporal structure, showing language use over time (e.g. the Inaugural Address corpus for the USA).

Some key functions for accessing corpora (I worked with gutenberg)

Raw

The raw() function gives us the contents of a file as a single string, without any linguistic processing. So, for example, len(gutenberg.raw('blake-poems.txt')) tells us how many characters occur in the text, including the spaces between words (not just the letters). This is termed "the raw content of the corpus".
raw(fileids=[f1,f2,f3]) the raw content of the specified files
raw(categories=[c1,c2]) the raw content of the specified categories
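To make the raw() variants concrete, here is a minimal sketch. It assumes NLTK is installed and, rather than gutenberg, builds a tiny throwaway corpus with PlaintextCorpusReader so that no corpus download is needed. The file names and sample sentences are hypothetical.

```python
import tempfile
from pathlib import Path

from nltk.corpus.reader import PlaintextCorpusReader

# Hypothetical sample texts, standing in for real corpus files.
samples = {
    "a.txt": "The cat sat.",
    "b.txt": "A dog ran off.",
}

with tempfile.TemporaryDirectory() as root:
    for name, text in samples.items():
        Path(root, name).write_text(text, encoding="utf8")

    # Every .txt file under root becomes one file of the corpus.
    corpus = PlaintextCorpusReader(root, r".*\.txt")

    raw_a = corpus.raw("a.txt")                        # one file
    raw_both = corpus.raw(fileids=["a.txt", "b.txt"])  # several files, concatenated
    raw_all = corpus.raw()                             # the whole corpus

# len() of the raw string counts characters, spaces and punctuation included.
print(len(raw_a))
```

With the real Gutenberg corpus the same calls would be made on nltk.corpus.gutenberg, after the corpus data has been downloaded.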

Fileids

fileids() the files of the corpus
fileids([categories]) the files of the corpus corresponding to these categories
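A small sketch of fileids() with and without categories, including a file that belongs to two categories at once (the overlapping case mentioned for Reuters). It assumes NLTK is installed and uses CategorizedPlaintextCorpusReader over a temporary directory. The file names, category names, and the cat_map contents are made up for illustration.

```python
import tempfile
from pathlib import Path

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# Hypothetical mapping from file to categories; r2.txt sits in two categories.
cat_map = {
    "r1.txt": ["grain"],
    "r2.txt": ["grain", "trade"],
    "r3.txt": ["trade"],
}

with tempfile.TemporaryDirectory() as root:
    for name in cat_map:
        Path(root, name).write_text("Sample text.", encoding="utf8")

    corpus = CategorizedPlaintextCorpusReader(root, r".*\.txt", cat_map=cat_map)

    all_files = corpus.fileids()                        # every file in the corpus
    grain_files = corpus.fileids(categories=["grain"])  # files in one category
    trade_files = corpus.fileids(categories="trade")    # a single string also works
    r2_categories = corpus.categories(fileids="r2.txt") # categories of one file

print(all_files)
print(grain_files)
print(r2_categories)
```

Plain (uncategorized) corpora such as gutenberg only support the argument-free form, fileids().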

Sentences

sents() the sentences of the whole corpus
sents(fileids=[f1,f2,f3]) the sentences of the specified fileids
sents(categories=[c1,c2]) the sentences of the specified categories
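sents() returns each sentence as a list of word tokens. The sketch below shows this on a tiny temporary corpus. It assumes NLTK is installed. To avoid needing the Punkt sentence model that corpus readers normally rely on, it swaps in LineTokenizer, which simply treats each line as one sentence. That substitution is an assumption made for this example and is not how gutenberg splits sentences.

```python
import tempfile
from pathlib import Path

from nltk.corpus.reader import PlaintextCorpusReader
from nltk.tokenize import LineTokenizer

# Hypothetical sample file with one sentence per line.
text = "The cat sat.\nA dog ran off.\n"

with tempfile.TemporaryDirectory() as root:
    Path(root, "a.txt").write_text(text, encoding="utf8")

    # Assumption: LineTokenizer stands in for the default Punkt sentence splitter.
    corpus = PlaintextCorpusReader(root, ["a.txt"], sent_tokenizer=LineTokenizer())

    # sents() is read lazily, so copy it into plain lists
    # before the temporary directory is removed.
    sentences = [list(sent) for sent in corpus.sents(fileids=["a.txt"])]

for sent in sentences:
    print(sent)
```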

Words

words() the words of the whole corpus
words(fileids=[f1,f2,f3]) the words of the specified fileids
words(categories=[c1,c2]) the words of the specified categories
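For comparison with raw(), here is a sketch of words() on a tiny temporary corpus, assuming NLTK is installed. The file names and sentences are hypothetical. PlaintextCorpusReader tokenizes with WordPunctTokenizer by default, so punctuation marks come back as separate tokens.

```python
import tempfile
from pathlib import Path

from nltk.corpus.reader import PlaintextCorpusReader

# Hypothetical sample texts, standing in for real corpus files.
samples = {
    "a.txt": "The cat sat.",
    "b.txt": "A dog ran off.",
}

with tempfile.TemporaryDirectory() as root:
    for name, text in samples.items():
        Path(root, name).write_text(text, encoding="utf8")

    corpus = PlaintextCorpusReader(root, r".*\.txt")

    # words() is read lazily, so copy the tokens into lists
    # before the temporary directory is removed.
    words_a = list(corpus.words("a.txt"))  # tokens of one file
    words_all = list(corpus.words())       # tokens of the whole corpus
    chars_a = len(corpus.raw("a.txt"))     # characters in the same file

print(words_a)
# Token count versus character count for the same file.
print(len(words_a), chars_a)
```

Dividing the character count by the token count gives a rough average word length, though the figure is inflated slightly because raw() also counts spaces.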