Project 1: Language Modeling Report

  1. Unsmoothed n-grams. For this part, we implement unsmoothed unigram and bigram language models using the given text corpus. We train a separate language model for each genre of books, treating each genre as its own corpus. We use Natural Language Toolkit (NLTK) 3.0 to tokenize the raw text into words:
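A minimal sketch of this step, assuming NLTK 3.0 with the `punkt` tokenizer models available (the corpus file name is a placeholder):

```python
import nltk

# nltk.download('punkt')  # one-time download of the tokenizer models

# Read the whole corpus at once so that bigrams spanning
# line breaks are not lost (see the discussion below).
with open("corpus.txt", encoding="utf-8") as f:  # placeholder file name
    raw_text = f.read()

tokens = nltk.word_tokenize(raw_text)
```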
The code above produces a list of tokens from the corpus. To build the unsmoothed unigram language model, two values are computed for each token: its frequency in the corpus, count(token), and the total number of words in the corpus, count(words). The probability of each token is then:

P(token) = count(token) / count(words)

The unsmoothed bigram language model is built in a similar way: we count the occurrences of every bigram in the corpus and divide each count by the total occurrences of the bigram's first word:

P(w2 | w1) = count(w1 w2) / count(w1)

At first we processed the corpus line by line. Although the results met our expectations, this approach misses bigrams that cross line boundaries, so we switched to reading the whole corpus at once. This modification slightly improved the final genre classification result.

Data Structure. We use a dictionary to store tokens and their probabilities as key-value pairs, for both the unigram and the bigram language model.
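A minimal sketch of both models under these choices (the function names and the use of `collections.Counter` are ours, not taken from the original code):

```python
from collections import Counter

def unigram_model(tokens):
    """Unsmoothed unigram model: P(token) = count(token) / count(words)."""
    counts = Counter(tokens)
    total = len(tokens)
    return {token: c / total for token, c in counts.items()}

def bigram_model(tokens):
    """Unsmoothed bigram model: P(w2 | w1) = count(w1 w2) / count(w1)."""
    unigram_counts = Counter(tokens)
    # Adjacent token pairs over the whole corpus, so bigrams
    # that cross line boundaries are included.
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    return {(w1, w2): c / unigram_counts[w1]
            for (w1, w2), c in bigram_counts.items()}
```

Both functions return plain dictionaries keyed by token (or token pair), matching the key-value storage described above.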
  2. Random Sentence Generation