N gram - sagr4019/ResearchProject GitHub Wiki

N-gram

N-grams are used in text processing. A text is split into fragments (e.g. the words of a sentence or the letters of a word), and an n-gram is a sequence of n consecutive fragments. N-grams are used to analyze texts and to predict the next fragment. They appear in many applications, such as machine translation and spelling correction.

  • n=1 is called "unigrams"
  • n=2 is called "bigrams"
  • n=3 is called "trigrams"
  • n=4, n=5, ... are usually called four-grams, five-grams, ...

For example, take the sentence "To be or not to be.".

Bigrams (n=2) would look like this:

"To be"
"be or"
"or not"
...

Trigrams (n=3) would look like this:

"To be or"
"be or not"
"or not to"
...
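
The bigram and trigram lists above can be produced by sliding a window of size n over the fragments. The following is a minimal Python sketch, not a reference implementation: the function name `ngrams` and the simple tokenization (strip the trailing period, split on whitespace) are assumptions made for illustration.

```python
def ngrams(fragments, n):
    """Return every run of n consecutive fragments as a tuple."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence of length L contains L - n + 1 windows of size n
    # (none at all if the sequence is shorter than n).
    return [tuple(fragments[i:i + n]) for i in range(len(fragments) - n + 1)]


sentence = "To be or not to be."
# Assumed tokenization: drop the final period, then split on whitespace.
words = sentence.rstrip(".").split()

bigrams = [" ".join(gram) for gram in ngrams(words, 2)]
trigrams = [" ".join(gram) for gram in ngrams(words, 3)]

print(bigrams)   # ['To be', 'be or', 'or not', 'not to', 'to be']
print(trigrams)  # ['To be or', 'be or not', 'or not to', 'not to be']

# The same function works on letters, because a string is a sequence of characters.
letter_bigrams = ["".join(gram) for gram in ngrams("not", 2)]
print(letter_bigrams)  # ['no', 'ot']
```

Note that "To be" and "to be" count as different bigrams here because the sketch does not lowercase the text; whether to normalize case depends on the application.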

References

https://de.wikipedia.org/wiki/N-Gramm

https://en.wikipedia.org/wiki/N-gram

http://text-analytics101.rxnlp.com/2014/11/what-are-n-grams.html