Computer Science and the Written Word - sahajss/knowledge_base GitHub Wiki

Background

Stella's and my research project is a Google Chrome extension that uses an online thesaurus API to suggest synonyms for commonly used words. For example, if the user writes the word “argumentative” in emails all the time, our extension may suggest using the word “bellicose” instead. At the same time, our extension keeps track of which words the user types and how often each one is used.
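To give a feel for the frequency-tracking part, here is a minimal sketch in Python. It is not the extension's actual code; the function name `count_words` and the simple rule for splitting text into words are assumptions made for illustration.

```python
import re
from collections import Counter

def count_words(text: str) -> Counter:
    """Count how often each word appears, ignoring case.

    Assumption: a "word" is any run of letters or apostrophes.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

counts = count_words("That was an argumentative reply to an argumentative email.")
print(counts["argumentative"])  # how many times the user typed this word
```

A word whose count climbs high enough would be a candidate for a synonym suggestion.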

Zipf's Law

Word frequencies in written text follow a power law: the second most used word appears about half as often as the most used word. The third most used word appears about a third as often as the most used word. The fourth most used word appears about a fourth as often, and so on.
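Put another way, the word at rank r is expected to appear about 1/r times as often as the top word. Here is a small Python sketch of that idealized rule. The function name `zipf_expected` and the count of 1200 for the top word are made up for illustration, and real text only follows the pattern approximately.

```python
def zipf_expected(top_count: float, rank: int) -> float:
    """Expected count of the word at `rank`, given the top word's count.

    Assumption: the idealized form of Zipf's Law, count = top_count / rank.
    """
    if rank < 1:
        raise ValueError("rank starts at 1")
    return top_count / rank

# Hypothetical text in which the most used word appears 1200 times.
for rank in range(1, 5):
    print(rank, zipf_expected(1200, rank))
```

For ranks 1 through 4 this gives 1200, 600, 400, and 300 occurrences.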

Zipf's Law appears to hold across languages, even ones that haven't been translated yet. In English, roughly one out of every sixteen words you encounter is the word "the". In the chart below, you can see the relationship between the number of occurrences of a word and its rank.

Hapax Legomenon

A hapax legomenon is a word that occurs only once in a given text or collection of texts.
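As a quick illustration, the Python sketch below lists the hapax legomena in a short sentence. The function name `hapax_legomena` and the way the text is split into words are assumptions for this example only.

```python
import re
from collections import Counter

def hapax_legomena(text: str) -> list[str]:
    """Return the words that occur exactly once in `text`, in alphabetical order."""
    # Assumption: a "word" is any run of letters or apostrophes, ignoring case.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return sorted(word for word, n in counts.items() if n == 1)

print(hapax_legomena("the cat saw the dog and the dog saw a bird"))
# prints ['a', 'and', 'bird', 'cat']
```

Here "the", "saw", and "dog" each appear more than once, so they are left out.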

Relevance to Our Project

We're curious whether our extension will cause Zipf's Law to break down for the words ranked below articles and common verbs/nouns (like "have").

Relevance to You

Now you know a little bit more about how language works! If you ever get a question related to Zipf's Law or Hapax Legomenon on Jeopardy, you will be prepared. :) Here's a video about Zipf's Law if you want to know more: https://www.youtube.com/watch?v=fCn8zs912OE.