Query Language

NGRAMS' query language is similar to a very simple form of regular expression. A query, formed as a sequence of terms and query operators, is matched against all indexed ngrams in the selected corpus.

The query

how * you *

returns

2468812   How do you know
 845786   How do you feel
 830062   How did you know
 755042   How are you ?
 739312   How do you do
...

Results are sorted by the ngram's total match count in descending order. See Data Model for details.

Because the underlying ngrams are between 1 and 5 terms long, a query must be matchable within this range. For example, the 6-term query one two three four five six will never match. Sometimes NGRAMS splits a term further to match how Google tokenized the underlying ngrams. In this case a query can exceed the limit of 5 terms after splitting. More about that in Tokenization.
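The length constraint can be checked before sending a query. The following Python sketch is illustrative only and not part of NGRAMS: the names MAX_NGRAM_TERMS, min_terms and can_fit are hypothetical, the query is assumed to be whitespace-separated, and only plain terms, * and ** (see Query Operators) are handled. Alternations, term groups and auto-tokenization are ignored.

```python
MAX_NGRAM_TERMS = 5  # ngrams in the dataset are 1 to 5 terms long


def min_terms(query: str) -> int:
    """Smallest number of terms an ngram needs in order to match the query.

    Every term counts as one, except ** which may match zero terms.
    """
    return sum(1 for term in query.split() if term != "**")


def can_fit(query: str) -> bool:
    """True if the query could match an ngram of at most 5 terms."""
    return min_terms(query) <= MAX_NGRAM_TERMS


print(can_fit("how * you *"))
print(can_fit("one two three four five six"))
```

The first call reports that the query fits, the second that it does not, because six mandatory terms cannot match any ngram of at most five terms.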

Query Operators

Here is the complete list of query operators.

Operator  Name           Description                                                   Example
*         Star           Matches one term.                                             what * day
**        StarStar       Matches zero or more terms.                                   what ** day
a / b     Alternation    Matches either a or b (see Alternation).                      what a sunny / rainy day
"a b"     TermGroup      Treats multiple terms as one entity (see TermGroup).          you are / "will be" doing
prefix~   Completion     Matches terms starting with prefix.                           what an aw~ day
*_ADJ     Star_ADJ       Matches one adjective.                                        I feel *_ADJ
*_ADP     Star_ADP       Matches one adposition (preposition or postposition).         working *_ADP home
*_ADV     Star_ADV       Matches one adverb.                                           she sings *_ADV
*_CONJ    Star_CONJ      Matches one conjunction.                                      tea *_CONJ coffee
*_DET     Star_DET       Matches one determiner or article.                            go *_DET way
*_NOUN    Star_NOUN      Matches one single noun.                                      buy some *_NOUN
*_NUM     Star_NUM       Matches one numeral.                                          buy *_NUM bottles
*_PRON    Star_PRON      Matches one pronoun.                                          bring *_PRON flowers
*_PRT     Star_PRT       Matches one particle.                                         to step *_PRT
*_VERB    Star_VERB      Matches one verb.                                             I *_VERB you
_START_   SentenceStart  Matches the start of a sentence. See Sentence Boundary Tags.  _START_ as expected *
_END_     SentenceEnd    Matches the end of a sentence. See Sentence Boundary Tags.    as expected * _END_

Part-of-speech wildcards like *_ADJ will only match 2-grams and 3-grams. Longer ngrams have not been tagged by Google. See Ngram Types for details.
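To make the Star, StarStar and Completion rows of the table concrete, here is a small Python sketch of how such a query could be matched against a single ngram. It is not NGRAMS' implementation: the function names are hypothetical, terms are assumed to be whitespace-separated, matching is assumed to be case-insensitive (inferred from the how * you * example at the top of this page), and alternations, term groups, part-of-speech wildcards and escape sequences are left out.

```python
def matches(query: str, ngram: str) -> bool:
    """Check whether a simplified query matches one ngram."""
    return _match(query.split(), ngram.split())


def _match(q: list, n: list) -> bool:
    if not q:
        return not n  # query used up: ngram must be used up too
    head, rest = q[0], q[1:]
    if head == "**":
        # StarStar: try consuming 0, 1, 2, ... ngram terms
        return any(_match(rest, n[i:]) for i in range(len(n) + 1))
    if not n:
        return False  # every other query term needs one ngram term
    if head == "*":
        ok = True  # Star: any single term
    elif head.endswith("~"):
        # Completion: the ngram term must start with the prefix
        ok = n[0].lower().startswith(head[:-1].lower())
    else:
        ok = head.lower() == n[0].lower()  # plain term
    return ok and _match(rest, n[1:])


print(matches("how * you *", "How do you know"))
print(matches("what ** day", "what day"))
print(matches("what an aw~ day", "what an awful day"))
```

All three calls report a match. A query such as how * you * does not match the 3-gram How are you, because each * has to consume exactly one term.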

Alternation

An alternation checks multiple terms at once. The / can be read like a logical OR operator.

what a sunny / rainy / windy * checks

  • what a sunny *
  • what a rainy *
  • what a windy *
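The expansion listed above can be reproduced with a few lines of Python. This is only a sketch under the assumption that every / is surrounded by whitespace, as in the examples on this page. The function name expand_alternation is hypothetical, and term groups and escape sequences are not handled.

```python
from itertools import product


def expand_alternation(query: str) -> list:
    """Expand a query with / operators into the queries it checks."""
    slots = []          # one list of alternatives per query position
    join_next = False   # True right after a / was seen
    for term in query.split():
        if term == "/":
            join_next = True
        elif join_next and slots:
            slots[-1].append(term)  # another alternative for this position
            join_next = False
        else:
            slots.append([term])    # a new position
            join_next = False
    return [" ".join(choice) for choice in product(*slots)]


for q in expand_alternation("what a sunny / rainy / windy *"):
    print(q)
```

The loop prints the same three queries as the list above, in the same order.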

TermGroup

A term group treats zero or more terms as one entity. It is only useful within an alternation, i.e. as the left or right side of the / operator. A term group can also be empty to let an alternation check for the empty string.

you are / "will be" / "" doing checks

  • you are doing
  • you will be doing
  • you doing

A query is parsed from left to right. A term such as " or "foo (an opening term) starts a term group, and the next " or bar" (a closing term) ends it. A closing term that comes before any opening term is an error. To keep a quotation mark from opening or closing a term group unintentionally, use an escape sequence.
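The left-to-right rule can be sketched in Python as follows. This is an illustration, not NGRAMS' parser: the function name group_terms is hypothetical, terms are assumed to be whitespace-separated, and treating an unterminated group as an error is an assumption this page does not confirm. A term ending in \" is treated as escaped and does not close a group.

```python
def group_terms(query: str) -> list:
    """Split a query into terms, collecting term groups into lists."""
    result = []
    group = None  # None while outside a term group
    for term in query.split():
        closes = term.endswith('"') and not term.endswith('\\"')
        if group is not None:
            if closes:
                if term[:-1]:
                    group.append(term[:-1])
                result.append(group)      # closing term ends the group
                group = None
            else:
                group.append(term)
        elif term == '"':
            group = []                    # a lone quote opens a group
        elif term.startswith('"'):
            if closes:
                body = term[1:-1]         # e.g. "" or "foo"
                result.append([body] if body else [])
            else:
                group = [term[1:]]        # opening term such as "foo
        elif closes:
            raise ValueError("closing term before opening term: " + term)
        else:
            result.append(term)
    if group is not None:
        raise ValueError("term group is never closed")  # assumption
    return result


print(group_terms('you are / "will be" / "" doing'))
```

Plain terms stay strings, while each term group becomes a list, so the empty group "" shows up as an empty list.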

Sentence Boundary Tags

_START_ and _END_ are artificial terms that were inserted by Google after sentence detection. _START_ is placed before the first word of a sentence. _END_ is placed after the punctuation mark that finishes a sentence. As the ngrams in the dataset do not span across sentence boundaries, you will not find these terms in the middle of an ngram.

Escape Sequences

To disable the semantics of characters used as operators, you have to backslash-escape them. For example, to search for a literal * you have to enter \*. Here is the complete list of escape sequences:

Operator  Escape Sequence  Note
*         \*
**        \**
/         \/
"a        \"a              a can be any term or empty.
b"        b\"              b can be any term or empty.
prefix~   prefix\~         prefix can be any term or empty.
*_ADJ     \*_ADJ           Same for other part-of-speech wildcards.
_START_   \_START_
_END_     \_END_
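When queries are built programmatically, the table can be turned into a small helper that escapes a single literal term. The Python sketch below is an assumption-laden illustration: the function name escape_term is hypothetical, the list of part-of-speech tags is taken from the operator table on this page, and it has not been checked against NGRAMS' own parser.

```python
POS_TAGS = ("ADJ", "ADP", "ADV", "CONJ", "DET",
            "NOUN", "NUM", "PRON", "PRT", "VERB")
# Terms that act as an operator as a whole
WHOLE_TERM_OPERATORS = {"*", "**", "/", "_START_", "_END_"} | {
    "*_" + tag for tag in POS_TAGS
}


def escape_term(term: str) -> str:
    """Return the term with operator semantics disabled."""
    if term in WHOLE_TERM_OPERATORS:
        return "\\" + term                 # * becomes \*
    if term.endswith("~"):
        term = term[:-1] + "\\~"           # prefix~ becomes prefix\~
    elif term.endswith('"') and len(term) > 1:
        term = term[:-1] + '\\"'           # b" becomes b\"
    if term.startswith('"'):
        term = "\\" + term                 # "a becomes \"a
    return term


for t in ["*", "**", "/", '"a', 'b"', "aw~", "*_ADJ", "_START_", "hello"]:
    print(t, "->", escape_term(t))
```

Terms that contain no operator characters, such as hello, are returned unchanged.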

Tokenization

When Google compiled the dataset, they applied some tokenization (and normalization) to the raw material. For example, punctuation marks that usually follow directly after a term are split off to form a separate term, e.g. hello! became hello !. Another example is the splitting of contractions like they're into they 're for linguistic reasons.

To match a query against ngrams of this kind, the query has to be tokenized the same way. NGRAMS does this by default. As a side effect, a tokenized query can end up with more than 5 terms, the maximum ngram length, and can then no longer match. In this case NGRAMS responds with an error, and you have to shorten your query.

You can turn auto-tokenization off in search settings.
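The effect of auto-tokenization on query length can be illustrated with a rough Python sketch. It only covers the two examples given above (trailing punctuation and contractions like they're) and is not the tokenizer used by Google or NGRAMS. The function names, the punctuation set and the list of contraction suffixes are all assumptions made for illustration.

```python
import re

MAX_TERMS = 5
# Assumed contraction suffixes; the real rules may differ
_CONTRACTION = re.compile(r"^(\w+)('(?:re|ve|ll|s|d|m))$", re.IGNORECASE)
# A word followed directly by one or more punctuation marks
_TRAILING_PUNCT = re.compile(r"^(.*?\w)([!?.,;:]+)$")


def tokenize(query: str) -> list:
    """Split whitespace-separated terms further, roughly as described above."""
    tokens = []
    for term in query.split():
        trailing = []
        m = _TRAILING_PUNCT.match(term)
        if m:
            term = m.group(1)
            trailing = list(m.group(2))    # each mark becomes its own term
        m = _CONTRACTION.match(term)
        if m:
            tokens.extend([m.group(1), m.group(2)])  # they're -> they 're
        else:
            tokens.append(term)
        tokens.extend(trailing)
    return tokens


def exceeds_limit(query: str) -> bool:
    """True if the tokenized query has more than 5 terms."""
    return len(tokenize(query)) > MAX_TERMS


print(tokenize("hello!"))
print(tokenize("we know they're here!"))
print(exceeds_limit("we know they're here!"))
```

The last query has only four whitespace-separated terms, but six terms after tokenization, so under these assumptions it could no longer match and would have to be shortened.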