Query Language - ngrams-dev/general GitHub Wiki
NGRAMS' query language is similar to a very simple form of regular expression. A query, formed as a sequence of terms and query operators, is matched against all indexed ngrams in the selected corpus.
The query
how * you *
returns
2468812 How do you know
845786 How do you feel
830062 How did you know
755042 How are you ?
739312 How do you do
...
Results are sorted by the ngram's total match count in descending order. See Data Model for details.
Because the underlying ngrams are between 1 and 5 terms long, a query must be matchable within this range. For example, the following query has 6 terms and will never match:
one two three four five six
Sometimes NGRAMS splits a term further in order to match how the underlying ngrams were tokenized by Google. In this case a query that was entered with 5 or fewer terms can still exceed the limit of 5 terms after splitting. More about that in Tokenization.
Query Operators
Here is the complete list of query operators.
Operator | Name | Description | Example |
---|---|---|---|
* | Star | Matches one term. | what * day |
** | StarStar | Matches zero or more terms. | what ** day |
a / b | Alternation | Matches either a or b; see Alternation. | what a sunny / rainy day |
"a b" | TermGroup | Treats multiple terms as one entity; see TermGroup. | you are / "will be" doing |
prefix~ | Completion | Matches terms starting with prefix. | what an aw~ day |
*_ADJ | Star_ADJ | Matches one adjective. | I feel *_ADJ |
*_ADP | Star_ADP | Matches one adposition (preposition or postposition). | working *_ADP home |
*_ADV | Star_ADV | Matches one adverb. | she sings *_ADV |
*_CONJ | Star_CONJ | Matches one conjunction. | tea *_CONJ coffee |
*_DET | Star_DET | Matches one determiner or article. | go *_DET way |
*_NOUN | Star_NOUN | Matches one single noun. | buy some *_NOUN |
*_NUM | Star_NUM | Matches one numeral. | buy *_NUM bottles |
*_PRON | Star_PRON | Matches one pronoun. | bring *_PRON flowers |
*_PRT | Star_PRT | Matches one particle. | to step *_PRT |
*_VERB | Star_VERB | Matches one verb. | I *_VERB you |
_START_ | SentenceStart | Matches the start of a sentence. See Sentence Boundary Tags. | _START_ as expected * |
_END_ | SentenceEnd | Matches the end of a sentence. See Sentence Boundary Tags. | as expected * _END_ |
Part-of-speech wildcards like *_ADJ will only match 2-grams and 3-grams. Longer ngrams have not been tagged by Google. See Ngram Types for details.
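To make the semantics of Star, StarStar and Completion concrete, here is a minimal Python sketch of how a query could be matched against a single ngram. It is not NGRAMS' actual implementation: the function names `term_matches` and `query_matches` are hypothetical, case-insensitive comparison is an assumption based on the example results above, and alternation, term groups, part-of-speech wildcards and escape sequences are left out.

```python
def term_matches(pattern: str, term: str) -> bool:
    """Match one query term against one ngram term.

    Assumption: comparison is case-insensitive.
    """
    if pattern == "*":
        # Star: matches any single term.
        return True
    if pattern.endswith("~"):
        # Completion: matches terms starting with the given prefix.
        return term.lower().startswith(pattern[:-1].lower())
    return pattern.lower() == term.lower()


def query_matches(query: list[str], ngram: list[str]) -> bool:
    """Return True if the query matches the whole ngram."""
    if not query:
        return not ngram
    head, rest = query[0], query[1:]
    if head == "**":
        # StarStar: try consuming zero, one, two, ... ngram terms.
        return any(query_matches(rest, ngram[i:]) for i in range(len(ngram) + 1))
    return bool(ngram) and term_matches(head, ngram[0]) and query_matches(rest, ngram[1:])


print(query_matches("what * day".split(), "what a sunny day".split()))
print(query_matches("what ** day".split(), "what a sunny day".split()))
```

The first call prints False because * stands for exactly one term, while the second prints True because ** can absorb both "a" and "sunny".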
Alternation
An alternation checks multiple terms at once. The / can be read like a logical OR operator.
what a sunny / rainy / windy *
checks
what a sunny *
what a rainy *
what a windy *
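The expansion above can be sketched in a few lines of Python. This is an illustration only, not NGRAMS' parser: the function name `expand_alternations` is hypothetical, and the sketch assumes that every / is separated from its neighbours by whitespace and that no term groups or escape sequences are present.

```python
from itertools import product


def expand_alternations(query: str) -> list[str]:
    """Expand a query with / operators into the plain queries it checks."""
    slots: list[list[str]] = []  # one list of alternatives per position
    join_next = False
    for token in query.split():
        if token == "/":
            # The next term is an alternative for the previous position.
            join_next = True
        elif join_next and slots:
            slots[-1].append(token)
            join_next = False
        else:
            slots.append([token])
            join_next = False
    # Pick one alternative per position, in every possible combination.
    return [" ".join(combo) for combo in product(*slots)]


for q in expand_alternations("what a sunny / rainy / windy *"):
    print(q)
```

A query with several alternations expands to the cross product of all alternatives, so the number of checked queries grows multiplicatively.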
TermGroup
A term group treats zero or more terms as one entity. It is only useful within an alternation, i.e. as the left or right side of the / operator. A term group can also be empty to let an alternation check for the empty string.
you are / "will be" / "" doing
checks
you are doing
you will be doing
you doing
When a query is parsed from left to right, a term like " or "foo (opening term) is interpreted as the start of a term group. The next " or bar" (closing term) is interpreted as the end of this term group. It is an error if a closing term comes before an opening term. Unintended opening and closing terms can be avoided using an escape sequence.
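The left-to-right parsing rule can be sketched as follows. This is a hedged illustration rather than NGRAMS' real parser: `parse_units` and `expand` are hypothetical names, terms are assumed to be separated by whitespace, and treating an unterminated term group as an error is an assumption that the text above does not state.

```python
from itertools import product
from typing import Optional


def parse_units(query: str) -> list[str]:
    """Split a query into units: plain terms, "/", and joined term groups."""
    units: list[str] = []
    group: Optional[list[str]] = None  # None means "not inside a term group"
    for token in query.split():
        # A trailing quote closes a group unless it is backslash-escaped.
        closes = token.endswith('"') and not token.endswith('\\"')
        if group is None:
            if token.startswith('"'):
                # Opening term: " or "foo (also covers "" and "foo").
                group = []
                token = token[1:]
                closes = token.endswith('"') and not token.endswith('\\"')
            elif closes:
                raise ValueError("closing term before opening term: " + token)
            else:
                units.append(token)
                continue
        if closes:
            token = token[:-1]
        if token:
            group.append(token)
        if closes:
            units.append(" ".join(group))
            group = None
    if group is not None:
        # Assumption: an unterminated group is rejected.
        raise ValueError("unterminated term group")
    return units


def expand(query: str) -> list[str]:
    """Expand alternations, treating each term group as one alternative."""
    slots: list[list[str]] = []
    join_next = False
    for unit in parse_units(query):
        if unit == "/":
            join_next = True
            continue
        if join_next and slots:
            slots[-1].append(unit)
        else:
            slots.append([unit])
        join_next = False
    # Empty groups contribute no term to the resulting query.
    return [" ".join(u for u in combo if u) for combo in product(*slots)]


for q in expand('you are / "will be" / "" doing'):
    print(q)
```

Joining each group into a single unit before expanding is what lets "will be" act as one alternative next to the single term are.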
Sentence Boundary Tags
_START_ and _END_ are artificial terms that were inserted by Google after sentence detection. _START_ is placed before the first word of a sentence. _END_ is placed after the punctuation mark that finishes a sentence. As the ngrams in the dataset do not span across sentence boundaries, you will not find these terms in the middle of an ngram.
Escape Sequences
To disable the semantics of characters used as operators, you have to backslash-escape them. For example, to search for a literal * you have to enter \*. Here is the complete list of escape sequences:
Operator | Escape Sequence | Note |
---|---|---|
* | \* | |
** | \** | |
/ | \/ | |
"a | \"a | a can be any term or empty. |
b" | b\" | b can be any term or empty. |
prefix~ | prefix\~ | prefix can be any term or empty. |
*_ADJ | \*_ADJ | Same for other part-of-speech wildcards. |
_START_ | \_START_ | |
_END_ | \_END_ | |
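If queries are built programmatically from arbitrary text, the table above can be applied term by term. The following Python sketch shows one possible way to do that. The helper name `escape_term` is hypothetical and not part of NGRAMS, and the sketch assumes that the input is a single term without whitespace.

```python
POS_TAGS = ("ADJ", "ADP", "ADV", "CONJ", "DET", "NOUN", "NUM", "PRON", "PRT", "VERB")

# Terms that are operators as a whole and are escaped with a leading backslash.
WHOLE_TERM_OPERATORS = {"*", "**", "/", "_START_", "_END_"} | {
    "*_" + tag for tag in POS_TAGS
}


def escape_term(term: str) -> str:
    """Escape one term so that it is searched for literally."""
    if term in WHOLE_TERM_OPERATORS:
        return "\\" + term
    if term.endswith(("~", '"')):
        # Completion (prefix~) or closing term (b"): escape the last character.
        term = term[:-1] + "\\" + term[-1]
    if term.startswith('"'):
        # Opening term ("a): escape the leading quote.
        term = "\\" + term
    return term


for t in ["*", "aw~", '"a', 'b"', "*_ADJ", "_START_", "hello"]:
    print(t, "->", escape_term(t))
```

Terms that contain no operator syntax, such as hello, pass through unchanged.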
Tokenization
When Google compiled the dataset, they applied some tokenization (and normalization) to the raw material. For example, punctuation marks that usually follow directly after a term are split off to form a separate term, e.g. hello! became hello ! . Another example is the splitting of contractions like they're into they 're for linguistic reasons.
In order to match a query against this kind of ngram, the query has to be tokenized the same way. NGRAMS does this by default. However, tokenization can push a query past the limit of 5 terms, so that it is no longer matchable. In this case NGRAMS responds with an error, and you have to shorten your query.
You can turn auto-tokenization off in search settings.
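The effect of auto-tokenization on the term limit can be illustrated with a rough Python sketch. It only approximates the two rules mentioned above (splitting off punctuation and splitting contractions at the apostrophe). The real tokenization rules used by Google and NGRAMS are more involved. The names `tokenize_plain_text` and `check_query` are hypothetical, and the sketch handles plain words only, not query operators.

```python
import re

MAX_TERMS = 5  # ngrams in the dataset are between 1 and 5 terms long


def tokenize_plain_text(query: str) -> list[str]:
    """Approximate tokenization of plain words (no query operators)."""
    terms: list[str] = []
    for raw in query.split():
        # An apostrophe suffix ('re), a word (they), or one punctuation mark (!).
        terms.extend(re.findall(r"'\w+|\w+|[^\w\s]", raw))
    return terms


def check_query(query: str) -> list[str]:
    """Tokenize a query and reject it if it exceeds the term limit."""
    terms = tokenize_plain_text(query)
    if len(terms) > MAX_TERMS:
        raise ValueError(
            "query has %d terms after tokenization, limit is %d" % (len(terms), MAX_TERMS)
        )
    return terms


print(check_query("hello!"))
print(check_query("I know they're here"))
```

In this sketch, I know they're here is entered as 4 terms but tokenizes to 5, so adding a trailing ! would push it over the limit.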