EBNF: How to Describe the Grammar of a Language - go-sqlparser/current GitHub Wiki

EBNF: How to Describe the Grammar of a Language

Written by Federico Tomassetti
in Parsing

The EBNF is the most commonly used formalism to describe the structure of languages.

In this article we are going to see:

Does it sound like a plan?

What Is the EBNF?

The EBNF is a way to specify a formal language grammar. It can be considered a metalanguage because it is a language to describe other languages.

A formal language is a language with a precise structure, like programming languages, data languages, or Domain Specific Languages (DSLs). Java, XML, and CSS are all examples of formal languages.

A grammar can be used to define two opposite things:

how to recognize the different portions in a piece of code written in the formal language
the possible ways to build a valid piece of code in the formal language

For example, a simple grammar could tell us that a document in our language is composed by a list of declarations, and declarations are defined by the sequence of a keyword Xyz and a a name.

Based on this we could:

recognize in the code sequences of the keyword _Xyz _and a name as instances of these declarations we have considered
we could generate valid documents in our language by writing a list of declarations, each time inserting a keyword Xyz and a name

While there are two possible usages for a grammar, we are typically interested only in the first one: recognizing if a piece of code is valid for a given language and identifying the different structures typical of the language (like functions, methods, classes, etc.).

What EBNF Means?

Ok, but what EBNF stands for? EBNF stands for Extended Backus-Naur Form. It will not surprise you to read that it is an extended version of the Backus-Naur form (BNF).

There is at least another format derived from BNF which is called ABNF, for Augment Backus-Naur Form. ABNF main purpose is to describe bidirectional communications protocols. EBNF is the most used variant of the format.

While there is a standard for EBNF it is common to see different extensions or slightly different syntaxes to be used. In the rest of the article we will add more comments when looking a specific parts of EBNF.

Why BNF Is Not Enough?

EBNF was invented to overcome the limitations of the base format. The main one is the non-existing support to easily define repetitions. That means that with BNF common patterns, like defining a series of repeatable elements, is cumbersome and relies on counter-intuitive logical math.

For example, to define a list of words separated by a comma (e.g., john, coffee, logic) you would like to say something similar to "a list is one word followed by a many pairs of comma and word". You can say like that with EBNF. Instead in the basic BNF format there is no equivalent of "many". So to describe the same thing you would have to say something like "a list is one word or a list followed by a pair of comma and word". This works, but it is complicated because it does not define one list, but a nested series of lists.

Basically, the previous example would be "john, coffee, logic is a list of john and , followed by list of coffee and , and logic".

Examples of EBNF Grammars

We are going to see some examples of grammars taken from a list available on github. Later we could refer to them while explaining the rules.

Note also that for each language you could have different equivalent grammars. So for the same languages you could find grammars which are not exactly the same and yet they are correct anyway.

How We Typically Define a Grammar Using EBNF?

We define a grammar by specifying how to combine single elements in significant structures. As a first approximation we can consider single words to be elements. The structures correspond to sentences, periods, paragraphs, chapters, and entire documents.

A grammar tells us what are the correct ways to put together single words to obtain sentences, and how we can progress by combining sentences into periods, periods into paragraphs and so on until we get the entire document.

EBNF: Terminals and Non-Terminals

Terminals and Non-Terminals

In the approximation we used, single words correspond to terminals, while all the structures built on top of them (sentences, periods, paragraphs, chapters, and entire documents) correspond to non-terminals.

How Terminals Look Like

Terminals are sometimes also called tokens. They are the smallest block we consider in our EBNF grammars.

A terminal could be either:

a quoted literal
a regular expression
a name referring to a terminal definition. This option is not part of the EBNF standard, but it is used very frequently. Typically uppercase names are used for these terminal definitions, to distinguish them from non-terminal definitions. These latter definitions are the proper production rules

For example, in the grammar you could use "when" to indicate that exact string. Also regular expressions are used, like /[a-z]+/. Finally we could group terminal definitions somewhere and then use their names to refer to them. Using definitions have the advantage of permitting to reuse them multiple times.

Let's see some typical terminals:

identifiers: these are the names used for variables, classes, functions, methods and so on. Typically most languages use different conventions for different names. For example in Java class names start with an upper case letter, static constants are written using all upper case letters, while methods and variable names start with a lowercase letter. However these are just best practices: in Java there is just one type of identifier that can be used everywhere. This is not the case for all the languages. In languages like Haskell, identifiers used for types must start with an uppercase letter. Another thing to consider is that the definition of identifiers typically overlaps with the definitions of keywords. The latters should take precedence. I.e., if a token could be either an identifier or a keyword than it should be recognized as a keyword.
keywords: almost every language uses keywords. They are exact strings that are used to indicate the start of a definition (think about class in Java or def in Python), a modifier (public, private, static, final, etc.) or control flow structures (while, for, until, etc.)
literals: these permit to define values in our languages. We can have string literals, numeric literal, char literals, boolean literals (but we could consider them keywords as well), array literals, map literals, and more, depending on the language
separators and delimiters: like colons, semicolons, commas, parenthesis, brackets, braces
whitespaces: spaces, tabs, newlines. They are typically not meaningful and they can be used everywhere in your code
_comments: _most languages have one or more ways to define comments and commonly they can be used everywhere. Some languages could have more structured forms of documentation comments

Whitespaces and comments are typically ignored in EBNF grammars. This is because usually they could be used everywhere in the language, so they should be reported all over the grammar. Some tools have specific options to mark some terminals as terminals to ignore.

How Non-Terminals Look Like

Non-terminals are obtained by grouping terminals and other non-terminals in a hierarchy. After all our goal is to obtain an Abstract Syntax Tree, which is a tree. Our tree will have a root: one non-terminal representing our entire document. The root will contain other non-terminals that will contain other non-terminals and so on.

The picture below show how we can go from a stream of tokens (or terminals) to an AST, which groups terminals into a hierarchy of non-terminals.

EBNF - From stream of tokens to AST
EBNF -- From stream of tokens to AST

In the grammar we will define the parser rules that determine how the AST is built.

We have seen that non-terminals represent structures at different levels. Examples of non-terminals are:

program/document: represent the entire file
_module/classes: _group several declarations togethers
_functions/methods: _group statements together
_statements: _these are the single instructions. Some of them can contain other statements. For example the loops have a body which is a list of other statements
_expressions: _are typically used within statements and can be composed in various ways

How non terminals are defined? We are going to see it in the next section.

Definition of Production Rules

An EBNF grammar is substantially a list of production rules. Each production rule tells us how a non-terminal can be composed. We are now going to see the different elements that we can use in such rules.

.
.
.

Syntactic vs Semantic Validity

We have seen that the EBNF can be used to define a grammar, we should consider that a grammar defines what is syntactically valid, but it tells us nothing about what is semantically valid.

What does it mean? Consider these examples:

a document containing multiple declarations of elements with the same name
a sum between two variables: an integer and a boolean
a reference to a variable that was not declared before

We can imagine a language where all these examples are syntactically correct, i.e., they are built according to the grammar of the language. However they are all semantically incorrect, because they do not respect additional constraints on how the language should be used. These semantics constraints are used in addition to the grammar, after the structures of the language have been recognized. In the grammar we can say things like a declaration should be composed by a certain keyword and a name, with semantic constraints we can say two declarations should not have the same name. By combining syntactic and semantic rules we can express what is valid in our language.

How do you define semantic rules? By writing code that works on the AST. You cannot do that in the EBNF grammar.

Where to Go From Here?

An EBNF grammar is useful to support discussion and to communicate with other langue designers. Typically you may want to do more with it: you want to build actual parsers and tools to process languages from your EBNF grammars. I personally like to work with ANTLR. My suggestion is to take a look at the ANTLR Mega Tutorial.

You could be interested in learning more about parsing. We have prepared different articles explaining how to proceed depending on the language you prefer to work with:

And if you are lost and unsure about how to move forward just let us know.

Summary

In this article we aimed to be practical: we have discussed what is done in practice and what you need to start writing grammars using EBNF. We have discussed the differences between typical usages and the standard and tried to give a complete and correct picture.

There are things we did not discuss: for example the history that lead to the creation of EBNF or its roots in the work from Chomsky and other scientists. We have not discussed what a context free grammar is. The theory tells us that EBNF cannot be used to describe all possible forms of grammars. However it is powerful enough to describe all formal languages you should be interested in writing.

In the rest of this website you should find articles about the next steps like building a compilers or an editor for your language. So, have fun!