Getting Started - gosukiwi/Pasukon GitHub Wiki
If you don't know what a parser or a grammar is, this is the place for you. Quoting the website:
Parsers derive meaning from arbitrary text. They have many different applications, from programming language creation, to decoding JSON, to translating between different formats.
Pasukon allows you to create a parser by using its friendly domain-specific language to define a grammar.
A grammar is simply some plain text that is given to Pasukon as input. You can think of it as a tiny programming language (which is what domain-specific language means!).
Pasukon's output is a parser. Parsers take some text and report whether that text is valid or not. For example, you can have a parser that matches the character `A`: `parser.parse('A')` will be valid, and everything else (eg: `parser.parse('B')` or `parser.parse('BCA')`) won't.
To make such a parser, we can use this grammar:
```
lex
match A 'A'
/lex

start
| :A
;
```
The easiest way to try the grammar is using the online editor. It's recommended to use the editor and follow along with this guide, so you can play around with it as you go.
If you want, you can also do it programmatically. Save the grammar as `grammar.pasukon`, then load it up and try the generated parser:
```js
const fs = require('fs')
const Pasukon = require('pasukon').Pasukon

const parser = new Pasukon(fs.readFileSync('grammar.pasukon').toString())
parser.parse('A') // => "A"
```
The grammar is divided into two parts: the lexing step (also known as tokenizing) and the parsing step.
The lexing step is defined between the `lex` and `/lex` tags, and transforms the input into meaningful tokens. Let's see an example:
```
lex
match A 'A'
match B 'B'
/lex
```
If we give it the input `"AB"`, it will return something like `[<Token A>, <Token B>, <Token EOF>]`. This step makes it easier (and faster) for the parser to do its work later on. Each token is an object which holds the matched string, as well as its position.
You can also match using Regular Expressions:
```
lex
match IDENTIFIER /[a-zA-Z][a-zA-Z0-9_-]*/
/lex
```
The above will match any string starting with a letter, followed by any letters or numbers, as well as the characters `-` and `_`. It's important to note that matches are tested from top to bottom, so in this case:
```
lex
match A 'A'
match IDENTIFIER /[a-zA-Z][a-zA-Z0-9_-]*/
/lex
```
For the input `"ABC"`, it will match `[<Token A>, <Token "BC">, <Token EOF>]`. If we wanted to match `ABC` as `IDENTIFIER`, we need to move that match before `match A 'A'`.
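To make the ordering concrete, here is a toy lexer in plain JavaScript (an illustration only, not Pasukon's actual implementation) that tries its rules from top to bottom, just like the grammar above:

```javascript
// Rules are tried in order; the first whose pattern matches at the start
// of the remaining input wins.
const rules = [
  { name: 'A', pattern: /^A/ },
  { name: 'IDENTIFIER', pattern: /^[a-zA-Z][a-zA-Z0-9_-]*/ }
]

function lex (input) {
  const tokens = []
  while (input.length > 0) {
    const rule = rules.find((r) => r.pattern.test(input))
    if (rule === undefined) throw new Error(`Invalid input: ${input}`)
    const matched = input.match(rule.pattern)[0]
    tokens.push({ name: rule.name, value: matched })
    input = input.slice(matched.length)
  }
  tokens.push({ name: 'EOF', value: null })
  return tokens
}

console.log(lex('ABC').map((t) => t.name)) // [ 'A', 'IDENTIFIER', 'EOF' ]
```

Because the `A` rule comes first, the leading `A` of `"ABC"` is consumed by it, and only the remaining `"BC"` becomes an `IDENTIFIER` token.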
Besides matching, we can also ignore some particular input, like whitespace, using the `ignore` keyword:
```
lex
match A 'A'
match IDENTIFIER /[a-zA-Z][a-zA-Z0-9_-]*/
ignore WHITESPACE /\s+/
/lex
```
When ignoring, the matched token is silently skipped. This is useful, as otherwise the lexer would complain that it found some invalid input.
Pasukon allows you to define your own lexer, for when the built-in lexer just isn't enough. Instructions on how to do that are out of the scope of this article; feel free to check out the documentation for that here.
Parsing is where you'll spend most of your time. Before we talk about parsing, let's talk about what a combinator is. A combinator takes one parser (unary) or two parsers (binary) and combines them into a new parser.
For example, the `many1` unary combinator takes one parser and returns a new parser which matches the original one or more times. The `then` binary combinator takes two parsers and returns a new parser that matches the first parser and, if that succeeded, tries to match the second.
With this simple, composable approach you can start simple and build really complex parsers.
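The idea is easy to see in plain JavaScript. The sketch below is not Pasukon's internals, just a minimal illustration of what "combining parsers" means: a parser here is a function from an input string to `{ success, rest }`, where `rest` is the unconsumed remainder.

```javascript
// A parser that matches a single given character.
function char (c) {
  return (input) =>
    input[0] === c
      ? { success: true, rest: input.slice(1) }
      : { success: false, rest: input }
}

// `then` is binary: match the first parser, and only if it succeeds,
// run the second parser on the remaining input.
function then (a, b) {
  return (input) => {
    const first = a(input)
    if (!first.success) return { success: false, rest: input }
    return b(first.rest)
  }
}

// `many1` is unary: match the given parser one or more times.
function many1 (parser) {
  return (input) => {
    let result = parser(input)
    if (!result.success) return { success: false, rest: input }
    let next = parser(result.rest)
    while (next.success) {
      result = next
      next = parser(result.rest)
    }
    return { success: true, rest: result.rest }
  }
}

const aThenB = then(char('A'), char('B'))
console.log(aThenB('AB').success) // true

const manyA = many1(char('A'))
console.log(manyA('AAAB').rest) // 'B' (the three As were consumed)
```

Because every combinator returns a plain parser again, the results can be fed straight into further combinators, which is exactly how complex grammars grow out of tiny pieces.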
The syntax for parsing looks like this:
```
<rule-name>
| <option-1>
| <option-2>
...
;
```
All rules return a new parser, which is the result of the combinations specified in its body. Let's see an example:
```
lex
match A 'A'
/lex

start
| token A
;
```
The `start` rule returns a new parser, which calls the `token` parser with the input `A`. The `token` parser is special because it takes a single argument: the name of the token to match. In this case, we want to match a single `A`, just like our initial example.
What if we want to match an `A` or a `B`? We can do it like this:
```
lex
match A 'A'
match B 'B'
/lex

start
| token A
| token B
;
```
Each option is specified using the pipe (`|`) character. Pasukon will try to match from top to bottom. If a branch matches, it will return, and the remaining options are not executed.
Because writing `token` can get repetitive, there is a special syntax for it: `:`. The grammar above is exactly the same as:
```
lex
match A 'A'
match B 'B'
/lex

start
| :A
| :B
;
```
For simple cases like that, it might be easier to read if we just call the `or` combinator manually:
```
lex
match A 'A'
match B 'B'
/lex

start
| :A or :B
;
```
Because `or` is a binary combinator, it's put right in the middle of two parsers, just like you'd write `1 + 1` or `2 * 2`. To call a unary combinator, you simply put it before another parser:
```
lex
match A 'A'
match B 'B'
/lex

start
| many0 (:A or :B)
;
```
In the grammar above, we call `many0` on the parser generated by `or`. So we will match `A` or `B`, zero or more times: `AABB`, `BBA`, and any other combination of the two will work.
We can also call other rules just by using their name:
```
lex
match A 'A'
match B 'B'
/lex

a-or-b
| :A or :B
;

start
| many0 a-or-b
;
```
WIP