Getting Started - gosukiwi/Pasukon GitHub Wiki
If you don't know what a parser or a grammar is, this is the place for you. Quoting the website:
Parsers derive meaning from arbitrary text. They have many different applications, from programming language creation, to decoding JSON, to translating between different formats.
Pasukon allows you to create a parser by using its friendly domain-specific language to define a grammar.
A grammar is simply some plain text that is given to Pasukon as input. You can think of it as a tiny programming language (which is what domain-specific language means!).
Pasukon's output is a parser. Parsers take some text and report whether that text is valid or not. For example, you can have a parser that matches the character `A`: `parser.parse('A')` will be valid, and everything else (eg: `parser.parse('B')` or `parser.parse('BCA')`) won't.
To make such a parser, we can use this grammar:
```
lex
match A 'A'
/lex

start
| :A
;
```
The easiest way to try the grammar is using the online editor. It's recommended to use the editor and follow along with this guide, so you can play around with it as you go.
If you want, you can also do it programmatically. Save the grammar as `grammar.pasukon`, then load it up and try the generated parser:
```js
const fs = require('fs')
const Pasukon = require('pasukon').Pasukon

const parser = new Pasukon(fs.readFileSync('grammar.pasukon').toString())
parser.parse('A') // => "A"
```
The grammar is divided into two parts: the lexing step (also known as tokenizing) and the parsing step.
The lexing step is defined between the `lex` and `/lex` tags, and transforms the input into meaningful tokens. Let's see an example:
```
lex
match A 'A'
match B 'B'
/lex
```
If we give it the input `"AB"`, it will return something like `[<Token A>, <Token B>, <Token EOF>]`. This step makes it easier (and faster) for the parser to do its work later on. Each token is an object which holds the matched string, as well as its position.
You can also match using Regular Expressions:
```
lex
match IDENTIFIER /[a-zA-Z][a-zA-Z0-9_-]*/
/lex
```
The above will match any string starting with a letter, followed by any letters or numbers, as well as the characters `-` and `_`. It's important to note that matches are tested from top to bottom, so in this case:
```
lex
match A 'A'
match IDENTIFIER /[a-zA-Z][a-zA-Z0-9_-]*/
/lex
```
For the input `"ABC"`, it will match `[<Token A>, <Token "BC">, <Token EOF>]`. If we wanted to match `ABC` as `IDENTIFIER`, we need to move that match before `match A 'A'`.
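To make the ordering concrete, here is a toy lexer in plain JavaScript (an illustration only, not Pasukon's actual implementation) that tries its rules from top to bottom, just like the grammar above:

```javascript
// Rules are tried in order; the first whose pattern matches at the start
// of the remaining input wins.
const rules = [
  { name: 'A', pattern: /^A/ },
  { name: 'IDENTIFIER', pattern: /^[a-zA-Z][a-zA-Z0-9_-]*/ }
]

function lex (input) {
  const tokens = []
  while (input.length > 0) {
    const rule = rules.find((r) => r.pattern.test(input))
    if (rule === undefined) throw new Error(`Invalid input: ${input}`)
    const matched = input.match(rule.pattern)[0]
    tokens.push({ name: rule.name, value: matched })
    input = input.slice(matched.length)
  }
  tokens.push({ name: 'EOF', value: null })
  return tokens
}

console.log(lex('ABC').map((t) => t.name)) // [ 'A', 'IDENTIFIER', 'EOF' ]
```

Because the `A` rule comes first, the leading `A` of `"ABC"` is consumed by it, and only the remaining `"BC"` becomes an `IDENTIFIER` token.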
Besides matching, we can also ignore some particular input, like whitespace, using the `ignore` keyword:
```
lex
match A 'A'
match IDENTIFIER /[a-zA-Z][a-zA-Z0-9_-]*/
ignore WHITESPACE /\s+/
/lex
```
When ignoring, the matched token is silently skipped. This is useful, as otherwise the lexer would complain that it found some invalid input.
Pasukon allows you to define your own lexer, for when the built-in lexer just isn't enough. Instructions on how to do that are out of the scope of this article; feel free to check out the documentation for that here.
Parsing is where you'll spend most of your time. Before we talk about parsing, let's talk about what a combinator is. A combinator takes one parser (unary) or two parsers (binary) and combines them into a new parser.
For example, the `many1` unary combinator takes one parser and returns a new parser which matches the original one or more times. The `then` binary combinator takes two parsers and returns a new parser that matches the first parser and, if that succeeded, tries to match the second.
With this simple, composable approach you can start simple and build really complex parsers.
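The idea is easy to see in plain JavaScript. The sketch below is not Pasukon's internals, just a minimal illustration of what "combining parsers" means: a parser here is a function from an input string to `{ success, rest }`, where `rest` is the unconsumed remainder.

```javascript
// A parser that matches a single given character.
function char (c) {
  return (input) =>
    input[0] === c
      ? { success: true, rest: input.slice(1) }
      : { success: false, rest: input }
}

// `then` is binary: match the first parser, and only if it succeeds,
// run the second parser on the remaining input.
function then (a, b) {
  return (input) => {
    const first = a(input)
    if (!first.success) return { success: false, rest: input }
    return b(first.rest)
  }
}

// `many1` is unary: match the given parser one or more times.
function many1 (parser) {
  return (input) => {
    let result = parser(input)
    if (!result.success) return { success: false, rest: input }
    let next = parser(result.rest)
    while (next.success) {
      result = next
      next = parser(result.rest)
    }
    return { success: true, rest: result.rest }
  }
}

const aThenB = then(char('A'), char('B'))
console.log(aThenB('AB').success) // true

const manyA = many1(char('A'))
console.log(manyA('AAAB').rest) // 'B' (the three As were consumed)
```

Because every combinator returns a plain parser again, the results can be fed straight into further combinators, which is exactly how complex grammars grow out of tiny pieces.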
The syntax for parsing looks like this:
```
<rule-name>
| <option-1>
| <option-2>
...
;
```
All rules return a new parser, which is the result of the combinations specified in its body. Let's see an example:
```
lex
match A 'A'
/lex

start
| token A
;
```
The `start` rule returns a new parser, which calls the `token` parser with the input `A`. The `token` parser is special because it takes a single argument: the name of the token to match. In this case, we want to match a single `A`, just like our initial example.
What if we want to match an `A` or a `B`? We can do it like this:
```
lex
match A 'A'
match B 'B'
/lex

start
| token A
| token B
;
```
Each option is specified using the pipe (`|`) character. Pasukon will try to match from top to bottom. If a branch matches, it will return, and the remaining options are not executed.
Because writing `token` can get repetitive, there is a special syntax for it: `:`. The grammar above is exactly the same as:
```
lex
match A 'A'
match B 'B'
/lex

start
| :A
| :B
;
```
For simple cases like that, it might be easier to read if we just call the `or` combinator manually:
```
lex
match A 'A'
match B 'B'
/lex

start
| :A or :B
;
```
Because `or` is a binary combinator, it's put right in the middle of two parsers, just like you'd write `1 + 1` or `2 * 2`. To call a unary combinator, you simply put it before another parser:
```
lex
match A 'A'
match B 'B'
/lex

start
| many0 (:A or :B)
;
```
In the grammar above, we call `many0` on the parser generated by `or`. So we will match `A` or `B`, zero or more times: `AABB`, `BBA`, and any other combination of the two will work.
We can also call other rules just by using their name:
```
lex
match A 'A'
match B 'B'
/lex

a-or-b
| :A or :B
;

start
| many0 a-or-b
;
```
WIP