Oil Parser Generator Project - oilshell/oil GitHub Wiki

This is an introduction to an important subproject of https://www.oilshell.org/

(Note that we also need help on the Python-to-C++ translator. This project is separate from that work: it also involves parsing, but is otherwise not closely related.)

Description

Oil is developed "middle out" -- it has an "executable spec" in Python, which is then semi-automatically translated to C++.

Much of the code works in C++, but the expression parser does not. It's something of a special case.

Oil borrows the parsing approach of CPython, which is LL parsing (in the same family as ANTLR). For background, see "The origins of pgen" by Guido van Rossum.


In Oil's Python implementation, the oil_lang/grammar_gen.py tool reads the grammar oil_lang/grammar.pgen2 and produces parse tables in Python's "marshal" format. At runtime, the pgen2/ library reads those tables.
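As a rough illustration of that generate-then-load split, here is a minimal sketch using Python's standard marshal module. The table layout, the key names, and the dump_tables/load_tables helpers are hypothetical; the real contents of _devbuild/gen/grammar.marshal may be organized differently.

```python
import marshal

# Hypothetical, simplified parse tables. The real layout written by
# oil_lang/grammar_gen.py may differ.
tables = {
    "number2symbol": {256: "expr"},
    "dfas": {
        # nonterminal number -> list of states;
        # each state is a list of (label, next_state) arcs
        256: [[(1, 1)], [(2, 0), (0, 1)]],
    },
    "start": 256,
}


def dump_tables(tables):
    """Serialize the tables, as a generator tool might at build time."""
    return marshal.dumps(tables)


def load_tables(data):
    """Deserialize the tables, as a runtime library might at startup."""
    return marshal.loads(data)


blob = dump_tables(tables)
loaded = load_tables(blob)
print(loaded["start"])
```

The point of the sketch is only that the tables are plain data: dicts, lists, tuples, and integers. That is what makes it plausible to emit the same information as static C arrays instead.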

So instead of outputting these Python data structures, we want to output C data structures, as CPython itself did up through Python 3.8. (Python 3.9 switched to a PEG parser.)
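A generator that emits C could look roughly like the sketch below. The output format is modeled on the Python/graminit.c excerpt under Data Snippets, where each arc is a {label, next_state} pair. The emit_arcs and emit_dfa helpers and their input shape are assumptions for illustration, not part of oil_lang/grammar_gen.py.

```python
def emit_arcs(dfa_index, state_index, arcs):
    """Format one DFA state's arcs as a C array, in the style of graminit.c.

    arcs is a list of (label, next_state) pairs (a hypothetical input shape).
    """
    lines = ["static arc arcs_%d_%d[%d] = {" % (dfa_index, state_index, len(arcs))]
    for label, next_state in arcs:
        lines.append("    {%d, %d}," % (label, next_state))
    lines.append("};")
    return "\n".join(lines)


def emit_dfa(dfa_index, states):
    """Emit one C array per state of a single DFA."""
    return "\n".join(
        emit_arcs(dfa_index, i, arcs) for i, arcs in enumerate(states)
    )


# The first two states of DFA 0 from the graminit.c excerpt.
print(emit_dfa(0, [[(2, 1), (3, 1), (4, 2)], [(0, 1)]]))
```

A complete generator would also need to emit the state arrays, the DFA array, and the label table that graminit.c contains; this sketch covers only the arcs.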

We want the pgen-native/ parser runtime to be linked into the Oil executable. It should:

  1. Take input from Oil's lexer.
  2. Parse the token stream using the generated parse tables, producing a parse tree.
  3. Hand the parse tree to the separate, existing code that translates it to an AST, which can be pretty-printed with the bin/oil -n flag (see below).
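To make the first two steps concrete, here is a toy table-driven parser. It is a sketch under several assumptions: the two-rule grammar, the (type, value) token format, and all names are hypothetical, and it recurses into nonterminals for brevity, whereas pgen-style runtimes keep an explicit stack of DFA states.

```python
# Toy grammar:   expr: term ('+' term)*     term: NUMBER
# Each nonterminal has a DFA: a list of states, each a list of
# (label, next_state) arcs, plus a set of accepting states.
GRAMMAR = {
    "expr": {"states": [[("term", 1)], [("+", 0)]], "final": {1}},
    "term": {"states": [[("NUMBER", 1)], []], "final": {1}},
}


def first(symbol):
    """Token types that can start `symbol` (assumes no left recursion)."""
    if symbol not in GRAMMAR:
        return {symbol}
    result = set()
    for label, _ in GRAMMAR[symbol]["states"][0]:
        result |= first(label)
    return result


def parse(symbol, tokens, pos):
    """Walk the DFA for `symbol`; return (parse_tree, new_pos)."""
    dfa = GRAMMAR[symbol]
    state = 0
    children = []
    while True:
        tok_type = tokens[pos][0] if pos < len(tokens) else "EOF"
        for label, next_state in dfa["states"][state]:
            if label in GRAMMAR:
                # Nonterminal arc: take it if the lookahead can start it.
                if tok_type in first(label):
                    child, pos = parse(label, tokens, pos)
                    children.append(child)
                    state = next_state
                    break
            elif label == tok_type:
                # Terminal arc: consume the token.
                children.append(tokens[pos])
                pos += 1
                state = next_state
                break
        else:
            # No arc matched: finish if accepting, otherwise report an error.
            if state in dfa["final"]:
                return (symbol, children), pos
            raise SyntaxError("unexpected %s at position %d" % (tok_type, pos))


def parse_tokens(tokens, start="expr"):
    """Parse a whole token list and require that all input is consumed."""
    tree, pos = parse(start, tokens, 0)
    if pos != len(tokens):
        raise SyntaxError("trailing input at position %d" % pos)
    return tree


print(parse_tokens([("NUMBER", "1"), ("+", "+"), ("NUMBER", "2")]))
```

The result is a parse tree that mirrors the grammar rules, not an AST. Turning it into an AST is the job of the separate translation step.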

Background

How to Parse Shell Like a Programming Language explains our parsing approach. This already works in Python:

$ bin/oil --ast-format text -n -c 'echo "hello $name"'
(command.Simple
  words: [
    (compound_word parts:[(Token id:Id.Lit_Chars span_id:0 val:echo)])
...

And it's already translated to C++:

$ _bin/cxx-dbg/osh_eval  -n -c 'echo "hello $name"'
(command.Simple
  words: [
    (compound_word parts:[(Token id:Id.Lit_Chars span_id:0 val:echo)])
...

This part does not use pgen2, because it's just shell. The Oil language has a var keyword, and parsing it does use pgen2:

~/git/oilshell/oil$ bin/oil --ast-format text -n -c 'var x = 1 + 2 * 3'
(command.VarDecl
  keyword: (Token id:Id.KW_Var span_id:0 val:var)
  lhs: [(name_type name:(Token id:Id.Expr_Name span_id:2 val:x))]
  rhs: 
    (expr.Binary
      op: (Token id:Id.Arith_Plus span_id:8 val:_)
      left: (expr.Const c:(Token id:Id.Expr_DecInt span_id:6 val:1))
      right: 
        (expr.Binary

However, it crashes in C++:

$ _bin/cxx-dbg/osh_eval --ast-format text -n -c 'var x = 1 + 2 * 3'
osh_eval: cpp/pgen2_parse.cc:8: void parse::Parser::setup(int): Assertion `0' failed.
Aborted (core dumped)

This is the case we want to make work in C++.

Data Snippets

~/git/oilshell/oil/Python-2.7.13$ head -n 15  Python/graminit.c 
/* Generated by Parser/pgen */

#include "pgenheaders.h"
#include "grammar.h"
PyAPI_DATA(grammar) _PyParser_Grammar;
static arc arcs_0_0[3] = {
    {2, 1},
    {3, 1},
    {4, 2},
};
static arc arcs_0_1[1] = {
    {0, 1},
};
static arc arcs_0_2[1] = {
    {2, 1},

Acceptance Tests

These tests run against Python, but I can make them run against C++ (the .asan variant).

$ oil_lang/run.sh soil-run

Relevant Files

  • cpp/pgen2_parse.cc -- where the code currently hits assert(0) if you try to parse Oil code.
  • oil_lang/grammar.pgen2 -- the grammar
  • oil_lang/grammar_gen.py
    • this currently generates Python data structures, but we want to generate C data structures, like Python's pgen.c does
    • Python data structures: _devbuild/gen/grammar.marshal and _devbuild/gen/grammar_nt.py (non-terminals)
  • oil_lang/expr_parse.py -- a wrapper for the generated parser
  • The pgen2/ directory
    • parse.py and more
  • pgen-native/ dir -- this is just a copy of Python, imported by a contributor
  • test/parse-errors.sh -- this can be used as an acceptance test.
    • It runs in the continuous build under both Python and C++: http://travis-ci.oilshell.org/github-jobs/
      • dev-minimal > parse-errors
      • cpp > parse-errors
    • Right now there are "if _is-oil-native" guards that skip these cases, so the assert(0) doesn't fire in C++!

Related