Parsing - sympy/sympy GitHub Wiki

Parsing

Here are some things related to parsing heuristics, e.g., for use at http://gamma.sympy.org. Much of this is adapted from http://groups.google.com/group/sympy/browse_thread/thread/ac643c2f9676e98d.

Types of things to parse

These are listed in the order of importance.

SymPy syntax: This is probably obvious, but correct SymPy/Python syntax should always be parsed exactly as it is given. If the heuristic parser has ambiguities problems that would prevent this, we can just preparse with a call to sympify(), and only use heuristics if that fails.
Mathematica, Maple, Maxima, etc. syntax. Where they conflict, we should pick the more popular variant, or if that's nontrivial, we can just let it decide as a side-effect of the implementation (i.e., leave the behavior undefined in that case).
LaTeX. The ability to parse LaTeX math into something that can be computed would be very useful. WolframAlpha has support for this. The young package, latex2sympy, supports parsing a subset of latex.
Text based math. What I mean here is, it should support parsing things as you would type them in plain text without trying to follow any kind of set grammar. Stuff like 3x^2 + 2, d/dx x^2.
Special symbols: Support stuff like √x or ∫x^2 dx. Allow, to some degree, pasting in stuff from the SymPy pretty printer (in particular, things that are not printed on more than one line, like 3⋅x₃).
Text based functions: Stuff like "integrate x^2 dx", "limit x^2 with respect to x as x approaches infinity".
Natural language processing: There is a vagary between this and the last bullet point. What I mean here is that it tries to guess what is meant from a plain text description without using a set grammar. This could even support stuff like "the integral of x squared with respect to x".

Parsing techniques

Operator-precedence

Somewhat limited: You cannot have operator symbols whose precedence depends on context.
This bites, for example, unary minus as defined in most programming languages. There are ways to deal with this, at the expense of complicating the parser (a bit) and the grammar specification (somewhat).

Extending the grammar requires the ability to grasp the concepts of precedence and associativity.

The parser can easily generate useful error messages.

Relatively easy to implement.

LL(k)

Surprisingly limited, it cannot parse any input that starts with k or more left parentheses, or any grammar that has more than k precedence levels with infix or postfix operators, or any grammar with a left-recursive rule such as Exp ::= Exp ^ Constant | Constant (as this example shows, left-recursive rules are required for right-associative operators).
There are ways around this, but restrictions on the allowable or advisable forms of a grammar remain.

Extending the grammar requires the ability to compute the sets of prefixes, which is moderately difficult and mostly doable in one's head.

The parser has a hard time generating useful error messages.

Implementation is moderately complex.

SLR

Only marginally less limited than operator-precedence.

Otherwise, the same advantages and disadvantages as LR(k) hold.

Generally not worth it.

LALR(k)

Almost as powerful as LR(k+1) in terms of grammars it can handle.

Generating error messages is somewhat more difficult than with LR(k).

Otherwise, the same advantages and disadvantages as LR(k) hold.

Usually considered the best compromise between implementation effort and power if the language grammar can be brought into LR form.

LR(k)

This is the most powerful widely-known algorithm for which ambiguities can be detected by inspecting the grammar.

Extending the grammar requires intimate knowledge of the parsing algorithm; otherwise, you'll be unable to interpret the "shift-reduce" and "reduce-reduce" conflict error messages.

Generating useful error messages is moderately difficult.

Implementation is complex: transforming the grammar into the tables that the parser needs is an expensive operation, and usually done as a separate compilation step.

Earley

Can handle arbitrary context-free grammars.

Cannot check whether a grammar is ambiguous. Will find out if a parse for a concrete input is ambiguous and can return the valid parses.

Extending the grammar is trivial: just add more grammar rules.
However, it is easy to add unwanted ambiguities.
If the grammar is under evolution, this means that you cannot persist a text and assume that it will parse exactly the same tomorrow: newly added grammar rules may have introduced alternative parses. (The consequence is to store the parsed form.)

I don't know how hard generating useful error messages is.
I suspect it's hard or impossible, but I'd have to ask a search engine to verify that.

Implementation is easy (150..200 lines of Python code suffice).
The algorithm is worst-case polynomial in input length: quadratic if the grammar is unambiguous, cubic if it is ambiguous.
This is not a serious issue for small inputs.
There are variants of Earley that have better worst-case complexities, but these are more complex.

Suitability of parsing techniques

It is expected that combining the grammar rules of such diverse sources as SymPy/Python, multiple symbolic math packages, and LaTeX will result in an ambiguous syntax.

The only technique that can handle an ambiguous syntax is the Earley parser.

If we wish an unambiguous technique, we'd have to disambiguate the grammars.
Which is hard, a huge amount of work, and requires a medium-to-large amount of work for every addition to the syntax.

The other alternative would be to give up on the ability to parse all variants.
The third, let the user choose which grammar to apply. This still requires a lot of work to disambiguate a useful subset of the LaTex syntax.