Inline Parsing - gmarpons/asciidoc-hs GitHub Wiki
⟨ 𝐢𝐧𝐥𝐢𝐧𝐞𝐬 ⟩  I → F Nu★
⟨ 𝐮𝐧𝐜𝐨𝐧𝐬𝐭𝐫𝐚𝐢𝐧𝐞𝐝 ⟩m  U → L? ϕ mu◃ ( G Y? | F ) Nu★ mu▹
⟨ 𝐜𝐨𝐧𝐬𝐭𝐫𝐚𝐢𝐧𝐞𝐝 ⟩m  C → L? φ mc◃ F Nc★ ω mc▹
⟨ 𝐟𝐢𝐫𝐬𝐭 ⟩  F → A | π X
⟨ 𝐧𝐞𝐱𝐭u ⟩  Nu → G Y? | σ ( U / O ) ( A | μ X )?
⟨ 𝐧𝐞𝐱𝐭c ⟩  Nc → G Y | μ ( U / O ) ( A | μ X )?
⟨ 𝐱 ⟩  X → U ( A | μ X )? / C ( G Y | μ X )? / O ( A | μ X )?
⟨ 𝐲 ⟩  Y → A | σ X
⟨ 𝐚𝐭𝐭𝐫𝐢𝐛𝐮𝐭𝐞-𝐥𝐢𝐬𝐭 ⟩  L → ‘
⟨ 𝐧𝐞𝐰𝐥𝐢𝐧𝐞 ⟩  N →
⟨ 𝐦𝐚𝐫𝐤u ⟩  mu◃, mu▹ →
⟨ 𝐦𝐚𝐫𝐤c ⟩  mc◃, mc▹ →
⟨ 𝐠𝐚𝐩 ⟩  G = Longest sequence of ( S | N )
⟨ 𝐚𝐥𝐩𝐡𝐚𝐧𝐮𝐦 ⟩  A = Longest sequence of { c | isAlphaNum c }
⟨ 𝐬𝐩𝐚𝐜𝐞 ⟩  S = Longest sequence of { c | isSpace c AND c ≠ newline }
⟨ 𝐨𝐭𝐡𝐞𝐫 ⟩  O = { c | NOT ( isSpace c OR isAlphaNum c ) }
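The lexical rules G, A, S, and O can be illustrated with `span` from the Haskell Prelude. This is only a minimal sketch, not the asciidoc-hs implementation: the names `Recognizer`, `longest`, `alphanum`, `space`, `gap`, and `other` are hypothetical, the input is assumed to be a plain `String`, and `'\n'` is assumed to be the only newline character.

```haskell
import Data.Char (isAlphaNum, isSpace)

-- | A recognizer returns the consumed token and the remaining input,
-- or Nothing when it cannot consume at least one character.
-- (Hypothetical type, introduced only for this sketch.)
type Recognizer = String -> Maybe (String, String)

-- Longest non-empty prefix whose characters all satisfy the predicate.
longest :: (Char -> Bool) -> Recognizer
longest p input = case span p input of
  ([], _)     -> Nothing
  (tok, rest) -> Just (tok, rest)

-- A: longest sequence of alphanumeric characters.
alphanum :: Recognizer
alphanum = longest isAlphaNum

-- S: longest sequence of space characters other than newline
-- (assumption: newline is '\n').
space :: Recognizer
space = longest (\c -> isSpace c && c /= '\n')

-- G: longest sequence of (S | N), i.e. any run of space-like
-- characters, newlines included.
gap :: Recognizer
gap = longest isSpace

-- O: exactly one character that is neither space-like nor alphanumeric.
other :: Recognizer
other (c : rest)
  | not (isSpace c || isAlphaNum c) = Just ([c], rest)
other _ = Nothing
```

For instance, `alphanum "foo bar"` yields `Just ("foo", " bar")`, while `other "foo"` yields `Nothing`. Note that O consumes a single character, whereas A, S, and G are greedy.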
Interpretation of special symbols:
- I is the start symbol. Parsing succeeds only if the whole input is consumed.
- ‘|’ is conventional EBNF alternation.
- ‘/’ is like ordered choice in PEGs. We only use this operator between sequences starting with U, C, and O, and the priority among them is always in that order.
- Parsing is carried out with a supporting data structure: a LIFO stack called openEnclosures. It is used to disambiguate and to support context-sensitive parsing.
- For a given instantiation of rule U, mu◃ is something like {x ↤ mu ; push(x, openEnclosures)} and mu▹ is {mu ; pop(openEnclosures)}, where both mu are equal. That is, mu◃ and mu▹ not only recognize the (same) token mark, but also update the openEnclosures stack. The same holds for an instantiation of rule C.
- Symbols ϕ, φ, ω, π, σ, and μ are predicate placeholders that, in combination with the openEnclosures stack, can be used to disambiguate the grammar. They do not consume any input; they only check that certain conditions are met and fail or succeed accordingly. Different high-level disambiguation rules can be implemented.

  Example 1. Implementation of a rule ‘Cannot nest two enclosures with identical mark’

  Predicates ϕ and φ both need to use lookahead and fail if the found mark is already present (on top of the stack or otherwise) in openEnclosures.
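The stack operations and the predicate of Example 1 can be sketched as pure functions. This is a hedged illustration, not the library's code: `Mark`, `OpenEnclosures`, `openMark`, `closeMark`, and `mayOpen` are hypothetical names, marks are assumed to be plain strings, and the check that a closing mark equals the mark on top of the stack is one possible reading of "both mu are equal".

```haskell
-- | A formatting mark such as "*" or "__" (hypothetical representation).
type Mark = String

-- | LIFO stack of currently open enclosures; the head of the list is the top.
type OpenEnclosures = [Mark]

-- Effect of m◃: record the opening mark on top of the stack.
openMark :: Mark -> OpenEnclosures -> OpenEnclosures
openMark m stack = m : stack

-- Effect of m▹: succeed only if the closing mark equals the mark on top
-- of the stack, and pop it; otherwise reject the closing mark.
-- (Assumption: this is how "both marks are equal" is enforced.)
closeMark :: Mark -> OpenEnclosures -> Maybe OpenEnclosures
closeMark m (top : rest)
  | m == top = Just rest
closeMark _ _ = Nothing

-- Predicate for Example 1 (the role played by ϕ and φ): it consumes no
-- input and fails if the mark found by lookahead is already present
-- anywhere in the stack, on top or otherwise.
mayOpen :: Mark -> OpenEnclosures -> Bool
mayOpen m stack = m `notElem` stack
```

With the stack `["_", "*"]`, `mayOpen "*"` is `False` because `*` is already open further down, while `mayOpen "#"` is `True`. Keeping the predicate separate from `openMark` mirrors the description above: predicates only inspect state, and only the mark symbols update it.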
We classify input tokens into three mutually exclusive classes:
- Class a for alphanumeric characters.
- Class g for space-like characters, including newlines.
- Class o for any other character, including punctuation symbols and formatting marks.
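The three classes can be written down as a total function on characters. The sketch below is an assumption-laden illustration: `TokenClass`, `classify`, and `isCandidateInline` are hypothetical names, and classification is done per character using `isAlphaNum` and `isSpace` from `Data.Char`, following the definitions of A, G, and O.

```haskell
import Data.Char (isAlphaNum, isSpace)

-- | The three mutually exclusive classes (hypothetical type).
data TokenClass = ClassA | ClassG | ClassO
  deriving (Eq, Show)

-- Every character falls into exactly one class: the guards are tried in
-- order and the last one catches everything else.
classify :: Char -> TokenClass
classify c
  | isAlphaNum c = ClassA
  | isSpace c    = ClassG
  | otherwise    = ClassO

-- Input shape that the theorem below talks about: at least one element,
-- and the first element is not space-like.
isCandidateInline :: String -> Bool
isCandidateInline []      = False
isCandidateInline (c : _) = classify c /= ClassG
```

For example, `map classify "a *"` gives `[ClassA, ClassG, ClassO]`, and `isCandidateInline " x"` is `False`.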
The following theorem ensures that any finite, non-empty sequence that does not start with a space-like character can be parsed as an inline, and that the constraints of constrained formatting pairs are preserved.
Assuming that the following condition is met:
- ϕ, φ, ω, π, σ, and μ can only fail if the next item in the input is an o,

grammar ℑ has the following properties:
- All non-terminals can be reached from I.

- X can always consume any new o that is left in the input.
- After consuming all o's, X can always consume any new a that is left in the input.
- X can only be preceded by g or o.

- Y can only be preceded by g.

- F can always consume any new o that is left in the input.
- F can always consume any new a that is left in the input.

- Nu and Nc can always consume any new o that is left in the input.
- Nu and Nc can always consume any new a that is left in the input.

- I can only start with a or o.
- I can end with any of a, g, or o.

- C can only be preceded by g or o, or be at input start.
- C can only be followed by g, o, or EOF.
- All g and o can be followed by a C, and C can be at input start.
- mc◃ can only be followed by a or o.
- mc▹ can only be preceded by a or o.

- Any a can be followed by any of A, G, O, U, or EOF.
- Any g that is not inside a C can be followed by any of A, C, G, O, U, and also EOF if it is outside C.

- Any o can be followed by any of C, O, or U and,
  - if not part of m◃, can also be followed by EOF and,
  - if not part of mc◃, can also be followed by G and,
  - if not part of mc▹, can also be followed by A.

- Grammar ℑ has no left-recursion.
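The four boundary properties of C (what may precede and follow the enclosure, and what may sit next to each mark on the inside) can be checked mechanically. The following is a sketch under stated assumptions rather than part of the grammar: the `Boundary` record, its field names, and `respectsConstraints` are hypothetical, and `Nothing` is used to model both input start and EOF.

```haskell
import Data.Char (isAlphaNum, isSpace)

-- Character classes g, a, and o, following the definitions of G, A, and O.
isG, isA, isO :: Char -> Bool
isG = isSpace
isA = isAlphaNum
isO c = not (isSpace c || isAlphaNum c)

-- | Characters around and just inside a candidate constrained enclosure
-- (hypothetical record, introduced only for this sketch).
data Boundary = Boundary
  { before      :: Maybe Char -- precedes mc◃; Nothing at input start
  , afterOpen   :: Char       -- follows mc◃
  , beforeClose :: Char       -- precedes mc▹
  , after       :: Maybe Char -- follows mc▹; Nothing at EOF
  }

-- True when all four boundary properties listed for C hold.
respectsConstraints :: Boundary -> Bool
respectsConstraints b =
     maybe True (\c -> isG c || isO c) (before b)     -- preceded by g, o, or start
  && (isA (afterOpen b) || isO (afterOpen b))         -- mc◃ followed by a or o
  && (isA (beforeClose b) || isO (beforeClose b))     -- mc▹ preceded by a or o
  && maybe True (\c -> isG c || isO c) (after b)      -- followed by g, o, or EOF
```

For the input `a *bold* b`, the enclosure `*bold*` corresponds to `Boundary (Just ' ') 'b' 'd' (Just ' ')` and passes the check; for `a*bold*b` the preceding character is an a, so the check fails.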
We need a third theorem stating that, given certain properties of ϕ, φ, etc., grammar ℑ is unambiguous.