Inline Parsing - gmarpons/asciidoc-hs Wiki

🚨 github-wiki-see.page does not render asciidoc. Source for crawling below. Please visit the Original URL! 🚨


== Grammar

// TODO. Haskell definition has a different definition for <y> (it integrates the G), but we like this one better

// We use http://xahlee.info/comp/unicode_math_font.html for formatting the first column.
.Grammar ℑ for AsciiDoc inlines
****
[cols="25,~,~,60%", frame=none, grid=none]
|===
| ⟨ 𝐢𝐧𝐥𝐢𝐧𝐞𝐬 ⟩ | _I_ | -> |  _F_{nbsp} _N~u~_^★^{nbsp}
| ⟨ 𝐮𝐧𝐜𝐨𝐧𝐬𝐭𝐫𝐚𝐢𝐧𝐞𝐝 ⟩~m~ | _U_ | -> |  _L_^*?*^ _ϕ_{nbsp} m~_u_~^◃^{nbsp} ( _G_ _Y_^*?*^ | _F_ ){nbsp} _N~u~_^★^{nbsp} m~_u_~^▹^
| ⟨ 𝐜𝐨𝐧𝐬𝐭𝐫𝐚𝐢𝐧𝐞𝐝 ⟩~m~ | _C_ | -> | _L_^*?*^ _φ_{nbsp} m~c~^◃^{nbsp} _F_{nbsp} _N~c~_^★^ _ω_{nbsp} m~c~^▹^
| ⟨ 𝐟𝐢𝐫𝐬𝐭 ⟩ | _F_ | -> | _A_ | _π_{nbsp} _X_
| ⟨ 𝐧𝐞𝐱𝐭~_u_~ ⟩ | _N_~_u_~ | -> |  _G_{nbsp} _Y_^*{nbsp}?*^ | _σ_{nbsp} ( _U_{nbsp} /{nbsp} _O_ ){nbsp} ( _A_ | _μ_{nbsp}  _X_ )^*{nbsp}?*^
| ⟨ 𝐧𝐞𝐱𝐭~_c_~ ⟩ | _N_~_c_~ | -> | _G_{nbsp} _Y_ | _μ_{nbsp} ( _U_{nbsp} /{nbsp} _O_ ){nbsp} ( _A_ | _μ_{nbsp}  _X_ )^*{nbsp}?*^
| ⟨ 𝐱 ⟩ | _X_ | -> | _U_{nbsp} ( _A_ | _μ_{nbsp} _X_ )^*{nbsp}?*^{nbsp} /{nbsp} _C_{nbsp} ( _G_{nbsp} _Y_ | _μ_{nbsp} _X_ )^*{nbsp}?*^{nbsp} /{nbsp} _O_{nbsp} ( _A_ | _μ_{nbsp} _X_ )^*{nbsp}?*^
| ⟨ 𝐲 ⟩ | _Y_ | -> | _A_ | _σ_{nbsp} _X_
// | ⟨ 𝐞𝐬𝐜𝐚𝐩𝐞𝐝 ⟩ | _E_ | -> |
| ⟨ 𝐚𝐭𝐭𝐫𝐢𝐛𝐮𝐭𝐞-𝐥𝐢𝐬𝐭 ⟩ | _L_ | -> | '`{blank}``[``{blank}`' ⋯ '`{blank}``]``{blank}`'
| ⟨ 𝐧𝐞𝐰𝐥𝐢𝐧𝐞 ⟩ | _N_ | -> | `CR` | `CR` `LF` | `LF`
| ⟨ 𝐦𝐚𝐫𝐤~_u_~ ⟩ | m~_u_~^◃^, m~_u_~^▹^ | -> | `**` | `+__+` | `##` | ⋯
| ⟨ 𝐦𝐚𝐫𝐤~_c_~ ⟩ | m~_c_~^◃^, m~_c_~^▹^ | -> | `*` | `+_+` | `#` | ⋯
| ⟨ 𝐠𝐚𝐩 ⟩ | _G_ | = | Longest sequence of ( _S_ | _N_ )
| ⟨ 𝐚𝐥𝐩𝐡𝐚𝐧𝐮𝐦 ⟩ | _A_ | = | Longest sequence of { c | isAlphaNum c }
| ⟨ 𝐬𝐩𝐚𝐜𝐞 ⟩ | _S_ | = | Longest sequence of { c | isSpace c{nbsp} AND c{nbsp} ≠ `CR`{nbsp} AND c{nbsp} ≠ `LF` }
| ⟨ 𝐨𝐭𝐡𝐞𝐫 ⟩ | _O_ | = | { c | NOT{nbsp} ( isSpace c OR isAlphaNum c ) }
|===
****

Interpretation of special symbols:

* _I_ is the start symbol.
*A successful parsing only occurs if the whole input is consumed*.
* '`|`' is conventional EBNF alternation.
* '`/`' is like ordered choice in PEGs.
We only use this operator between sequences starting with _U_, _C_, and _O_, and that one is always the priority among them.
* Parsing is carried out with a supporting data structure: a LIFO stack called `openEnclosures`.
It is used to disambiguate and support context-sensitive parsing.
* For a given _U_ rule instantiation, m~_u_~^◃^{nbsp} is somthing like `{x ↤ m~u~ ; push(x, openEnclosures}` and m~_u_~^▹^{nbsp} is `{m~u~ ; pop(openEnclosures)}`, both m~_u_~{nbsp} are equal.
I.e., m~_u_~^◃^{nbsp} and m~_u_~^▹^{nbsp} not only recognize the (same) token mark, but also update the `openEnclosures` stack.
Idem for _C_ rule instantiation.
* Symbols _ϕ_, _φ_, _ω_, _π_, _σ_, and _μ_ are predicate placeholders that, in combination with the `openEnclosures` stack, can be used to disambiguate the grammar.
They do not consume any input, only check that certain conditions are met and fail or succeed accordingly.
Different high-level disambiguation rules can be implemented.
+
.Implementation of a rule '`*Cannot nest two enclosures with identical mark*`'
====
Predicates _ϕ_ and _φ_ both need to use lookahead and fail if the found mark is already present (on top of the stack or otherwise) in `openEnclosures`.
====

== Grammar properties

We classify input tokens in three mutually exclusive classes:

. Class _a_ for alphanumeric characters.
. Class _g_ for space-like characters, including newlines.
. Class _o_ for any other character, including punctuation symbols and formatting marks.

The following theorem ensures that any finite sequence of at least one element not starting with a space-like character can be parsed as an inline, and that the constraints of https://docs.asciidoctor.org/asciidoc/latest/text/#constrained[constrained formatting pairs] are preserved.

.*Theorem 1*
****
Assuming that the following conditions are met:

a. _ϕ_, _φ_, _ω_, _π_, _σ_, and _μ_ can only fail if the next item in the input is an _o_,

grammar ℑ has the following properties:

[start=0]
0. All non-terminals can be reached from _I_.

1. {blank}
+
--
a. _X_ can always consume any new _o_ that is left in the input.
b. After consuming all _o_'s, _X_ can always consume any new _a_ that is left in the input.
c. _X_ can only be preceded by _g_ or _o_.
--

2. _Y_ can only be preceded by _g_.

3. {blank}
+
--
a. _F_ can always consume any new _o_ that is left in the input.
b. _F_ can always consume any new _a_ that is left in the input.
--

4. {blank}
+
--
a. _N~u~_ and _N~c~_ can always consume any new _o_ that is left in the input.
b. _N~u~_ and _N~c~_ can always consume any new _a_ that is left in the input.
--

5. {blank}
+
--
a. _I_ can only start with _a_ or _o_.
b. _I_ can end with any of _a_, _g_, or _o_.
--

6. {blank}
+
--
a. _C_ can only be preceded by _g_ or _o_, or be at input start.
b. _C_ can only be followed by _g_, _o_, or `EOF`.
--

7. All _g_ and _o_ can be followed by a _C_, and _C_ can be at input start.

8. {blank}
+
--
a. m~_c_~^◃^{nbsp} can only be followed by _a_ or _o_.
b. m~_c_~^▹^{nbsp} can only be preceded by _a_ or _o_.
--

9. {blank}
+
--
a. Any _a_ can be followed by any of _A_, _G_, _O_, _U_, or `EOF`.
b. Any _g_ that is not inside a _C_ can be followed by any of _A_, _C_, _G_, _O_, _U_, and also `EOF` if it is outside _C_.
--
// The properties above use that _A_ and _G_ have longest sequence rules.

10. {blank}
+
--
a. Any _o_ can be followed by any of _C_, _O_, or _U_ and,
b. if not part of m^◃^, can also be followed by `EOF` and,
c. if not part of m~_c_~^◃^, can also be followed by _G_ and,
d. if not part of m~_c_~^▹^, can also be followed by _A_.
--
****

.*Theorem 2*
****
Grammar ℑ has no left-recursion.
****

We need a third theorem stating that, given certain properties of _ϕ_, _φ_, etc., grammar ℑ is non-ambiguous.

=== Proof of Theorem 1

1. TODO