Inline Parsing - gmarpons/asciidoc-hs GitHub Wiki

Grammar

Grammar ℑ for AsciiDoc inlines

⟨ 𝐢𝐧𝐥𝐢𝐧𝐞𝐬 ⟩	I	→	F N_u^★
⟨ 𝐮𝐧𝐜𝐨𝐧𝐬𝐭𝐫𝐚𝐢𝐧𝐞𝐝 ⟩_m	U	→	L^？ ϕ m_u^◃ ( G Y^？｜ F ) N_u^★ m_u^▹
⟨ 𝐜𝐨𝐧𝐬𝐭𝐫𝐚𝐢𝐧𝐞𝐝 ⟩_m	C	→	L^？ φ m_c^◃ F N_c^★ ω m_c^▹
⟨ 𝐟𝐢𝐫𝐬𝐭 ⟩	F	→	A ｜ π X
⟨ 𝐧𝐞𝐱𝐭_u ⟩	N_u	→	G Y^？｜ σ ( U / O ) ( A ｜ μ X )^？
⟨ 𝐧𝐞𝐱𝐭_c ⟩	N_c	→	G Y ｜ μ ( U / O ) ( A ｜ μ X )^？
⟨ 𝐱 ⟩	X	→	U ( A ｜ μ X )^？ / C ( G Y ｜ μ X )^？ / O ( A ｜ μ X )^？
⟨ 𝐲 ⟩	Y	→	A ｜ σ X
⟨ 𝐚𝐭𝐭𝐫𝐢𝐛𝐮𝐭𝐞-𝐥𝐢𝐬𝐭 ⟩	L	→	‘`[`’ ⋯ ‘`]`’
⟨ 𝐧𝐞𝐰𝐥𝐢𝐧𝐞 ⟩	N	→	`CR` ｜ `CR` `LF` ｜ `LF`
⟨ 𝐦𝐚𝐫𝐤_u ⟩	m_u^◃, m_u^▹	→	`**` ｜ `__` ｜ `##` ｜ ⋯
⟨ 𝐦𝐚𝐫𝐤_c ⟩	m_c^◃, m_c^▹	→	`*` ｜ `_` ｜ `#` ｜ ⋯
⟨ 𝐠𝐚𝐩 ⟩	G	=	Longest sequence of ( S ｜ N )
⟨ 𝐚𝐥𝐩𝐡𝐚𝐧𝐮𝐦 ⟩	A	=	Longest sequence of ｛ c ｜ isAlphaNum c ｝
⟨ 𝐬𝐩𝐚𝐜𝐞 ⟩	S	=	Longest sequence of ｛ c ｜ isSpace c AND c ≠ `CR` AND c ≠ `LF` ｝
⟨ 𝐨𝐭𝐡𝐞𝐫 ⟩	O	=	｛ c ｜ NOT ( isSpace c OR isAlphaNum c ) ｝
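
The lexical classes G, A, S, and O at the bottom of the grammar can be sketched directly with `span` from the Haskell Prelude. This is only an illustrative sketch, not code taken from asciidoc-hs: the function names (`alphaRun`, `spaceRun`, `gapRun`, `other`, `nonEmptySpan`) are hypothetical, and it assumes that "longest sequence" means a non-empty run.

```haskell
import Data.Char (isAlphaNum, isSpace)

-- Hypothetical helper: longest non-empty prefix satisfying a predicate,
-- together with the remaining input.
nonEmptySpan :: (Char -> Bool) -> String -> Maybe (String, String)
nonEmptySpan p input = case span p input of
  ([], _) -> Nothing
  result  -> Just result

-- A: longest run of alphanumeric characters.
alphaRun :: String -> Maybe (String, String)
alphaRun = nonEmptySpan isAlphaNum

-- S: longest run of space characters other than CR and LF.
spaceRun :: String -> Maybe (String, String)
spaceRun = nonEmptySpan (\c -> isSpace c && c /= '\r' && c /= '\n')

-- G: longest run of S or N. Because N only covers CR and LF, this is
-- the longest run of characters satisfying isSpace.
gapRun :: String -> Maybe (String, String)
gapRun = nonEmptySpan isSpace

-- O: a single character that is neither space-like nor alphanumeric.
other :: String -> Maybe (Char, String)
other (c : rest) | not (isSpace c || isAlphaNum c) = Just (c, rest)
other _ = Nothing
```

Each function returns the recognized token and the unconsumed rest of the input, or `Nothing` when the input does not start with a token of that class.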

Interpretation of special symbols:

I is the start symbol. A parse succeeds only if the whole input is consumed.
‘｜’ is conventional EBNF alternation.
‘/’ behaves like ordered choice in PEGs. We use this operator only between sequences starting with U, C, and O, and the priority among them is always in that order.
Parsing is carried out with a supporting data structure: a LIFO stack called openEnclosures. It is used to disambiguate and support context-sensitive parsing.
For a given instantiation of the U rule, m_u^◃ is something like {x ↤ m_u ; push(x, openEnclosures)} and m_u^▹ is {m_u ; pop(openEnclosures)}, where both occurrences of m_u denote the same mark. That is, m_u^◃ and m_u^▹ not only recognize the (same) mark token but also update the openEnclosures stack. The same applies to instantiations of the C rule.
Symbols ϕ, φ, ω, π, σ, and μ are predicate placeholders that, in combination with the openEnclosures stack, can be used to disambiguate the grammar. They do not consume any input, only check that certain conditions are met and fail or succeed accordingly. Different high-level disambiguation rules can be implemented.
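
The way opening and closing marks update the openEnclosures stack can be illustrated with a small state-passing sketch. The types and names below (`Mark`, `OpenEnclosures`, `openMark`, `closeMark`) are hypothetical and are not taken from asciidoc-hs. As a simplifying assumption, the closing mark is looked up on top of the stack, which is one way to guarantee that both marks of an enclosure are the same.

```haskell
import Data.List (stripPrefix)

-- Hypothetical types: a mark is its literal text, and the head of the
-- list is the top of the stack.
type Mark = String
type OpenEnclosures = [Mark]

-- Opening mark: recognize the given mark at the start of the input,
-- consume it, and push it onto the stack.
openMark :: Mark -> (OpenEnclosures, String) -> Maybe (OpenEnclosures, String)
openMark mark (stack, input) = do
  rest <- stripPrefix mark input
  pure (mark : stack, rest)

-- Closing mark: recognize the mark that is on top of the stack,
-- consume it, and pop it. Fails if the stack is empty.
closeMark :: (OpenEnclosures, String) -> Maybe (OpenEnclosures, String)
closeMark (mark : stack, input) = do
  rest <- stripPrefix mark input
  pure (stack, rest)
closeMark ([], _) = Nothing
```

For example, `openMark "**" ([], "**bold**")` yields `Just (["**"], "bold**")`, and `closeMark (["**"], "** rest")` yields `Just ([], " rest")`.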

Example 1. Implementation of the rule ‘Cannot nest two enclosures with identical marks’

Predicates ϕ and φ both use lookahead and fail if the mark they find is already present anywhere in openEnclosures, whether on top of the stack or deeper.
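
A possible shape for such a predicate is sketched below. The names `markAhead` and `noIdenticalNesting` and the `Mark` and `OpenEnclosures` types are hypothetical, as is the idea of passing the candidate marks as a list (for instance the unconstrained marks for ϕ and the constrained ones for φ). The sketch also assumes that the predicate succeeds when no candidate mark is ahead, leaving it to the mark rule itself to fail in that case.

```haskell
import Data.List (find, isPrefixOf)

-- Hypothetical types: a mark is its literal text, and the head of the
-- list is the top of the stack.
type Mark = String
type OpenEnclosures = [Mark]

-- Lookahead only: find the first candidate mark the input starts with.
-- No input is consumed.
markAhead :: [Mark] -> String -> Maybe Mark
markAhead candidates input = find (`isPrefixOf` input) candidates

-- Predicate: True (succeed) unless the mark found ahead is already
-- open somewhere in the stack, not only on top of it.
noIdenticalNesting :: [Mark] -> OpenEnclosures -> String -> Bool
noIdenticalNesting candidates stack input =
  case markAhead candidates input of
    Nothing   -> True
    Just mark -> mark `notElem` stack
```

With `["_", "*"]` as the stack, `noIdenticalNesting ["*", "_", "#"] ["_", "*"] "*x*"` is `False`, because `*` is already open below the top of the stack.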

Grammar properties

We classify input tokens into three mutually exclusive classes:

Class a for alphanumeric characters.
Class g for space-like characters, including newlines.
Class o for any other character, including punctuation symbols and formatting marks.
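
As a minimal sketch, the three classes can be written as a total function over characters, using the same `isAlphaNum` and `isSpace` predicates that appear in the grammar. The `TokenClass` type and the `classify` function are hypothetical names chosen for illustration.

```haskell
import Data.Char (isAlphaNum, isSpace)

-- Hypothetical type naming the three classes a, g, and o.
data TokenClass = ClassA | ClassG | ClassO
  deriving (Eq, Show)

-- Every character falls into exactly one class: the guards are tried
-- in order and the last one catches everything else.
classify :: Char -> TokenClass
classify c
  | isAlphaNum c = ClassA  -- alphanumeric characters
  | isSpace c    = ClassG  -- space-like characters, including newlines
  | otherwise    = ClassO  -- punctuation, formatting marks, and the rest
```

For instance, `map classify "a *"` evaluates to `[ClassA, ClassG, ClassO]`.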

The following theorem ensures that any finite, non-empty sequence that does not start with a space-like character can be parsed as an inline, and that the constraints of constrained formatting pairs are preserved.

Theorem 1

Assuming that the following conditions are met:

ϕ, φ, ω, π, σ, and μ can only fail if the next item in the input is an o,

grammar ℑ has the following properties:

1. All non-terminals can be reached from I.
2. (a) X can always consume any new o that is left in the input.
   (b) After consuming all o's, X can always consume any new a that is left in the input.
   (c) X can only be preceded by g or o.
3. Y can only be preceded by g.
4. (a) F can always consume any new o that is left in the input.
   (b) F can always consume any new a that is left in the input.
5. (a) N_u and N_c can always consume any new o that is left in the input.
   (b) N_u and N_c can always consume any new a that is left in the input.
6. (a) I can only start with a or o.
   (b) I can end with any of a, g, or o.
7. (a) C can only be preceded by g or o, or be at input start.
   (b) C can only be followed by g, o, or EOF.
8. All g and o can be followed by a C, and C can be at input start.
9. (a) m_c^◃ can only be followed by a or o.
   (b) m_c^▹ can only be preceded by a or o.
10. (a) Any a can be followed by any of A, G, O, U, or EOF.
    (b) Any g that is not inside a C can be followed by any of A, C, G, O, U, and also EOF if it is outside C.
11. (a) Any o can be followed by any of C, O, or U and,
    (b) if not part of m^◃, can also be followed by EOF and,
    (c) if not part of m_c^◃, can also be followed by G and,
    (d) if not part of m_c^▹, can also be followed by A.

Theorem 2

Grammar ℑ has no left-recursion.

We need a third theorem stating that, given certain properties of ϕ, φ, etc., grammar ℑ is unambiguous.

Proof of Theorem 1

TODO
