Context free grammars - piotr-yuxuan/PoC GitHub Wiki
Parsing or syntactic analysis is the process of analysing a string of symbols, either in natural language or in computer languages, conforming to the rules of a formal grammar.
Nowadays, chess can be written down in PGN: the Portable Game Notation (PGN) is a plain text computer-processible format for recording chess games (both the moves and related data). A specification can be found in the bibliography and many examples can be found throughout the internet, including Wikipedia.
I want to be able to parse PGN is it's the de facto standard and let me access an immense chess corpus. For the sake of clarity, I've chosen to describe this notation with a context-free grammar.
A context-free grammar is a grammar in which the left-hand side of each production rule consists of only a single nonterminal symbol. This restriction is non-trivial; not all languages can be generated by context-free grammars. Those that can are called context-free languages.
Here's a typical example of a context-free grammar one might see in a textbook on automata or parsing. It is a common convention in many textbooks to use the capital letter S to indicate the starting rule, so for this example, we'll follow that convention:
S = AB*
AB = A B
A = 'a'+
B = 'b'+
This is a context-free grammar because AB
is one single token, not the same as A B
. This grammar looks for sequences made of one or more 'a' followed by one or more 'b'
With instaparse
generating a parser out of this grammar is very easy:
(def as-and-bs
(i/parser
"S = AB*
AB = A B
A = 'a'+
B = 'b'+"))
;; => (as-and-bs "abbaaaabb")
[:S [:AB [:A "a"] [:B "b" "b"]] [:AB [:A "a" "a" "a" "a"] [:B "b" "b"]]]
;; => (as-and-bs "abba") ;; any 'a' must be followed by a 'b'.
;; Parse error at line 1, column 5:
;; abba
;; ^
;; Expected one of:
;; "b"
;; "a"
However, our job is not to parse a-and-b-made string but a real game record. Let's take this one:
[Event "F/S Return Match"]
[Site "Belgrade, Serbia Yugoslavia|JUG"]
[Date "1992.11.04"]
[Round "29"]
[White "Fischer, Robert J."]
[Black "Spassky, Boris V."]
[Result "1/2-1/2"]
1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 {This opening is called the Ruy Lopez.}
4. Ba4 Nf6 5. O-O Be7 6. Re1 b5 7. Bb3 d6 8. c3 O-O 9. h3 Nb8 10. d4 Nbd7
11. c4 c6 12. cxb5 axb5 13. Nc3 Bb7 14. Bg5 b4 15. Nb1 h6 16. Bh4 c5 17. dxe5
Nxe4 18. Bxe7 Qxe7 19. exd6 Qf6 20. Nbd2 Nxd6 21. Nc4 Nxc4 22. Bxc4 Nb6
23. Ne5 Rae8 24. Bxf7+ Rxf7 25. Nxf7 Rxe1+ 26. Qxe1 Kxf7 27. Qe3 Qg5 28. Qxg5
hxg5 29. b3 Ke6 30. a3 Kd6 31. axb4 cxb4 32. Ra5 Nd5 33. f3 Bc8 34. Kf2 Bf5
35. Ra7 g6 36. Ra6+ Kc5 37. Ke1 Nf4 38. g3 Nxh3 39. Kd2 Kb5 40. Rd6 Kc5 41. Ra6
Nf2 42. g4 Bd3 43. Re6 1/2-1/2
My parser is four-fold in tokeniser, presenter, translator and reducer. Parsing a PGN record can obviously be made in one step but I want this work to stay simple and straightforward.
This four-layered parser is produced by a factory function. I explain the meaning of each argument below.
(defn parser-factory
"A string is taken as input by the tokeniser which returns nested
vectors of tokens. The presenter returns a more convenient data
structure. The translator change the position tokens to something
understandable by the program."
[tokeniser presenter translator reducer]
(fn [string] (-> string
tokeniser
presenter
translator
reducer)))
The main goal of this step is to break down the long string we get into intelligible tokens we can later deal with. The grammar of this step also enforces some PGN standard rules: every record should start with some precise tags and so on.
When the tokeniser for PGN is given the previous string, it outputs nested vectors. The first item of any vector is the type of the token found, and inner vectors are its content.
[:PGN
[:Tags
[:Event "F/S Return Match"]
[:Site "Belgrade, Serbia Yugoslavia|JUG"]
[:Date "1992.11.04"]
[:Round "29"]
[:White "Fischer, Robert J."]
[:Black "Spassky, Boris V."]
[:Result "\"1/2-1/2\""]]
[:Movetext
[:Token "e4"]
[:Token "e5"]
[:Token "Nf3"]
[:Token "Nc6"]
[:Token "Bb5"]
[:Token "a6"]
[:Token "Ba4"]
[:Token "Nf6"]
[:Token "O-O"]
[:Token "Be7"]
;; (... truncated ...)
[:Token "Nf2"]
[:Token "g4"]
[:Token "Bd3"]
[:Token "Re6"]]]
We will next get focused in the :Movetext
node which contains what is called here the history.
Once we know what we are working on, the presenter is in charge of getting rid of anything not useful. As all (classical) chess games start from the same initial state, we can reply the history from this state.
This step is two-fold: first it refactors its input into a map and then it keep only what we need, hence the history:
;; full output here
'("e4" "e5" "Nf3" "Nc6" "Bb5" "a6" "Ba4" "Nf6" "O-O" "Be7" "Re1" "b5" "Bb3" "d6" "c3" "O-O" "h3" "Nb8" "d4" "Nbd7" "c4" "c6" "cxb5" "axb5" "Nc3" "Bb7" "Bg5" "b4" "Nb1" "h6" "Bh4" "c5" "dxe5" "Nxe4" "Bxe7" "Qxe7" "exd6" "Qf6" "Nbd2" "Nxd6" "Nc4" "Nxc4" "Bxc4" "Nb6" "Ne5" "Rae8" "Bxf7+" "Rxf7" "Nxf7" "Rxe1+" "Qxe1" "Kxf7" "Qe3" "Qg5" "Qxg5" "hxg5" "b3" "Ke6" "a3" "Kd6" "axb4" "cxb4" "Ra5" "Nd5" "f3" "Bc8" "Kf2" "Bf5" "Ra7" "g6" "Ra6+" "Kc5" "Ke1" "Nf4" "g3" "Nxh3" "Kd2" "Kb5" "Rd6" "Kc5" "Ra6" "Nf2" "g4" "Bd3" "Re6")
What is meant by translation? At that point, the parser has all the data it needs and only it. It gets this data in a list of strings. The translator is mapped onto each element of that list to turn it into a list of atoms. Each atom describes one move.
(([:L "e"] [:F "4"])
([:L "e"] [:F "5"])
([:Piece "N"] [:L "f"] [:F "3"])
([:Piece "N"] [:L "c"] [:F "6"])
([:Piece "B"] [:L "b"] [:F "5"])
([:L "a"] [:F "6"])
([:Piece "B"] [:L "a"] [:F "4"])
([:Piece "N"] [:L "f"] [:F "6"])
([:Castling [:KingSide]])
([:Piece "B"] [:L "e"] [:F "7"])
;; (... truncated ...)
([:Piece "N"] [:L "f"] [:F "2"])
([:L "g"] [:F "4"])
([:Piece "B"] [:L "d"] [:F "3"])
([:Piece "R"] [:L "e"] [:F "6"]))
An atom can contain some keywords:
-
:L
refers to the destination line; -
:F
refers to the destination file; -
:Piece
give a hint about which piece is going to move; - Capture or special moves are also described.
The reducer can't process the translated data alone: each atom only explicitly contains the position a piece is goint to be moved to. It has to infer on which piece eventually move then it needs to the current state at each move. As it processes several atoms and output one single state, this function is a reducer.
This step is actually two-fold:
- First, an atomic reducer takes each atom and turns it into an explicit move. For example, the first atom
([:L "e"] [:F "4"])
eventually becomes[[5 2] [5 4]]
: the pawn in[5 2]
(= e2) moves two positions forward. - Once a atom has been resolved into an explicit move, the state reducer will perform it on the current state, starting from the classical initial state.
The output of this last step is the final state of the game:
{:player :black,
:history
[[[5 2] [5 4]]
[[5 7] [5 5]]
[[7 1] [6 3]]
[[2 8] [3 6]]
[[6 1] [2 5]]
[[1 7] [1 6]]
[[2 5] [1 4]]
[[7 8] [6 6]]
{:Castling :KingSide}
[[6 8] [5 7]]
[[6 1] [5 1]]
[[2 7] [2 5]]
[[1 4] [2 3]]
[[4 7] [4 6]]
[[3 2] [3 3]]
{:Castling :KingSide}
[[8 2] [8 3]]
;; (... truncated ...)
[[1 6] [5 6]]],
:positions
{:white
((king [5 1] [7 1] [6 2] [5 1] [4 2])
(queen [4 1] [5 1] [5 3] [7 5] nil)
(rook [1 1] [1 5] [1 7] [1 6] [4 6] [1 6] [5 6])
(rook [8 1] [6 1] [5 1] nil)
(bishop [3 1] [7 5] [8 4] [5 7] nil)
(bishop [6 1] [2 5] [1 4] [2 3] [3 4] [6 7] nil)
(knight [2 1] [3 3] [2 1] [4 2] [3 4] nil)
(knight [7 1] [6 3] [5 5] [6 7] nil)
(pawn [1 2] [1 3] [2 4] nil)
(pawn [2 2] [2 3])
(pawn [3 2] [3 3] [3 4] [2 5] nil)
(pawn [4 2] [4 4] [5 5] [4 6] nil)
(pawn [5 2] [5 4] nil)
(pawn [6 2] [6 3])
(pawn [7 2] [7 3] [7 4])
(pawn [8 2] [8 3] nil)),
:black
((king [5 8] [7 8] [6 7] [5 6] [4 6] [3 5] [2 5] [3 5])
(queen [4 8] [5 7] [6 6] [7 5] nil)
(rook [1 8] [5 8] [5 1] nil)
(rook [8 8] [6 8] [6 7] nil)
(bishop [3 8] [2 7] [3 8] [6 5] [4 3])
(bishop [6 8] [5 7] nil)
(knight [2 8] [3 6] [2 8] [4 7] [2 6] [4 5] [6 4] [8 3] [6 2])
(knight [7 8] [6 6] [5 4] [4 6] [3 4] nil)
(pawn [1 7] [1 6] [2 5] [2 4] nil)
(pawn [2 7] [2 5] nil)
(pawn [3 7] [3 6] [3 5] [2 4])
(pawn [4 7] [4 6] nil)
(pawn [5 7] [5 5] nil)
(pawn [6 7] nil)
(pawn [7 7] [7 6])
(pawn [8 7] [8 6] [7 5]))}}