Summarization of selected references and their input processing part in paper "Syntax Errors" - sagr4019/ResearchProject GitHub Wiki
This content will be copy-pasted into the Syntax Errors article once it has been created.
- Fixed 27% of programs completely and 19% partially using a multi-layered sequence-to-sequence neural network with attention
- Encoder RNN for inputs
- Decoder RNN for outputs
- Generated a fixed-sized pool of names
- Mapped each identifier (variable or function name) to a name from the pool
- Mapped each literal to a special token (ints -> NUM, strings -> STR)
- Appended a special token to mark the end of each token sequence
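The normalization steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the pool size, the `ID_n` naming scheme, the keyword set, and the `<eos>` token name are all assumptions made for the example.

```python
import re

POOL_SIZE = 200        # size of the fixed name pool (assumed value)
END_TOKEN = "<eos>"    # end-of-sequence marker (assumed name)

# Small illustrative keyword set; a real tokenizer would cover the full language.
KEYWORDS = {"int", "if", "else", "for", "while", "return", "printf", "scanf"}

def normalize(tokens):
    """Map identifiers to a fixed pool of names, literals to special
    tokens, and append an end-of-sequence token."""
    id_map = {}  # identifier -> pooled name, assigned on first occurrence
    out = []
    for tok in tokens:
        if re.fullmatch(r"\d+", tok):
            out.append("NUM")                    # integer literal -> NUM
        elif tok.startswith('"'):
            out.append("STR")                    # string literal -> STR
        elif re.fullmatch(r"[A-Za-z_]\w*", tok) and tok not in KEYWORDS:
            if tok not in id_map:
                id_map[tok] = f"ID_{len(id_map) % POOL_SIZE + 1}"
            out.append(id_map[tok])              # identifier -> pooled name
        else:
            out.append(tok)                      # keywords, operators, punctuation
    out.append(END_TOKEN)
    return out

print(normalize(["int", "count", "=", "42", ";", "printf", "(", '"hi"', ")", ";"]))
# → ['int', 'ID_1', '=', 'NUM', ';', 'printf', '(', 'STR', ')', ';', '<eos>']
```

This anonymization keeps the vocabulary small and fixed, so the network learns syntax patterns rather than memorizing arbitrary user-chosen names.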
Predicting the entire target sequence is difficult because of its size. Instead, they encoded line numbers into the program representation: a statement S at line L is represented as (l, s), where l and s are the tokenizations of L and S. A program of k lines is thus represented as (l1, s1) ... (lk, sk), with l1, ..., lk the line numbers and s1, ..., sk the token sequences. A single output fix consists of a line number li and an associated statement s'i that fixes the statement si.
This results in a much smaller output than the entire sequence, which might be easier to predict.
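The (line number, statement) encoding can be sketched like this. Tokenizing the line number digit by digit is an assumption for illustration; the paper may encode line numbers differently.

```python
def encode_program(lines):
    """Represent a k-line program as pairs (l1, s1) ... (lk, sk):
    each pair holds the tokenized line number and the tokenized statement."""
    pairs = []
    for i, stmt_tokens in enumerate(lines, start=1):
        line_tokens = list(str(i))  # line number tokenized digit by digit (assumed)
        pairs.append((line_tokens, stmt_tokens))
    return pairs

# Two-line toy program, already normalized; line 1 is missing its ';'.
program = [["int", "ID_1", "=", "NUM"],
           ["return", "NUM", ";"]]
print(encode_program(program))
# → [(['1'], ['int', 'ID_1', '=', 'NUM']), (['2'], ['return', 'NUM', ';'])]

# A predicted fix is a single (line number, repaired statement) pair,
# far shorter than re-emitting the whole program:
fix = (1, ["int", "ID_1", "=", "NUM", ";"])
```

Emitting only the repaired line keeps the decoder's target short regardless of program length, which is the point made above.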