Lossless Syntax Tree Pattern - oilshell/oil GitHub Wiki

Feel free to add more resources to this page or clean it up!

Hacker News Comments on "From AST to Lossless Syntax Tree"

Other Names

RedBaron: Full Syntax Tree
Go: AST
- Issue 20744: Free-floating comments are single-biggest issue when manipulating the AST
JavaScript: AST or CST
Roslyn: Syntax Tree -- makes the explicit claim of being round-trippable
- Red-Green Trees? This term is more about how to handle insertions in an IDE, which change the position of every node after the insertion. Red tree is a wrapper around the green tree.

IDE Architecture

Roslyn -- Microsoft's C# compiler platform
- Anders Hejlsberg on Modern Compiler Construction
JetBrains Implementing Parser and PSI -- The AST nodes have a direct mapping to text ranges in the underlying document. The bottom-most nodes of the AST match individual tokens returned by the lexer, and higher level nodes match multiple-token fragments. Operations performed on nodes of the AST tree, such as inserting, removing, reordering nodes and so on, are immediately reflected as changes to the text of the underlying document.
- There is currently no ready way to reuse existing language grammars, for example, from ANTLR, for creating custom language parsers. The parsers need to be coded manually. -- this should go in my meta-language rant
- Custom language parser and PSI classes can be generated from grammars using Grammar-Kit plugin. Besides code generation it provides various features for editing grammar files: syntax highlighting, quick navigation, refactorings and more. The Grammar-Kit plugin is built using its own engine and its source code can be found on GitHub.
- Comma, an IntelliJ-Based IDE for Raku (YouTube). Nice video with concrete experiences. He talks about PSI nodes, how they are memory intensive and have stubs, etc.
Making Tools Kythe-Compatible -- I would have thought that Kythe had the LST concept, because it's basically an IDE, but I don't see it. I guess that is because it's read-only?

Lossless Representations

cst for JavaScript -- CST means Concrete Syntax Tree. Unlike an AST (Abstract Syntax Tree), a CST contains all the information from the JavaScript source file: whitespace, punctuators, comments. This information is extremely useful for code style checkers and other code linters. CST is also useful for cases when you need to apply modifications to existing JavaScript files while preserving the initial file formatting.
JRuby Parser -- JRuby once had a parser which kept track of all sorts of extra information when it built it's Abstract Syntax Tree (AST). Stuff like character offsets where a particular element started or ended. The impact of this extra information was a more than noticeable amount of memory and a bit of a perf impact. At the time we decided to discontinue having this sort of parser in JRuby we created JRubyParser.* -- related to Parsing Is Difficult
Swift libSyntax (thanks to CAD1997 )

Currently, lib/AST structures don't make a very clear distinction between syntactic and semantic information. Long term, we hope to achieve the following based on work here: In no particular order, here is a summary of the design and implementation points for this library:

Represent Swift source with "full fidelity" - parsing a source file and printing the syntax tree should result in the same file.

Provide good structured editing APIs at all granularities.

Accommodate "bad syntax" - humans are imperfect and source code is constantly in a state of flux in an editor.

Roslyn Tree Optimizations, deduplication is a surprisingly big one
For configuration files:
- https://kev.inburke.com/kevin/more-comment-preserving-configuration-parsers/
- https://github.com/kevinburke/ssh_config
  - This is the second comment-preserving configuration parser I've written, after an /etc/hosts parser. Eventually, I will write one for every Linux file format.

Style-Preserving Source Translators

2to3 -- for Python 2 to Python 3
RedBaron -- RedBaron, by relying on Baron, uses a Full Syntax Tree (FST). It’s like an AST except it keeps every information, included formatting, and is then a lossless representation of the source code.
Facebook's jscodeshift -- uses the recast library
- recast - JavaScript syntax tree transformer, nondestructive pretty-printer, and automatic source map generator
Facebook's pfff -- Tools for code analysis, visualizations, or style-preserving source transformation
- Facebook's Spatch for PHP, based on the pfff toolkit
External Clang Examples -- has clang-mutate, but it's dormant

Source Reformatters

TODO: What data structures do these use?

clang-format -- Clang AST, which is huge
- http://clang.llvm.org/docs/LibFormat.html -- design is undocumented
gofmt -- uses the Go AST package, which is reputed to be messy
dartfmt -- The Hardest Program I've Ever Written
yapf for Python
scalafmt
prettier for JavaScript
rustfmt Design -- Blog Post
dfmt First uses a parsers to generate lists of locations with special meanings. Then goes through a lossless token stream to format.
FemtoCleaner - A bot to automatically upgrade your Julia syntax (August 2017)
- Unfortunately, while julia’s parser is accessible from within the language and can be used to find these instances of deprecated syntax, it cannot be used for our purposes. This is because it does not support our second goal - style preservation. -- Parsing Is Difficult
- There are several names of this concept, “round-tripable representation”, “Concrete Syntax Tree (CST)” or “Lossless Syntax Tree” being perhaps the most common. Luckily, in the Julia ecosystem we have not one, but two choices for such a parser

Other Docs / Designs / Discussions

Haskell GHC AST Annotations -- Simplify the roundtripping and modification of source by explicitly capturing the missing location information for the syntactic markers
What to do about comments? on Lambda the Ultimate
https://news.ycombinator.com/item?id=13914218 -- From C# compiler dev -- Resilient parsing. This is the big one! If you give our parser a string that is illegal according to the grammar, our parser will still give you a syntax tree! (We'll also spit errors out). But getting a syntax tree regardless of the actual validity of the program being passed in means that the IDE can give autocomplete and report type-checking error messages. As an example, the code "var x = velocity." is invalid C#. However, in order to give autocomplete on "velocity", that code needs to be parsed into an AST, and then typechecked, and then we can extract the members on the type in order to provide a good user experience.

Transpilers

These don't need to use the Lossless Syntax Tree because the resulting code won't be edited by a human.

CoffeeScript, etc.