03 abstract syntax tree - zestones/Argonaut GitHub Wiki
The Abstract Syntax Tree (AST) is a pivotal data structure in the Argonaut Compiler, representing the syntactic and semantic structure of a source program. It is constructed during the parsing phase and is subsequently used for semantic analysis and code generation. Below, we provide a comprehensive guide on how the AST is constructed and managed in the Argonaut Compiler.
The AST is a hierarchical tree structure where each node represents a construct occurring in the source code, such as a variable declaration, a function call, or an operator. It provides a higher-level abstraction than the raw token stream produced by the lexical analysis.
- Nodes: Each node in the AST corresponds to a syntactic construct.
-
Node Types: Each node has a specific type (e.g.,
A_VARIABLE_DECLARATION
,A_FUNCTION_CALL
) indicating its role. - Child-Sibling Representation: The AST uses the Left-Child Right-Sibling (LCRS) representation to allow arbitrary branching while maintaining efficient traversal.
- Symbol Tables: Nodes may reference entries in the symbol tables (e.g., declaration table, lexeme table) for semantic information.
Each node in the AST is defined as follows:
typedef struct Node {
struct Node* child; // Pointer to the left child
struct Node* sibling; // Pointer to the right sibling
NodeType type; // Node type (operator, variable, literal, etc.)
int index_lexicographic; // Index in the lexicographic table
int index_declaration; // Index in the declaration table
} Node;
We defined a flexible mechanism for managing and enumerating node types in a syntax tree, commonly used in compilers or interpreters. Here's a breakdown of its functionality:
The FOREACH_NODE_TYPE
macro lists all node types. Each node type represents a specific construct in a programming language (e.g., literals, operations, control flow, declarations).
Example:
NODE(A_INTEGER_LITERAL)
NODE(A_ADD_OP)
NODE(A_IF)
This single macro acts as a central repository for all node types, avoiding repetitive definitions and making it easy to add or modify types.
The FOREACH_NODE_TYPE
macro is used to generate the values of the NodeType
enum:
#define GENERATE_ENUM(ENUM) ENUM,
typedef enum {
FOREACH_NODE_TYPE(GENERATE_ENUM)
} NodeType;
This translates to:
typedef enum {
A_INTEGER_LITERAL,
A_FLOAT_LITERAL,
A_BOOLEAN_LITERAL,
...
} NodeType;
The benefit:
- Adding or removing a node type requires editing only the
FOREACH_NODE_TYPE
macro. - No need to manually update the
NodeType
enum.
The NodeTypeStrings
array is generated using another macro:
#define GENERATE_STRING(STRING) #STRING,
static const char *NodeTypeStrings[] = {
FOREACH_NODE_TYPE(GENERATE_STRING)
};
This converts node types like A_INTEGER_LITERAL
into readable strings ("A_INTEGER_LITERAL"
) for debugging or visualization.
Example:
NodeTypeStrings[A_INTEGER_LITERAL] == "A_INTEGER_LITERAL"
The construction of the AST is an essential part of the compilation process. It starts during the parsing phase, where the source code is analyzed according to the grammar rules of the language. The primary goal is to transform the linear sequence of tokens (produced during lexical analysis) into a hierarchical structure (the AST) that captures both the syntax and semantics of the program.
When the parsing process begins, the AST is initially empty. Think of it as a blank canvas waiting for nodes to be added. Each node represents a syntactic construct, and together they form a tree that reflects the nested, hierarchical structure of the source code.
For example:
- A program may consist of functions.
- Functions may consist of statements.
- Statements may include expressions, loops, or assignments.
At the start, the AST has no nodes because the parser has not yet encountered any constructs.
The AST uses the Left-Child Right-Sibling (LCRS) representation, which simplifies the structure by allowing each node to have at most two pointers:
- A pointer to its first child (representing the first element in a nested structure).
- A pointer to its next sibling (representing adjacent constructs).
This structure makes it possible to represent constructs with arbitrary branching (e.g., a function with many parameters) while maintaining a simple and efficient memory layout.
Nodes are created dynamically as the parser encounters constructs in the source code. For each syntactic rule in the grammar:
- A new node is created.
- The node is populated with information like its type and links to symbol tables.
- Child and sibling pointers are adjusted to reflect the tree's structure.
In the Argonaut code a node can be created using the construct_node
function :
Node* construct_node(NodeType type, int index_lexicographic, int index_declaration) {
Node* new_node = (Node*) malloc(sizeof(Node));
if (!new_node) {
perror("Failed to allocate memory for AST node");
exit(EXIT_FAILURE);
}
new_node->type = type;
new_node->index_lexicographic = index_lexicographic;
new_node->index_declaration = index_declaration;
new_node->child = NULL;
new_node->sibling = NULL;
return new_node;
}
Default nodes can be created without specific indices:
Node* construct_node_default(NodeType type) {
return construct_node(type, NULL_VALUE, NULL_VALUE);
}
For example, when the parser encounters a variable declaration, it creates a node of type A_VARIABLE_DECLARATION
and links it to the corresponding identifier and type information from the symbol tables.
variable_declaration: VARIABLE IDENTIFIER TWO_POINTS type SEMICOLON {
declaration_variable_start($2, $4);
$$ = construct_node(A_VARIABLE_DECLARATION, $2, find_declaration_index($4));
}
;
This rule:
- Starts the variable declaration process.
- Creates an
A_VARIABLE_DECLARATION
node with the identifier and type indices. - Links the node appropriately in the AST.
The AST grows as nodes are linked together to form subtrees. This happens recursively:
-
Adding Children: When a construct has nested components, child nodes are added to represent them. For example:
- An if statement node may have a child node for its condition and another for its body.
-
Adding Siblings: When constructs occur sequentially, sibling nodes are added to represent them. For example:
- In a block of statements, each statement is added as a sibling of the previous one.
The tree structure captures the nesting and ordering of constructs in the program.
We use the defined functions to add children and siblings:
- Add a Child:
void add_child(Node* parent, Node* child) {
if (!parent || !child) return;
child->sibling = parent->child;
parent->child = child;
}
- Add a Sibling:
void add_sibling(Node* node, Node* sibling) {
if (!node || !sibling) return;
while (node->sibling) {
node = node->sibling;
}
node->sibling = sibling;
}
- Append a Child or Sibling:
For scenarios requiring sequential additions:
void append_child(Node* parent, Node* child);
void append_sibling(Node* node, Node* sibling);
Complex constructs often require constructing subtrees before integrating them into the larger AST (e.g. the function_declaration
and procedure_declaration
rules). The subtree construction process ensures that even deeply nested constructs are represented correctly and efficiently.
Subtrees can be constructed by chaining nodes using add_chain
. For example:
void add_chain(Node* parent, Node* nodes[], int count) {
Node* current = NULL;
for (int i = 0; i < count; i++) {
if (!is_node_null(nodes[i])) {
if (current == NULL) {
add_child(parent, nodes[i]);
current = nodes[i];
} else {
add_sibling(current, nodes[i]);
current = nodes[i];
}
}
}
}
The AST construction is tightly integrated with the parser, which is generated from the grammar file (src/parser/grammar.y
). Each rule in the grammar corresponds to a specific syntactic construct, and the associated code creates and links the necessary nodes.
For example:
- When the parser encounters a rule for a function declaration, it creates a node for the function and links nodes for its parameters, body, and return type.
The parser essentially "walks" through the source code, building the AST piece by piece according to the language's grammar.
When the parser encounters a function declaration, it creates a node for the function and links nodes representing its parameters, body, and return type. Below is the grammar rule for a function declaration with detailed comments explaining each step:
function_declaration:
// Step 1: Recognize the 'FUNCTION' keyword and the function identifier
FUNCTION IDENTIFIER {
construct_func_proc_manager_context($2); // Initialize context for the function
declaration_func_start(); // Mark the start of a function declaration
}
// Step 2: Parse the function's parameter list
OPEN_PARENTHESIS parameter_list CLOSE_PARENTHESIS RETURN_TYPE type {
update_declaration_func_return_type($8); // Set the return type for the function in the symbol table
}
// Step 3: Parse the function's body, including declarations and statements
START declaration_list statement_list END {
// Create a new node for the function declaration
$$ = construct_node(A_FUNCTION_DECLARATION, $2, find_declaration_index_by_nature($2, TYPE_FUNC));
// Create an array of child nodes: parameters, declarations, and statements
Node *nodes[] = { $5, $11, $12 };
// Link the child nodes to the function node
add_chain($$, nodes, 3);
// Update the region's AST with the constructed function node
update_region_ast(peek_region(), $$);
// Finalize the function declaration context
declaration_func_proc_end();
// Reconstruct the function declaration node for the region (if needed)
$$ = construct_node(A_FUNCTION_DECLARATION, $2, find_declaration_index_by_nature($2, TYPE_FUNC));
}
;
-
Step 1: Recognizing the Function Start
- The parser identifies the
FUNCTION
keyword followed by the function identifier (IDENTIFIER
). - The semantic action initializes a function/procedure context (
construct_func_proc_manager_context
) to manage attributes like parameters and return types. - The
declaration_func_start()
function marks the beginning of a new function declaration in the symbol table.
construct_func_proc_manager_context($2); // Prepare function-specific data declaration_func_start(); // Initialize function declaration process
- The parser identifies the
-
Step 2: Parsing the Parameters and Return Type
- The function's parameter list is parsed between
OPEN_PARENTHESIS
andCLOSE_PARENTHESIS
. - After parsing
RETURN_TYPE
and thetype
keyword, the semantic action updates the function's return type in the symbol table (update_declaration_func_return_type
).
update_declaration_func_return_type($8); // Record the return type in the function's symbol table entry
- The function's parameter list is parsed between
-
Step 3: Parsing the Function Body
- The function body starts with
START
and ends withEND
, containing a list of declarations (declaration_list
) and executable statements (statement_list
). - A new node (
A_FUNCTION_DECLARATION
) is created to represent the function in the AST:-
$2
is the identifier for the function name. -
find_declaration_index_by_nature
fetches the declaration index for the function.
-
$$ = construct_node(A_FUNCTION_DECLARATION, $2, find_declaration_index_by_nature($2, TYPE_FUNC));
- An array of child nodes is created to represent the function's parameters, declarations, and statements:
-
$5
: Represents the parameter list. -
$11
: Represents declarations inside the function body. -
$12
: Represents the statements inside the function body.
-
Node *nodes[] = { $5, $11, $12 };
- The
add_chain
function links these child nodes to the function node.
add_chain($$, nodes, 3); // Link the child nodes to the parent function node
- The constructed function node is added to the current region's AST.
update_region_ast(peek_region(), $$); // Update the region's AST with this function node
- The function body starts with
-
Finalization
- The
declaration_func_proc_end()
function finalizes the function declaration context, ensuring all necessary data is stored and indexed correctly. - If required, the function declaration node is reconstructed for further usage in the region.
declaration_func_proc_end(); // Finalize the function declaration $$ = construct_node(A_FUNCTION_DECLARATION, $2, find_declaration_index_by_nature($2, TYPE_FUNC));
- The
For a function foo
with the following structure:
func foo(x: int) -> int {
var y: int;
y := x + 1;
return y;
}
The constructed AST might look like this (simplified):
├── A_FUNCTION_DECLARATION, Lexico Idx: 4, Decl Idx: 4 -- Lexeme: 'foo'
│ ├── A_PARAMETER_LIST, Lexico Idx: -1, Decl Idx: -1
│ │ ├── A_PARAMETER, Lexico Idx: 5, Decl Idx: 5 -- Lexeme: 'x'
│ └── A_DECLARATION_LIST, Lexico Idx: -1, Decl Idx: -1
│ │ ├── A_VARIABLE_DECLARATION, Lexico Idx: 6, Decl Idx: 6 -- Lexeme: 'y'
│ └── A_STATEMENT_LIST, Lexico Idx: -1, Decl Idx: -1
│ │ ├── A_ASSIGNMENT_STATEMENT, Lexico Idx: -1, Decl Idx: -1
│ │ │ ├── A_VARIABLE_ASSIGNMENT, Lexico Idx: 6, Decl Idx: 6 -- Lexeme: 'y'
│ │ │ │ ├── A_ADD_OP, Lexico Idx: -1, Decl Idx: -1
│ │ │ │ │ ├── A_IDENTIFIER, Lexico Idx: 5, Decl Idx: 5 -- Lexeme: 'x'
│ │ │ │ │ └── A_INTEGER_LITERAL, Lexico Idx: 7, Decl Idx: -1 -- Lexeme: '1'
│ │ └── A_RETURN_STATEMENT, Lexico Idx: -1, Decl Idx: -1
│ │ │ ├── A_IDENTIFIER, Lexico Idx: 6, Decl Idx: 6 -- Lexeme: 'y'
By tightly coupling AST construction with the parser, the compiler efficiently builds a hierarchical structure representing the program. This ensures that each syntactic construct is accurately captured and linked for subsequent compilation phases.