# Semantic Analysis
NOTE: This page is way out of date and doesn't represent the current semantic analysis scheme whatsoever.
Semantic analysis for MiniPascal doesn't make use of an AST (abstract syntax tree). That makes implementing all the checks a bit more difficult than usual, I think. I've gone ahead and made the following supporting constructs for the semantic analysis:
- `mptypes.h`: A header file containing the types that the `mpascal.y` file will use, as well as its own token-types.
- `debug.c`/`debug.h`: (Not technically new) A set of files that buffer the line being read and provide formatted printing methods for errors and warnings.
- `numtab.c`/`numtab.h`: A table of constants. It seems a bit silly, but I wanted one of the fields in the type returned within the `mpascal.y` file to hold a "pointer" to a number (if it was a constant). I didn't literally want to store the number there, because then there's no way to signal "no number". It's now an index into the number table, and when the index is `-1` or `NIL` it indicates that no constant was provided.
- `strtab.c`/`strtab.h`: A table that holds strings. Mostly used by `symtab.h`.
- `symtab.c`/`symtab.h`: The symbol table. It essentially stores a token-type (`tt`) along with an `identifier`. Mostly used by `mpascal.y` to check whether an identifier exists or not. (Yes, there is no constant field stored within the symbol table. This means that if you assign `x` the value `1` and then reference `x` later in an expression, it will know `x` is of type `TT_INTEGER`, but not its constant value.) This could be changed later, maybe.
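For orientation, the number-table interface is tiny. Here's a minimal sketch of what `numtab.h` might expose; `installNumber` is the call used by the grammar actions further down, while `getNumber` and the exact placement of `NIL` are assumptions on my part, not the real file contents.

```c
/* numtab.h — hypothetical sketch of the constant-table interface.
 * installNumber() is referenced by the factor actions shown below;
 * getNumber() and the location of NIL are assumptions. */
#ifndef NUMTAB_H
#define NUMTAB_H

#define NIL (-1)   /* Sentinel index: "no constant value attached". */

int    installNumber(double value);   /* Store a constant; return its index. */
double getNumber(int index);          /* Retrieve a stored constant by index. */

#endif
```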
It's necessary to define exactly what each nonterminal rule can return in Bison. An example of this was given in the calculator tutorial exercise.
```
%union {
    float fval;
    char *strtabptr;
}
```
This union type makes sense for a calculator, but not so much for a compiler. For one, a union only lets you return one type at a time; we need more. A factor in MiniPascal may be a number or an identifier, which means it must be able to return either at any time. Hence, declaring `factor` as `%token <fval> factor` or `%token <strtabptr> factor` is not sufficient.
Anyways, I only plan on implementing the `expression` rule for the time being (along with all rules that `expression` leads to). It then suffices to build a `struct` that has all the fields you might need for an expression and its sub-rules. I made the following struct:
```c
typedef struct {
    unsigned tt;   // Token-Type.
    int      vi;   // Value-Index: Index of constant value in numtab.
} exprType;
```
A constant like `3` gets the token-type `TT_INTEGER` and an index into the number table which holds the value `3`. Alternatively, an identifier `x` declared as `TT_INTEGER` gets the exact same struct, but with a value-index of `NIL`. As I mentioned before, only constants keep their literal values. The moment you introduce a variable or an index into an array, it has to be saved to the symbol table (which doesn't store any literals) and the value is lost.
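To make that concrete, here's roughly how the two cases would look as `exprType` values (illustrative only; the include paths are my assumption about where these definitions live):

```c
#include "mptypes.h"   /* Assumed home of exprType and the TT_ defines. */
#include "numtab.h"    /* For installNumber() and NIL. */

void example(void)
{
    /* The constant 3: type plus a numtab index holding the literal value. */
    exprType threeLiteral = { .tt = TT_INTEGER, .vi = installNumber(3.0) };

    /* An identifier x declared as an integer: same type, but no constant value. */
    exprType xVariable = { .tt = TT_INTEGER, .vi = NIL };

    (void)threeLiteral;
    (void)xVariable;
}
```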
In any case, the `union` type for our parser is:
```
%union {
    exprType expr;
}
```
When we expand beyond expressions, we can include additional fields like `exprListType exprList` and `assignType assignment` for the other rules (those names are made up).
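The nonterminals then have to be bound to the union member they produce; in standard Bison that's done with `%type` declarations. A quick sketch of what that could look like in `mpascal.y` (the exact list of rule names here is my guess):

```
/* Bind nonterminals to the exprType member of the union (sketch only). */
%type <expr> expression simpleExpression term factor
```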
As an example, I'll explain how the `factor` rule works. Before that, I'll list what the token-types are.
Yes, we already have tokens returned by the `mpascal.lex` lexer (although defined in `mpascal.y`), but they aren't suitable for the semantic analysis. Here's the list of new token-types so far (`mptypes.h`):
```c
// Token-Types: Primitives.
#define TT_UNDEFINED        0
#define TT_INTEGER          1
#define TT_REAL             2

// Token-Class: Array Mask.
#define TC_ARRAY            4

// Token-Types: Arrays.
#define TT_ARRAY_INTEGER    4
#define TT_ARRAY_REAL       5

// Token-Class: Function Mask.
#define TC_FUNCTION         8

// Token-Types: Functions.
#define TT_FUNCTION_INTEGER 8
#define TT_FUNCTION_REAL    9

// Token-Types: Boolean.
#define TT_BOOLEAN          16
```
You might wonder what these "token classes" do. That will be explained next; after that, I'll get back to the `factor` rule implementation. For now, here is the rule so I can reference it:
```
factor : MP_ID                                   { $$ = (exprType){.tt = getTypeElseInstall(yytext, TT_UNDEFINED), .vi = NIL}; }
       | MP_ID MP_POPEN expressionList MP_PCLOSE { $$ = (exprType){.tt = resolveTypeClass(yytext, TC_FUNCTION), .vi = NIL}; }
       | MP_ID MP_BOPEN expression MP_BCLOSE     { $$ = (exprType){.tt = resolveTypeClass(yytext, TC_ARRAY), .vi = NIL}; }
       | MP_INTEGER                              { $$ = (exprType){.tt = TT_INTEGER, .vi = installNumber(atof(yytext))}; }
       | MP_REAL                                 { $$ = (exprType){.tt = TT_REAL, .vi = installNumber(atof(yytext))}; }
       | MP_POPEN expression MP_PCLOSE           { $$ = $2; }
       ;
```
A `factor` may be an identifier like `x`, a function call like `x(<expressionList>)`, or an array index like `x[2+2]`. When an array like `x` gets declared, it obviously has a type, e.g. an array of integers, so you'll give it a token-type like `TT_ARRAY_INTEGER`. However, you'll often just want to know whether it's an array at all and, if needed, which primitive type it contains (integer/real). The classes you see next to the token-types serve to do that broad categorization:
- `TT_ARRAY_INTEGER` and `TT_ARRAY_REAL` have the values 4 and 5 respectively. The mask for `TC_ARRAY` is 4. If you apply a bitwise AND between `TC_ARRAY` and either value, you get a nonzero result, since both have the third bit set. Notice that no other types do.
- `TT_FUNCTION_INTEGER` and `TT_FUNCTION_REAL` have the values 8 and 9 respectively. The mask for `TC_FUNCTION` is 8. If you use it as a mask as described above, you get a nonzero result, since both have the fourth bit set.
This is how you can easily determine the category of a `TT_<type>` using a `TC_<class>`.
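As a quick illustration, such a class check is just a bitwise AND against the mask; something along these lines (the helper name is mine, not necessarily what the real code uses):

```c
#include "mptypes.h"

/* Nonzero if token-type tt belongs to token-class tc (TC_ARRAY or TC_FUNCTION).
 * Hypothetical helper; the real checks may be written differently. */
static int isOfClass(unsigned tt, unsigned tc)
{
    return (tt & tc) != 0;
}

/* Examples:
 *   isOfClass(TT_ARRAY_REAL, TC_ARRAY)       -> nonzero (5 & 4 == 4)
 *   isOfClass(TT_FUNCTION_INTEGER, TC_ARRAY) -> 0       (8 & 4 == 0) */
```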
As for "reducing" more abstract types like arrays and functions to primitives, notice that you can use the operation: x -> 1 + (x mod 2)
to all of these values (except the primitives) to get one of the two primitive values.
- Applying `x -> 1 + (x mod 2)` to `TT_ARRAY_REAL` results in `1 + (5 mod 2)`, which is 2 (so `TT_REAL`).
- Applying `x -> 1 + (x mod 2)` to `TT_FUNCTION_INTEGER` results in `1 + (8 mod 2)`, which is 1 (so `TT_INTEGER`).
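A minimal sketch of that reduction as code (again, the helper name is made up; it should only be applied to the array/function token-types):

```c
#include "mptypes.h"

/* Reduce an array/function token-type to its primitive (TT_INTEGER or TT_REAL)
 * using the x -> 1 + (x mod 2) trick. Hypothetical helper name. */
static unsigned ttToPrimitive(unsigned tt)
{
    return 1 + (tt % 2);
}

/* ttToPrimitive(TT_ARRAY_REAL)       == 1 + (5 % 2) == 2 == TT_REAL
 * ttToPrimitive(TT_FUNCTION_INTEGER) == 1 + (8 % 2) == 1 == TT_INTEGER */
```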
This concludes my explanation of the token-types and token-classes. It's basically just an explanation of why the numbers are the way they are; their only purpose is to make the semantic checking functions easier to implement.
Let's now explain each of the lines in the `factor` rule:
- When you encounter `MP_ID`, you basically want to look up the `id` in the symbol table and get its token-type `tt`. If you can't, then it's clearly undefined, and you return an `exprType` with the token-type `TT_UNDEFINED`. That will most likely trigger an error further up in the expression if it's used.
- When you encounter a function call, you need to do more than just look up the token-type. You need to ensure that the token-type is of class `TC_FUNCTION`, and then you need to deduce the return type it has (`TT_INTEGER` or `TT_REAL`). Therefore, I wrote functions to "resolve" the token-type you expect down to its primitive (see the sketch after this list).
- When you encounter an array index, you perform the same sort of steps as for a function call. You require that the identifier exists and that it has a token-type of class `TC_ARRAY`. Finally, you extract the primitive type from it and return that as the new token-type to the rule that uses it.
- Encountering an `MP_INTEGER` lets you directly return an `exprType` with a token-type of `TT_INTEGER` and a constant value interpreted from the `yytext` value.
- Encountering an `MP_REAL` lets you directly return an `exprType` with a token-type of `TT_REAL` and a constant value interpreted from the `yytext` value.
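Putting the class check and the primitive reduction together, here's roughly the shape I have in mind for `resolveTypeClass`. This is a sketch under assumptions: `lookupTokenType` and `reportError` are stand-in names, not the actual `symtab`/`debug` routines.

```c
#include "mptypes.h"

/* Stand-in declarations; the real equivalents would come from symtab.h / debug.h. */
unsigned lookupTokenType(const char *id);
void     reportError(const char *msg);

/* Hypothetical sketch of the resolveTypeClass logic used in the factor rule:
 * look up the identifier, verify it belongs to the expected class, then reduce
 * it to the primitive type it yields. */
unsigned resolveTypeClass(const char *id, unsigned expectedClass)
{
    unsigned tt = lookupTokenType(id);

    if (tt == TT_UNDEFINED) {
        reportError("undeclared identifier");
        return TT_UNDEFINED;
    }
    if ((tt & expectedClass) == 0) {
        reportError("identifier is not of the expected class");
        return TT_UNDEFINED;
    }
    return 1 + (tt % 2);   /* Reduce to TT_INTEGER or TT_REAL. */
}
```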
And that basically concludes how the `factor` rule checking is implemented. As you go higher up the grammar, you run into terms and simpleExpressions. Both of these involve arithmetic operations like addition, subtraction, division, etc. This is where I do reductions on literal expressions and check for division by zero.
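For instance, the division case in the `term` rule could check whether the right operand is a known constant equal to zero before producing its result. A sketch of that action (the `MP_DIVIDE` token name, the `getNumber` accessor, and the choice of result type are all assumptions here):

```
term : term MP_DIVIDE factor {
         /* Sketch only: if the divisor is a constant recorded in numtab and it
          * is zero, report it right here. */
         if ($3.vi != NIL && getNumber($3.vi) == 0.0) {
             reportError("division by zero");
         }
         $$ = (exprType){ .tt = TT_REAL, .vi = NIL };
       }
     ;
```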
Some remaining issues:

- We don't have any block levels in use yet.
- We probably should save constant values to the symbol table. Nobody really explicitly divides by zero...
- We need to implement the remaining rules besides expressions.
- We need to fix a lot of the bugs in the error output. Oftentimes the lines shown are wrong.