# Semantic Analysis
NOTE: This page is way out of date and doesn't represent the current semantic analysis scheme whatsoever.
Semantic analysis for MiniPascal doesn't make use of an AST (abstract syntax tree). That makes implementing all the checks a bit more difficult than usual, I think. I've gone ahead and made the following supporting constructs for the semantic analysis:
- `mptypes.h`: A header file containing the types that the `mpascal.y` file will use, as well as its own token-types.
- `debug.c`/`debug.h`: (Not technically new) A set of files that buffer the line being read and provide formatted printing methods for errors and warnings.
- `numtab.c`/`numtab.h`: A table of constants. It seems a bit silly, but I wanted one of the fields in the type returned within the `mpascal.y` file to hold a "pointer" to a number (if it was a constant). I didn't literally want to store the number there, because then there's no way to signal "no number". It's now an index into the number table, and when the index is `-1` or `NIL` it indicates that no constant was provided.
- `strtab.c`/`strtab.h`: A table that holds strings. Mostly used by `symtab.h`.
- `symtab.c`/`symtab.h`: The symbol table. It essentially stores a token-type (`tt`) along with an `identifier`. Mostly used by `mpascal.y` to check whether an identifier exists or not. (Yes, there is no constant field stored within the symbol table. This means that if you assign `x` the value `1` and then reference `x` later in an expression, it will know `x` is of type `TT_INTEGER`, but not its constant value.) This could be changed later, maybe.
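For orientation, the number-table interface is tiny. Here's a minimal sketch of what `numtab.h` might expose; `installNumber` is the call used by the grammar actions further down, while `getNumber` and the exact placement of `NIL` are assumptions on my part, not the real file contents.

```c
/* numtab.h — hypothetical sketch of the constant-table interface.
 * installNumber() is referenced by the factor actions shown below;
 * getNumber() and the location of NIL are assumptions. */
#ifndef NUMTAB_H
#define NUMTAB_H

#define NIL (-1)   /* Sentinel index: "no constant value attached". */

int    installNumber(double value);   /* Store a constant; return its index. */
double getNumber(int index);          /* Retrieve a stored constant by index. */

#endif
```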
It's necessary to define exactly what each nonterminal rule can return in Bison. An example of this was given in the calculator tutorial exercise.
```
%union {
    float fval;
    char *strtabptr;
}
```
This union type makes sense for a calculator, but not so much for a compiler. For one, a union only lets you return one type at a time; we need more. A factor in MiniPascal may be a number or an identifier, which means it must be able to return either at any time. Hence, declaring `factor` as `%token <fval> factor` or `%token <strtabptr> factor` is not sufficient.
Anyways, I only plan on implementing the `expression` rule for the time being (along with all rules that `expression` leads to). It then suffices to build a `struct` that has all the fields you might need for an expression and its sub-rules. I made the following struct:
```c
typedef struct {
    unsigned tt;   // Token-Type.
    int      vi;   // Value-Index: Index of constant value in numtab.
} exprType;
```
A constant like `3` gets the token-type `TT_INTEGER` and an index into the number table which holds the value `3`. Alternatively, an identifier `x` declared as `TT_INTEGER` gets the exact same struct, but with a value-index of `NIL`. As I mentioned before, only constants keep their literal values. The moment you introduce a variable or an index into an array, it has to be saved to the symbol table (which doesn't store any literals) and the value is lost.
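To make that concrete, here's roughly how the two cases would look as `exprType` values (illustrative only; the include paths are my assumption about where these definitions live):

```c
#include "mptypes.h"   /* Assumed home of exprType and the TT_ defines. */
#include "numtab.h"    /* For installNumber() and NIL. */

void example(void)
{
    /* The constant 3: type plus a numtab index holding the literal value. */
    exprType threeLiteral = { .tt = TT_INTEGER, .vi = installNumber(3.0) };

    /* An identifier x declared as an integer: same type, but no constant value. */
    exprType xVariable = { .tt = TT_INTEGER, .vi = NIL };

    (void)threeLiteral;
    (void)xVariable;
}
```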
In any case, the `union` type for our parser is:
```
%union {
    exprType expr;
}
```
When we expand beyond expressions, we can include additional fields like `exprListType exprList` and `assignType assignment` for the other rules (those names are made up).
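The nonterminals then have to be bound to the union member they produce; in standard Bison that's done with `%type` declarations. A quick sketch of what that could look like in `mpascal.y` (the exact list of rule names here is my guess):

```
/* Bind nonterminals to the exprType member of the union (sketch only). */
%type <expr> expression simpleExpression term factor
```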
As an example, I'll explain how the `factor` rule works. Before that, I'll list what the token-types are.
Yes, we already have tokens returned by the `mpascal.lex` lexer (although defined in `mpascal.y`), but they aren't suitable for the semantic analysis. Here's the list of new token-types so far (`mptypes.h`):
```c
// Token-Types: Primitives.
#define TT_UNDEFINED        0
#define TT_INTEGER          1
#define TT_REAL             2

// Token-Class: Array Mask.
#define TC_ARRAY            4

// Token-Types: Arrays.
#define TT_ARRAY_INTEGER    4
#define TT_ARRAY_REAL       5

// Token-Class: Function Mask.
#define TC_FUNCTION         8

// Token-Types: Functions.
#define TT_FUNCTION_INTEGER 8
#define TT_FUNCTION_REAL    9

// Token-Types: Boolean.
#define TT_BOOLEAN          16
```
You might wonder what these "token classes" do. That will be explained next; after that, I'll get back to the `factor` rule implementation. For now, here is the rule so I can reference it:
```
factor : MP_ID                                   { $$ = (exprType){.tt = getTypeElseInstall(yytext, TT_UNDEFINED), .vi = NIL}; }
       | MP_ID MP_POPEN expressionList MP_PCLOSE { $$ = (exprType){.tt = resolveTypeClass(yytext, TC_FUNCTION), .vi = NIL}; }
       | MP_ID MP_BOPEN expression MP_BCLOSE     { $$ = (exprType){.tt = resolveTypeClass(yytext, TC_ARRAY), .vi = NIL}; }
       | MP_INTEGER                              { $$ = (exprType){.tt = TT_INTEGER, .vi = installNumber(atof(yytext))}; }
       | MP_REAL                                 { $$ = (exprType){.tt = TT_REAL, .vi = installNumber(atof(yytext))}; }
       | MP_POPEN expression MP_PCLOSE           { $$ = $2; }
       ;
```
A `factor` may be an identifier like `x`, a function call like `x(<expressionList>)`, or an array index like `x[2+2]`. When an array like `x` gets declared, it obviously has a type, e.g. an array of integers, so you'll give it a token-type like `TT_ARRAY_INTEGER`. However, you'll often just want to know whether it's an array at all and, if needed, which primitive type it contains (integer/real). The classes you see next to the token-types serve to do that broad categorization:
- `TT_ARRAY_INTEGER` and `TT_ARRAY_REAL` have the values 4 and 5 respectively. The mask for `TC_ARRAY` is 4. If you apply a bitwise AND between `TC_ARRAY` and either value, you get a nonzero result, since both have the third bit set. Notice that no other types do.
- `TT_FUNCTION_INTEGER` and `TT_FUNCTION_REAL` have the values 8 and 9 respectively. The mask for `TC_FUNCTION` is 8. If you use it as a mask as described above, you get a nonzero result, since both have the fourth bit set.
This is how you can easily determine the category of a `TT_<type>` using a `TC_<class>`.
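As a quick illustration, such a class check is just a bitwise AND against the mask; something along these lines (the helper name is mine, not necessarily what the real code uses):

```c
#include "mptypes.h"

/* Nonzero if token-type tt belongs to token-class tc (TC_ARRAY or TC_FUNCTION).
 * Hypothetical helper; the real checks may be written differently. */
static int isOfClass(unsigned tt, unsigned tc)
{
    return (tt & tc) != 0;
}

/* Examples:
 *   isOfClass(TT_ARRAY_REAL, TC_ARRAY)       -> nonzero (5 & 4 == 4)
 *   isOfClass(TT_FUNCTION_INTEGER, TC_ARRAY) -> 0       (8 & 4 == 0) */
```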
As for "reducing" more abstract types like arrays and functions to primitives, notice that you can use the operation: x -> 1 + (x mod 2)
to all of these values (except the primitives) to get one of the two primitive values.
- Applying `x -> 1 + (x mod 2)` to `TT_ARRAY_REAL` results in `1 + (5 mod 2)`, which is 2 (so `TT_REAL`).
- Applying `x -> 1 + (x mod 2)` to `TT_FUNCTION_INTEGER` results in `1 + (8 mod 2)`, which is 1 (so `TT_INTEGER`).
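A minimal sketch of that reduction as code (again, the helper name is made up; it should only be applied to the array/function token-types):

```c
#include "mptypes.h"

/* Reduce an array/function token-type to its primitive (TT_INTEGER or TT_REAL)
 * using the x -> 1 + (x mod 2) trick. Hypothetical helper name. */
static unsigned ttToPrimitive(unsigned tt)
{
    return 1 + (tt % 2);
}

/* ttToPrimitive(TT_ARRAY_REAL)       == 1 + (5 % 2) == 2 == TT_REAL
 * ttToPrimitive(TT_FUNCTION_INTEGER) == 1 + (8 % 2) == 1 == TT_INTEGER */
```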
This concludes my explanation of the token-types and token-classes. It's basically just an explanation of why the numbers are the way they are; their only purpose is to make the semantic checking functions easier to implement.
Let's now explain each of the lines in the `factor` rule:
- When you encounter `MP_ID`, you basically want to look up the `id` in the symbol table and get its token-type `tt`. If you can't, then it's clearly undefined, and you return an `exprType` with the token-type `TT_UNDEFINED`. That will most likely trigger an error further up in the expression if it's used.
- When you encounter a function call, you need to do more than just look up the token-type. You need to ensure that the token-type is of class `TC_FUNCTION`, and then you need to deduce the return type it has (`TT_INTEGER` or `TT_REAL`). Therefore, I wrote functions to "resolve" the token-type you expect down to its primitive (see the sketch after this list).
- When you encounter an array index, you perform the same sort of steps as for a function call. You require that the identifier exists and that it has a token-type of class `TC_ARRAY`. Finally, you extract the primitive type from it and return that as the new token-type to the rule that uses it.
- Encountering an `MP_INTEGER` lets you directly return an `exprType` with a token-type of `TT_INTEGER` and a constant value interpreted from the `yytext` value.
- Encountering an `MP_REAL` lets you directly return an `exprType` with a token-type of `TT_REAL` and a constant value interpreted from the `yytext` value.
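Putting the class check and the primitive reduction together, here's roughly the shape I have in mind for `resolveTypeClass`. This is a sketch under assumptions: `lookupTokenType` and `reportError` are stand-in names, not the actual `symtab`/`debug` routines.

```c
#include "mptypes.h"

/* Stand-in declarations; the real equivalents would come from symtab.h / debug.h. */
unsigned lookupTokenType(const char *id);
void     reportError(const char *msg);

/* Hypothetical sketch of the resolveTypeClass logic used in the factor rule:
 * look up the identifier, verify it belongs to the expected class, then reduce
 * it to the primitive type it yields. */
unsigned resolveTypeClass(const char *id, unsigned expectedClass)
{
    unsigned tt = lookupTokenType(id);

    if (tt == TT_UNDEFINED) {
        reportError("undeclared identifier");
        return TT_UNDEFINED;
    }
    if ((tt & expectedClass) == 0) {
        reportError("identifier is not of the expected class");
        return TT_UNDEFINED;
    }
    return 1 + (tt % 2);   /* Reduce to TT_INTEGER or TT_REAL. */
}
```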
And that basically concludes how the `factor` rule checking is implemented. As you go higher up the grammar, you run into terms and simpleExpressions. Both of these involve arithmetic operations like addition, subtraction, division, etc. This is where I do reductions on literal expressions and check for division by zero.
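For instance, the division case in the `term` rule could check whether the right operand is a known constant equal to zero before producing its result. A sketch of that action (the `MP_DIVIDE` token name, the `getNumber` accessor, and the choice of result type are all assumptions here):

```
term : term MP_DIVIDE factor {
         /* Sketch only: if the divisor is a constant recorded in numtab and it
          * is zero, report it right here. */
         if ($3.vi != NIL && getNumber($3.vi) == 0.0) {
             reportError("division by zero");
         }
         $$ = (exprType){ .tt = TT_REAL, .vi = NIL };
       }
     ;
```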
Some remaining issues:

- We don't have any block levels in use yet.
- We probably should save constant values to the symbol table. Nobody really explicitly divides by zero...
- We need to implement the remaining rules besides expressions.
- We need to fix a lot of the bugs in the error output. Oftentimes the lines shown are wrong.