Lexical Analysis
Lexical analysis checks whether a stream of characters (separated by white space) is a valid token. Let's take a snippet written in C as an example:
```
const long var_long = 9;
char 09char = 'A';
```
`const`, `long`, `9`, `'A'`, `;`, and `09char` are all character streams separated by one or more white-space characters. But only the first five are valid tokens; the last one is invalid, since identifiers (tokens) in C can only consist of alphanumeric characters or underscores and cannot start with a digit.
That is what this phase does: it analyzes the contents of files written in the Tiger language to see whether they are all valid tokens. For details about the definition of identifiers, see the Tiger Language Reference Manual page of this wiki.
- do.h
- do.c

Users can use the interfaces provided in these files to do the lexical analysis.
Here, as described in the Tiger book, I used Lex to do the lexical analysis work of this compiler.
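For readers unfamiliar with Lex, the sketch below shows the conventional way a Lex-generated scanner is driven. It only uses the standard Lex externs (`yylex`, `yytext`, `yyin`) and is not the interface exposed by do.h; it just illustrates what the generated scanner provides.

```c
/* Minimal sketch: drive a Lex-generated scanner and print every token.
 * This shows the standard Lex interface, not the project's do.h API. */
#include <stdio.h>

extern int   yylex(void);   /* scanner generated from myTiger.lex            */
extern char *yytext;        /* text of the current token (declared as in flex) */
extern FILE *yyin;          /* input stream the scanner reads from           */

int main(int argc, char **argv)
{
    if (argc > 1) {
        yyin = fopen(argv[1], "r");   /* scan a source file instead of stdin */
        if (!yyin) { perror(argv[1]); return 1; }
    }

    int token;
    while ((token = yylex()) != 0)    /* yylex() returns 0 at end of input   */
        printf("token %d: \"%s\"\n", token, yytext);
    return 0;
}
```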
Position tracking is necessary for two main reasons:
- Use in later phases: In this compiler, syntax analysis and semantic analysis are done at different times, but both phases need position information to do meaningful work, so we have to keep token positions (one way to do this is sketched after this list).
- Error reporting: There are also many phases where position information is needed when reporting errors.
Source files are actually plain strings written in a certain format. Among these strings, some carry important data, such as integer values and string values. So, while processing, we have to store these data somewhere. Here, I set aside some memory locations to hold them (see lexString.c).
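A typical way to collect a string literal is to append each matched chunk to a growable buffer while the scanner is inside its string rule. The sketch below uses hypothetical names and is not the actual lexString.c interface.

```c
/* Hypothetical growable buffer for collecting the characters of a string
 * literal (not the actual lexString.c interface). */
#include <stdlib.h>
#include <string.h>

static char  *buffer   = NULL;
static size_t length   = 0;
static size_t capacity = 0;

void stringStart(void)                 /* called at the opening '"'      */
{
    if (!buffer) {
        capacity = 32;
        buffer = malloc(capacity);
    }
    length = 0;
    buffer[0] = '\0';
}

void stringAppend(const char *text)    /* called for each matched chunk  */
{
    size_t textLen = strlen(text);
    if (length + textLen + 1 > capacity) {
        capacity = (length + textLen + 1) * 2;
        buffer = realloc(buffer, capacity);
    }
    memcpy(buffer + length, text, textLen);
    length += textLen;
    buffer[length] = '\0';
}

const char *stringFinish(void)         /* called at the closing '"'      */
{
    return buffer;                     /* valid until the next string    */
}
```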
Some special characters are hard to type and to see; these include non-printable characters and control characters. The Tiger language supports escape sequences for them. To treat these "different" characters uniformly, we have to translate the escape sequences into the real characters they stand for.
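As an illustration of such a translation (not the actual stringEscape interface; the set of escapes handled here is only a guess based on the Tiger book), a single escape sequence could be decoded like this:

```c
/* Hypothetical helper: translate one escape sequence (starting at the
 * backslash) into the real character it denotes. Only a few common
 * Tiger-book escapes are shown. */
char translateEscape(const char *escape)
{
    switch (escape[1]) {               /* escape[0] is the backslash */
    case 'n':  return '\n';            /* newline            */
    case 't':  return '\t';            /* horizontal tab     */
    case '"':  return '"';             /* double quote       */
    case '\\': return '\\';            /* backslash itself   */
    default:
        /* \ddd: the character whose ASCII code is the 3 decimal digits */
        return (char)((escape[1] - '0') * 100 +
                      (escape[2] - '0') * 10 +
                      (escape[3] - '0'));
    }
}
```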
CRLF (carriage return + line feed) and LF (line feed) are two different representations of a newline, used on Windows and on Linux respectively. So, under Linux systems, a newline is written as `\n`; under Windows systems, it is written as `\r\n`.
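If source files may come from either platform, one option (an assumption here, not necessarily what myTiger.lex does) is to let the newline-counting code recognize both forms:

```c
/* Hypothetical check: treat both LF and CRLF as a single newline when
 * counting lines, so positions stay correct for files from either platform. */
int isNewline(const char *text, int *consumed)
{
    if (text[0] == '\r' && text[1] == '\n') { *consumed = 2; return 1; }  /* Windows CRLF */
    if (text[0] == '\n')                    { *consumed = 1; return 1; }  /* Linux LF     */
    *consumed = 0;
    return 0;
}
```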
When the analysis encounters an error, it needs to report it, for example by printing the error position and the offending text.
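A reporting routine might look roughly like the sketch below, which reuses the hypothetical `lineStarts` table from the position-tracking sketch above to turn a flat character position back into a line and column. The names are illustrative, not myReport's real interface.

```c
/* Hypothetical error reporter: convert a flat character position into
 * line/column using the lineStarts table, then print a formatted message. */
#include <stdarg.h>
#include <stdio.h>

extern int  lineCount;          /* from the position-tracking sketch           */
extern int *lineStarts;
extern const char *fileName;    /* assumed to be set when scanning starts      */

void reportError(int position, const char *format, ...)
{
    int line = lineCount;
    while (line > 0 && lineStarts[line - 1] > position)
        --line;                 /* number of line starts at or before position */
    int column = (line > 0) ? position - lineStarts[line - 1] + 1 : position;

    fprintf(stderr, "%s:%d.%d: error: ", fileName, line + 1, column);

    va_list args;
    va_start(args, format);
    vfprintf(stderr, format, args);
    va_end(args);
    fputc('\n', stderr);
}
```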
Lex files:
- myTiger.lex

Error report and position tracking:
- myReport.h
- myReport.c

String escape:
- stringEscape.h
- stringEscape.c
Special files:
- yy.lex.c : The lexical analyzer generated by running Lex with the command `lex myTiger.lex`.
- y.tab.h : Header defining Yacc data types (including token definitions), generated by running Yacc with the command `yacc -dv myTiger.y`.
Alternative files:
- myTokens.h : Header defining tokens. Users can replace `y.tab.h` with this file in this part, but `y.tab.h` is needed if you want to build the whole compiler.