Lexical Analysis
Lexical analysis checks whether a stream of characters (separated by white space) is a valid token. Let's take a snippet written in C as an example:
```
const long var_long = 9;
char 09char = 'A';
```
`const`, `long`, `9`, `'A'`, `;`, and `09char` are all character streams separated by one or more white-space characters. But only the first five are valid tokens; the last one is invalid, since identifiers (tokens) in C can only consist of alphanumeric characters or underscores and cannot start with a digit.
That is what this phase does: it analyzes the contents of files written in the Tiger language to see whether they are all valid tokens. For details about the definition of identifiers, see the Tiger Language Reference Manual page of this wiki.
- do.h
- do.c

Users can use the interfaces provided in these files to do the lexical analysis.
Here, as described in the Tiger book, I used Lex to do the lexical analysis work of this compiler.
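For readers unfamiliar with Lex, the sketch below shows the conventional way a Lex-generated scanner is driven. It only uses the standard Lex externs (`yylex`, `yytext`, `yyin`) and is not the interface exposed by do.h; it just illustrates what the generated scanner provides.

```c
/* Minimal sketch: drive a Lex-generated scanner and print every token.
 * This shows the standard Lex interface, not the project's do.h API. */
#include <stdio.h>

extern int   yylex(void);   /* scanner generated from myTiger.lex            */
extern char *yytext;        /* text of the current token (declared as in flex) */
extern FILE *yyin;          /* input stream the scanner reads from           */

int main(int argc, char **argv)
{
    if (argc > 1) {
        yyin = fopen(argv[1], "r");   /* scan a source file instead of stdin */
        if (!yyin) { perror(argv[1]); return 1; }
    }

    int token;
    while ((token = yylex()) != 0)    /* yylex() returns 0 at end of input   */
        printf("token %d: \"%s\"\n", token, yytext);
    return 0;
}
```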
Position tracking is necessary for two main reasons:
- Use in later phases: In this compiler, syntax analysis and semantic analysis are done at different times, but both phases need position information to do meaningful work, so we have to keep token positions (one way to do this is sketched after this list).
- Error reporting: There are also many phases where position information is needed when reporting errors.
Source files are actually plain strings written in a certain format. Among these strings, some carry important data, such as integer values and string values. So, while processing, we have to store these data somewhere. Here, I set aside some memory locations to hold them (see lexString.c).
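A typical way to collect a string literal is to append each matched chunk to a growable buffer while the scanner is inside its string rule. The sketch below uses hypothetical names and is not the actual lexString.c interface.

```c
/* Hypothetical growable buffer for collecting the characters of a string
 * literal (not the actual lexString.c interface). */
#include <stdlib.h>
#include <string.h>

static char  *buffer   = NULL;
static size_t length   = 0;
static size_t capacity = 0;

void stringStart(void)                 /* called at the opening '"'      */
{
    if (!buffer) {
        capacity = 32;
        buffer = malloc(capacity);
    }
    length = 0;
    buffer[0] = '\0';
}

void stringAppend(const char *text)    /* called for each matched chunk  */
{
    size_t textLen = strlen(text);
    if (length + textLen + 1 > capacity) {
        capacity = (length + textLen + 1) * 2;
        buffer = realloc(buffer, capacity);
    }
    memcpy(buffer + length, text, textLen);
    length += textLen;
    buffer[length] = '\0';
}

const char *stringFinish(void)         /* called at the closing '"'      */
{
    return buffer;                     /* valid until the next string    */
}
```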
Some special characters are hard to type and to see; these include non-printable characters and control characters. The Tiger language supports escape sequences for them. To treat these "different" characters uniformly, we have to translate the escape sequences into the real characters they stand for.
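As an illustration of such a translation (not the actual stringEscape interface; the set of escapes handled here is only a guess based on the Tiger book), a single escape sequence could be decoded like this:

```c
/* Hypothetical helper: translate one escape sequence (starting at the
 * backslash) into the real character it denotes. Only a few common
 * Tiger-book escapes are shown. */
char translateEscape(const char *escape)
{
    switch (escape[1]) {               /* escape[0] is the backslash */
    case 'n':  return '\n';            /* newline            */
    case 't':  return '\t';            /* horizontal tab     */
    case '"':  return '"';             /* double quote       */
    case '\\': return '\\';            /* backslash itself   */
    default:
        /* \ddd: the character whose ASCII code is the 3 decimal digits */
        return (char)((escape[1] - '0') * 100 +
                      (escape[2] - '0') * 10 +
                      (escape[3] - '0'));
    }
}
```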
CRLF (carriage return + line feed) and LF (line feed) are two different representations of a newline, used on Windows and on Linux respectively. So, under Linux systems, a newline is written as `\n`; under Windows systems, it is written as `\r\n`.
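If source files may come from either platform, one option (an assumption here, not necessarily what myTiger.lex does) is to let the newline-counting code recognize both forms:

```c
/* Hypothetical check: treat both LF and CRLF as a single newline when
 * counting lines, so positions stay correct for files from either platform. */
int isNewline(const char *text, int *consumed)
{
    if (text[0] == '\r' && text[1] == '\n') { *consumed = 2; return 1; }  /* Windows CRLF */
    if (text[0] == '\n')                    { *consumed = 1; return 1; }  /* Linux LF     */
    *consumed = 0;
    return 0;
}
```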
When the analysis encounters an error, it needs to report it, for example by printing the error position and the offending text.
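A reporting routine might look roughly like the sketch below, which reuses the hypothetical `lineStarts` table from the position-tracking sketch above to turn a flat character position back into a line and column. The names are illustrative, not myReport's real interface.

```c
/* Hypothetical error reporter: convert a flat character position into
 * line/column using the lineStarts table, then print a formatted message. */
#include <stdarg.h>
#include <stdio.h>

extern int  lineCount;          /* from the position-tracking sketch           */
extern int *lineStarts;
extern const char *fileName;    /* assumed to be set when scanning starts      */

void reportError(int position, const char *format, ...)
{
    int line = lineCount;
    while (line > 0 && lineStarts[line - 1] > position)
        --line;                 /* number of line starts at or before position */
    int column = (line > 0) ? position - lineStarts[line - 1] + 1 : position;

    fprintf(stderr, "%s:%d.%d: error: ", fileName, line + 1, column);

    va_list args;
    va_start(args, format);
    vfprintf(stderr, format, args);
    va_end(args);
    fputc('\n', stderr);
}
```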
Lex files:
- myTiger.lex

Error report and position tracking:
- myReport.h
- myReport.c

String escape:
- stringEscape.h
- stringEscape.c
Special files:
- yy.lex.c : The lexical analyzer generated by running Lex with the command `lex myTiger.lex`.
- y.tab.h : Header defining Yacc data types (including token definitions), generated by running Yacc with the command `yacc -dv myTiger.y`.
Alternative files:
- myTokens.h : Header defining tokens. Users can replace `y.tab.h` with this file in this part, but `y.tab.h` is needed if you want to build the whole compiler.