Lex Best Practices - fordsfords/fordsfords.github.io GitHub Wiki

This page contains some best practices and basic concepts for users of the Unix-centric "lex" tool for generating lexical scanners. It assumes that you have tried to read an introduction/tutorial, or you are about to.

DISCLAIMER

I am not an expert with Yacc and Lex. In fact, I am literally just a beginner, having started reading about them yesterday.

So this page records the discoveries that smoothed my way to using the tool. The tutorials out there vary in quality (some are HORRIBLE), and few of them come right out and say some things that I think are critical to understanding how to use "lex".

That said, although I haven't yet found a tutorial that I like "a lot", here's one that taught me a lot: https://arcb.csc.ncsu.edu/~mueller/codeopt/codeopt00/y_man.pdf

The Three Basics

  1. You want your lex program to deal with the entire input file, and exit if it encounters something that doesn't scan right. (This is usually accomplished by having a "catch-all" error handler as the last rule.)
  2. If an input sequence matches more than one rule, Lex will select the rule that results in the *greatest* consumption (longest match). This is true even if the shorter matching rules appear earlier in the rule set.
  3. If an input sequence matches more than one rule with the same number of characters consumed, Lex will select the rule that appears earliest in the rule set.
None of the tutorials I read emphasized the importance of these items.
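
Here is a small sketch of basics 2 and 3 in one place, using the classic keyword-versus-identifier case. The identifier rule, the single-character catch-all, the `yywrap` and `main` definitions, and the file name `scanner.l` are my own additions for illustration, not something taken from a particular tutorial.

```lex
%{
#include <stdio.h>
%}

%%
"if"        printf("IF ");               /* Keyword rule, listed first. */
[a-z]+      printf("ID(%s) ", yytext);   /* Identifier rule. */
[ \t\n]+    ;                            /* Ignore (consume) whitespace. */
.           { fprintf(stderr, "unexpected character '%s'\n", yytext); return 1; }
%%

/* Returning 1 tells the scanner there is no more input after end of file. */
int yywrap(void) { return 1; }

int main(void)
{
    /* yylex() returns 0 at end of input, or 1 from the catch-all rule. */
    return yylex();
}
```

Built with something like `lex scanner.l && cc lex.yy.c -o scanner`, the input "if iffy if" should print "IF ID(iffy) IF ". The bare "if" is a tie between the first two rules, so the earlier one wins. The word "iffy" goes to the identifier rule because that rule consumes four characters instead of two.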

Example

By default, a Lex scanner prints any input that doesn't match a rule, and then continues as if nothing had happened. But for the rules to work as expected, you need to be confident that the scanner's input pointer is always at the start of something that should be the start of a token.

For example, consider the rule set:

%%
"if" printf("IF ");
[ \t\n]+ ;    /* Ignore (consume) whitespace. */

Let's say the input stream consists of "ifIwereKing". The scanner would successfully match the "if", leaving the pointer at "IwereKing". But this is not what most of us would consider the start of a token. It's the middle of an invalid token. At least most of us see it that way.

So if we want to restrict the input to only the words "if" separated by whitespace, use this:

%%
"if" printf("IF ");
[ \t\n]+ ;    /* Ignore (consume) whitespace. */
[^ \t\n]+ printf("error at '%s'\n", yytext);  /* Catch-all for bad tokens. */

That third rule matches any run of non-whitespace characters, so it catches anything not matched above. With the input "ifIwereKing", the first rule matches the "if" and the third rule matches all of "ifIwereKing". Since the third rule consumes more input, that's what Lex uses, even though the "if" rule comes first.

So, what about the input string "if "? Once again, both the first and the third rules match the "if". But they both consume the same amount of input. So the first one in the rule set is used.
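
To tie this back to the first basic (exit when something doesn't scan right), here is one way the rule set above might be filled out into a complete file. This is only a sketch: the `exit(1)` in the catch-all, the `yywrap` and `main` definitions, and the file name `ifonly.l` are my assumptions, and a real scanner would probably want to report a line number too.

```lex
%{
#include <stdio.h>
#include <stdlib.h>
%}

%%
"if"        printf("IF ");
[ \t\n]+    ;    /* Ignore (consume) whitespace. */
[^ \t\n]+   { fprintf(stderr, "error at '%s'\n", yytext); exit(1); }  /* Catch-all: stop at the first bad token. */
%%

/* Returning 1 tells the scanner there is no more input after end of file. */
int yywrap(void) { return 1; }

int main(void)
{
    yylex();          /* Scan all of standard input. */
    putchar('\n');    /* Finish the line of "IF " output. */
    return 0;
}
```

After `lex ifonly.l && cc lex.yy.c -o ifonly`, the input "if if" should print "IF IF " and exit with status 0. The input "if ifIwereKing" should print "IF ", report "error at 'ifIwereKing'" on standard error, and exit with status 1.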

A Few More Rules

Comments

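Lex does not have a comment syntax of its own for the rules section: anything that starts at the beginning of a line there is read as a pattern. (Yacc is more forgiving, but the same habit works for it too.) So, if you want to comment your lex and yacc input files, the safe approach is to place those comments in areas where lex and yacc expect C code to be. Those comments will just be handed to the C compiler as-is.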

Only the "/* */" form of comment works. Don't try "//".
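
As a rough illustration, these are the places where I would expect a comment to be safe in a lex file, because each one is a spot where C code is expected. The rules are borrowed from the example above; the `yywrap` and `main` definitions are filler I added so that the file is complete.

```lex
%{
/* OK: this block is copied verbatim into the generated C file. */
#include <stdio.h>
%}

%%
"if"        printf("IF ");   /* OK: trailing an action, as part of its C code. */
[ \t\n]+    {
                /* OK: inside a braced action. */
            }
%%

/* OK: in the user code section after the second %%. */
int yywrap(void) { return 1; }

int main(void)
{
    return yylex();
}
```

What I would avoid is a comment that starts in the first column of the rules section, since lex will try to read it as a pattern.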
