Writing Tokenisers - RopleyIT/GLRParser GitHub Wiki
Any parser, be it in-line or offline, requires a source of sequenced input
tokens. These tokens must have a token type selected from the list of token
types listed in the tokens
section of the grammar description. Each
token can be accompanied by a value. The value can be of any data type, but that
type should be known wherever the value is referred to in the parser's action
functions.
Exactly the same data structure is used for finite state machines. The only
difference in this case is that everything except the token type is not used by the state machine
engine. However, the IToken
currently being parsed in the state
machine is accessible via the CurrentEvent
property of the
FSM
class.
Consequently, any parser or state machine expects a sequence of token type-value pairs to be fed to it for it to parse and act upon. The data structure the parser or state machine consumes a sequence of is:
public interface IToken
{
/// <summary>
/// The weakly-typed representation of the type of event,
/// being a member of the list of token types retrieved
/// from the parser's Tokens two way map by name lookup.
/// </summary>
int Type
{
get;
}
/// <summary>
/// Data captured by the tokeniser that
/// accompanies the input token type. For
/// example, a token type might be
/// INTEGER, and the Value property would
/// be the parsed value of that integer.
/// </summary>
object Value
{
get;
set;
}
/// <summary>
/// Positional information for where the token came from.
/// Could be the line number from which the token came,
/// or a token number in a sequence.
/// </summary>
string Position
{
get;
}
}
In this data structure, the Type
property is an integer
representation of the kind of token that was received from the tokeniser. The
names and values for each of these token types are specified in the input
grammar description in the tokens
section. You may be surprised
that this is an integer rather than an enumerated data type. The reason for this
is that we are supporting inline parsing. It is not plausible to read a grammar
description, extract the token names and values from it, and create an
enum
type based on the token names and values, and to then have the
runtime implementation know about that new data type. At least, it is not
possible without resorting to reflection, which can be unacceptably slow.
Instead, we use plain integer values to represent the token types, and provide a mechanism to look up a token's name from its value, and a token's value from its name. An input tokeniser that wants to be immune from changes to token values within the grammar description can use this mechanism to look up token values by name. The mechanism is a two-way map, that is implemented as two dictionaries back-to-back, one providing lookup of an integer value from a character string name, the other providing lookup of a character string name from the integer value.
The Value
property is weakly typed as an object
.
This allows the tokeniser to return arbitrary data types as the values
associated with a particular type of token. These values are only encountered in
the guard or action functions the programmer adds to their application specific
parser class. Since these functions are written to handle specifically
recognised contexts in the parsing process, it is asssumed that the developer
knows the actual type to cast to, when retrieving these values within those
functions.
The Position
property allows the input tokeniser optionally to
provide positioning information to the parser, for use in error reporting. For
example, if the input token stream were symbols from an input document, like a
programming language, it might be filled in by the tokeniser with a file name,
line number, and maybe a line offset at which the symbol appears in the input
file. If the input token stream is an event stream, such as a list of workflow
events for example, the position information might include the user that raised
the event, and the timestamp for the event. Whatever nature the position
information holds, it is used for reporting on the debug stream and on the error
output stream attached to the parser.
The input tokeniser is usually constructed to
implement an IEnumerable<IToken>
so that the tokeniser itself can be passed as
the argument to the Parse
method of the parser object:
public class MyTokeniser : IEnumerable<IToken>
{
/// <summary>
/// Constructor for our tokeniser. Is passed the
/// two-way map between token names and integer type
/// values, so that the tokeniser can return the
/// correct token types in each IToken.
/// </summary>
public MyTokeniser(TextReader src, TwoWayMap<string, int> tokens)
{
// ... store the token map and input stream somewhere ...
}
/// <summary>
/// Return a strongly-typed enumerator over the
/// set of tokens in the input stream.
/// </summary>
/// <returns>The token stream enumerator</returns>
public IEnumerator<IToken> GetEnumerator()
{
return somethingInThisClass.GetEnumerator();
}
/// <summary>
/// Return an enumerator over the
/// set of tokens in the input stream.
/// </summary>
/// <returns>The token stream enumerator</returns>
IEnumerator System.Collections.IEnumerable.GetEnumerator()
{
return somethingInThisClass.GetEnumerator();
}
}
The input tokeniser is attached to its parser at the moment of beginning a
parse. The Parsing.Parser
class, which is the base class of the
application-specific parser class, has a method called Parse
that is passed an
enumerable of IToken
objects. If we have structured the tokeniser
so that it implements the enumerable interface, some sample implementation code
might be:
MyParser p = ParserFactory<MyParser>.CreateInstance();
MyTokeniser t = new MyTokeniser(someSource, p.Tokens);
bool success = p.Parse(t);
Notice how the parser has a property Tokens
that exposes the
two way map between token names and their values as captured from the
input grammar description. This property is available in each parser instance,
or as a static property of the ParserFactory<MyParser>
class.
The input tokeniser is attached to a state machine in one of two ways. Either an enumerable interface
is implemented as for the parser example above, or since a state machine is intended
to be event driven, it is possible to have the event source make individual calls on
the state machine as events arrive. Thus the event source is pushing events onto the
state machine engine, rather than the engine pulling events from an iterator.
The Parsing.FSM
class, which is the base class of the
application-specific state machine class, has a method called Run
that is passed an
enumerable of IToken
objects. If we have structured the tokeniser
so that it implements the enumerable interface, some sample implementation code
might be:
MyFSM p = FSMFactory<MyFSM>.CreateInstance();
MyTokeniser t = new MyTokeniser(someSource, p.Tokens);
bool success = p.Run(t);
Notice how the state machine has a property Tokens
that exposes the
two way map between token names and their values as captured from the
input grammar description. This property is also available as a static property
of the FSMFactory<MyFSM>
class.
As mentioned, the alternative way to couple the state machine object to its
event source is to have the event source pass it input events one at a time as
they arrive at the state machine application. To do this, we make calls upon the
method ProcessEvent(IToken nextEvent)
in the FSM instance once it
has been created. Note that if the event was not valid for the current state,
and the state machine has not been set up to ignore unrecognised events, this
method will return false. Otherwise it will return true even in the presence of
unrecognised events, which it just ignores.
The state machine object has a boolean property called Terminated
.
If a state transition causes the state machine to enter the terminated state,
subsequent calls to ProcessEvent
will return false
,
even if the state machine has been set up to ignore unrecognised events. Hence
you should always check the Terminated
flag between events if you
suspect the event being fed to the state machine might terminate the machine.