
Writing an input tokeniser

Any parser, be it in-line or offline, requires a source of sequenced input tokens. Each token must have a type drawn from the token types declared in the tokens section of the grammar description, and can be accompanied by a value. The value can be of any data type, but that type must be known wherever the value is referred to in the parser's action functions.

Exactly the same data structure is used for finite state machines. The only difference is that the state machine engine itself uses nothing but the token type; the other members are ignored by the engine. However, the IToken currently being processed is accessible via the CurrentEvent property of the FSM class.
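
For example, an action function in the application-specific state machine class can inspect the full token behind the event that triggered a transition. The class and method below are a minimal sketch; only the FSM base class and its CurrentEvent property come from the library:

    public class MyFSM : FSM
    {
        // Illustrative action function: report the event that
        // triggered the current transition, using the token's
        // type, value and positional information
        public void LogEvent()
        {
            IToken evt = CurrentEvent;
            Console.WriteLine($"Event {evt.Type} at {evt.Position}: {evt.Value}");
        }
    }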

Consequently, any parser or state machine expects to be fed a sequence of token type-value pairs to parse and act upon. Each element of that sequence implements the following interface:

    public interface IToken
    {
        /// <summary>
        /// The weakly-typed representation of the type of event,
        /// being a member of the list of token types retrieved
        /// from the parser's Tokens two-way map by name lookup.
        /// </summary>
 
        int Type
        {
            get;
        }
 
        /// <summary>
        /// Data captured by the tokeniser that
        /// accompanies the input token type. For
        /// example, a token type might be
        /// INTEGER, and the Value property would
        /// be the parsed value of that integer.
        /// </summary>
 
        object Value
        {
            get;
            set;
        }
 
        /// <summary>
        /// Positional information for where the token came from.
        /// Could be the line number from which the token came,
        /// or a token number in a sequence.
        /// </summary>
         
        string Position
        {
            get;
        }
    }

In this data structure, the Type property is an integer representation of the kind of token received from the tokeniser. The names and values for each of these token types are specified in the tokens section of the input grammar description. You may be surprised that this is an integer rather than an enumerated type. The reason is that we support inline parsing: the grammar description is read at runtime, so it is not plausible to extract the token names and values from it, create an enum type from them, and then have the already-compiled runtime code refer to that new type. At least, it is not possible without resorting to reflection, which can be unacceptably slow.

Instead, we use plain integer values to represent the token types, and provide a mechanism to look up a token's name from its value and a token's value from its name. An input tokeniser that wants to be immune to changes of token values within the grammar description can use this mechanism to look up token values by name. The mechanism is a two-way map, implemented as two dictionaries back-to-back: one looks up an integer value from a string name, the other looks up a string name from an integer value.
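
As a sketch of that lookup, assuming the TwoWayMap<string, int> type exposes an indexer in each direction (the exact API may differ):

    // Forward lookup: find the integer type for a named token, so
    // the tokeniser survives renumbering of tokens in the grammar
    int integerType = tokens["INTEGER"];

    // Reverse lookup: recover the token's name from its integer
    // value, useful when reporting errors or tracing
    string tokenName = tokens[integerType];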

The Value property is weakly typed as an object. This allows the tokeniser to return arbitrary data types as the values associated with a particular type of token. These values are only encountered in the guard or action functions the programmer adds to their application-specific parser class. Since those functions are written to handle specifically recognised contexts in the parsing process, it is assumed that the developer knows the actual type to cast to when retrieving the values within those functions.
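
For example, if the parsing context guarantees that the token in hand carries an int payload, a direct cast is safe; a type test can be used where the payload is less certain. Here token stands for whatever IToken the guard or action function has access to:

    // The parsing context guarantees an integer payload here
    int value = (int)token.Value;

    // Defensive alternative when the payload type may vary
    if (token.Value is string text)
        Console.WriteLine($"Token carried the string '{text}'");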

The Position property lets the input tokeniser optionally provide positioning information to the parser, for use in error reporting. For example, if the input token stream consists of symbols from an input document, such as a programming language source file, the tokeniser might fill it in with a file name, line number, and perhaps the line offset at which the symbol appears. If the input token stream is an event stream, such as a list of workflow events, the position information might instead include the user that raised the event and its timestamp. Whatever form the position information takes, it is used for reporting on the debug stream and on the error output stream attached to the parser.
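
If no concrete token class suitable for your tokeniser is already to hand, a minimal implementation of IToken is straightforward. The class name and constructor below are illustrative rather than part of the library:

    public class Token : IToken
    {
        public Token(int type, object value, string position)
        {
            Type = type;
            Value = value;
            Position = position;
        }

        // Token type, as found in the parser's Tokens two-way map
        public int Type { get; }

        // Optional payload, e.g. the numeric value of an INTEGER
        public object Value { get; set; }

        // Optional positional information, e.g. "myfile.src: line 42"
        public string Position { get; }
    }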

Writing the tokeniser

The input tokeniser is usually written to implement IEnumerable<IToken>, so that the tokeniser itself can be passed as the argument to the Parse method of the parser object:

    public class MyTokeniser : IEnumerable<IToken>
    {
        /// <summary>
        /// Constructor for our tokeniser. Is passed the
        /// two-way map between token names and integer type
        /// values, so that the tokeniser can return the
        /// correct token types in each IToken.
        /// </summary>

        public MyTokeniser(TextReader src, TwoWayMap<string, int> tokens)
        {
            // ... store the token map and input stream somewhere ...
        }

        /// <summary>
        /// Return a strongly-typed enumerator over the
        /// set of tokens in the input stream.
        /// </summary>
        /// <returns>The token stream enumerator</returns>

        public IEnumerator<IToken> GetEnumerator()
        {
            return somethingInThisClass.GetEnumerator();
        }

        /// <summary>
        /// Return a weakly-typed enumerator over the
        /// set of tokens in the input stream.
        /// </summary>
        /// <returns>The token stream enumerator</returns>

        IEnumerator System.Collections.IEnumerable.GetEnumerator()
        {
            return GetEnumerator();
        }
    }
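
The strongly-typed GetEnumerator is most naturally written as a C# iterator. In the sketch below, NextToken is a hypothetical private helper that scans one token from the stored TextReader and returns null at end of input; the actual scanning logic is application specific:

    public IEnumerator<IToken> GetEnumerator()
    {
        // Scan tokens one at a time, yielding each to the parser
        // until the hypothetical NextToken helper signals end of
        // input by returning null
        for (IToken tok = NextToken(); tok != null; tok = NextToken())
            yield return tok;
    }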

Attaching the tokeniser to a parser

The input tokeniser is attached to its parser at the moment of beginning a parse. The Parsing.Parser class, which is the base class of the application-specific parser class, has a method called Parse that is passed an enumerable of IToken objects. If we have structured the tokeniser so that it implements the enumerable interface, some sample implementation code might be:

    MyParser p = ParserFactory<MyParser>.CreateInstance();
    MyTokeniser t = new MyTokeniser(someSource, p.Tokens);
    bool success = p.Parse(t);

Notice how the parser has a Tokens property that exposes the two-way map between token names and their values, as captured from the input grammar description. This property is available on each parser instance, and also as a static property of the ParserFactory<MyParser> class.
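
When no parser instance is available, the same map can be fetched from the factory. A sketch, assuming the static property shares the Tokens name:

    TwoWayMap<string, int> tokenMap = ParserFactory<MyParser>.Tokens;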

Attaching the tokeniser to a finite state machine

The input tokeniser is attached to a state machine in one of two ways. Either an enumerable interface is implemented, as in the parser example above, or, since a state machine is intended to be event driven, the event source can make individual calls on the state machine as events arrive. In the latter case the event source pushes events into the state machine engine, rather than the engine pulling events from an iterator. The Parsing.FSM class, which is the base class of the application-specific state machine class, has a method called Run that is passed an enumerable of IToken objects. If we have structured the tokeniser so that it implements the enumerable interface, some sample implementation code might be:

    MyFSM p = FSMFactory<MyFSM>.CreateInstance();
    MyTokeniser t = new MyTokeniser(someSource, p.Tokens);
    bool success = p.Run(t);

Notice how the state machine has a Tokens property that exposes the two-way map between token names and their values, as captured from the input grammar description. This property is also available as a static property of the FSMFactory<MyFSM> class.

As mentioned, the alternative way to couple the state machine object to its event source is to have the event source pass input events to it one at a time as they arrive. To do this, we call the method ProcessEvent(IToken nextEvent) on the FSM instance once it has been created. Note that if the event is not valid for the current state, and the state machine has not been set up to ignore unrecognised events, this method returns false. Otherwise it returns true, even in the presence of unrecognised events, which it just ignores.

The state machine object has a boolean property called Terminated. If a state transition causes the state machine to enter the terminated state, subsequent calls to ProcessEvent will return false, even if the state machine has been set up to ignore unrecognised events. Hence you should always check the Terminated flag between events if you suspect the event being fed to the state machine might terminate the machine.
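
A sketch of such an event pump follows, combining ProcessEvent with the Terminated check. The incomingEvents sequence stands for whatever source delivers events to the application:

    MyFSM fsm = FSMFactory<MyFSM>.CreateInstance();
    foreach (IToken evt in incomingEvents) // hypothetical event source
    {
        // Stop if the event was rejected, or if a transition
        // has driven the machine into its terminated state
        if (!fsm.ProcessEvent(evt) || fsm.Terminated)
            break;
    }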
