Step 1: Enriching initial transaction data - aclong/dummy_data_linkage GitHub Wiki

The code for this step is in this code.

As each transaction records the machine it was made on, grouping the transactions by machine tracks the movement of that machine throughout the day. As each transaction also records the operator code and the route number, the transactions can also be grouped by these details (a machine can be used on more than one bus route during the day).
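
As a rough illustration of this grouping, here is a minimal Python sketch. The record layout, the field order and the sample values are all assumptions made for the example; they are not taken from the real data or from the SQL used in this project.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical records: (machine_id, operator_code, route_no, timestamp, fare_stage)
transactions = [
    ("M1", "OP1", "10", datetime(2020, 1, 6, 8, 5), 3),
    ("M1", "OP1", "10", datetime(2020, 1, 6, 8, 0), 1),
    ("M1", "OP1", "12", datetime(2020, 1, 6, 13, 0), 2),
    ("M2", "OP1", "10", datetime(2020, 1, 6, 8, 2), 5),
]


def group_transactions(rows):
    """Group rows by (machine, operator, route) and sort each group by time."""
    groups = defaultdict(list)
    for machine, operator, route, ts, stage in rows:
        groups[(machine, operator, route)].append((ts, stage))
    for key in groups:
        groups[key].sort()  # tuples sort by timestamp first
    return dict(groups)


grouped = group_transactions(transactions)
```

Each value in `grouped` is then the time-ordered trace of one machine on one route, which is the unit the later steps work on.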

Besides time, the other dimension I use to organise the data is the fare stage. The driver enters this on the ticket machine so that it generates the correct price for tickets. The fare stage moves up as the bus travels in one direction along a route (this will be called "out"), and down as it moves in the other ("in").

This means that the fare stage, or rather the change in fare stage, can be used to work out the direction of travel of the machine, and therefore of the bus it is on.
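
The idea can be sketched in a few lines of Python. The function name `direction` and the sample fare stages are hypothetical; the only rule taken from the text is that a rising fare stage means "out" and a falling one means "in". Treating an unchanged fare stage as "no information" is an assumption of this sketch.

```python
def direction(prev_stage, next_stage):
    """Return "out" if the fare stage rises, "in" if it falls, None if unchanged."""
    if next_stage > prev_stage:
        return "out"
    if next_stage < prev_stage:
        return "in"
    return None  # assumption: no change in fare stage gives no direction


# Hypothetical fare stages of consecutive transactions on one machine
stages = [1, 3, 5, 5, 4, 2]
steps = [direction(a, b) for a, b in zip(stages, stages[1:])]
```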

See below for a graph that shows the results for one machine. It plots the change in fare stage on the y axis against the transaction datetime on the x axis.

(INSERT FIG OF NON-DIRECTIONAL TRANSACTION POINTS)

I will use the relative position in time and fare stage to link these transactions together by direction (whether it is a step up or down in fare stage). Once the transactions have been linked by direction, I will label each run of joined transactions of the same direction within a given framework as a "journey": one movement in one direction along the course of the bus route.

(INSERT FIG WITH DIRECTIONS HERE)

This technique is implemented in SQL and assigns a direction to any transaction that is within a given time window of the next one. I am working on a version where the time window is flexible for the different routes, taking into account how long a given bus route is, so that a flexible heuristic decides whether or not two transactions can be considered part of the same journey.
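
The fixed-window version might look something like the following Python sketch. It is not a translation of the project's SQL: the 20-minute window, the function name and the sample data are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=20)  # assumed value, not taken from the project


def directions_within_window(txns, window=WINDOW):
    """txns: time-sorted list of (timestamp, fare_stage) for one machine and route.

    Returns one label per transaction: "out", "in", or None when the next
    transaction is outside the window or at the same fare stage.
    """
    labels = []
    for (ts, stage), (next_ts, next_stage) in zip(txns, txns[1:]):
        if next_ts - ts > window or next_stage == stage:
            labels.append(None)
        elif next_stage > stage:
            labels.append("out")
        else:
            labels.append("in")
    if txns:
        labels.append(None)  # the last transaction has no successor
    return labels


t0 = datetime(2020, 1, 6, 8, 0)
txns = [
    (t0, 1),
    (t0 + timedelta(minutes=5), 3),
    (t0 + timedelta(minutes=60), 4),
    (t0 + timedelta(minutes=65), 2),
]
labels = directions_within_window(txns)
```

A per-route window could be passed through the `window` argument, for example one derived from the typical end-to-end running time of the route.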

With a relatively busy route like the one shown above, the time window isn't necessary: the code could instead find each transaction's nearest neighbour at a different fare stage and assign a direction relative to that.
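
One possible reading of this nearest-neighbour variant is sketched below in Python. The function name, the tie-breaking behaviour and the sample data are assumptions; the sketch also uses a simple quadratic scan, which is fine for one machine's daily transactions but is not tuned for large volumes.

```python
from datetime import datetime, timedelta


def nearest_neighbour_directions(txns):
    """txns: list of (timestamp, fare_stage) for one machine and route.

    For each transaction, find the transaction closest in time that has a
    different fare stage, then label the direction from the earlier of the
    two to the later: "out" if the fare stage rises, "in" if it falls.
    """
    labels = []
    for ts, stage in txns:
        candidates = [(o_ts, o_stage) for o_ts, o_stage in txns if o_stage != stage]
        if not candidates:
            labels.append(None)  # every transaction shares this fare stage
            continue
        # Ties in time distance go to the first candidate in list order (assumption)
        o_ts, o_stage = min(candidates, key=lambda c: abs(c[0] - ts))
        earlier, later = (stage, o_stage) if o_ts > ts else (o_stage, stage)
        labels.append("out" if later > earlier else "in")
    return labels


t0 = datetime(2020, 1, 6, 8, 0)
txns = [
    (t0, 2),
    (t0 + timedelta(minutes=3), 2),
    (t0 + timedelta(minutes=6), 4),
    (t0 + timedelta(minutes=30), 3),
]
labels = nearest_neighbour_directions(txns)
```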

The code then does a final pass in which it treats each change in direction, or gap between transactions larger than the time window, as the start of a "new" journey.
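
A minimal Python sketch of such a final pass follows. It assumes the transactions are already sorted by time and labelled with a direction; the function name, the numbering from 1 and the 20-minute window are illustrative choices, not details of the SQL implementation.

```python
from datetime import datetime, timedelta


def assign_journeys(labelled, window):
    """labelled: time-sorted list of (timestamp, direction) for one machine and route.

    Returns a journey number per transaction. A new journey starts whenever
    the direction changes or the gap since the previous transaction exceeds
    the window.
    """
    journeys = []
    current = 1
    prev = None
    for ts, direction in labelled:
        if prev is not None:
            prev_ts, prev_direction = prev
            if direction != prev_direction or ts - prev_ts > window:
                current += 1
        journeys.append(current)
        prev = (ts, direction)
    return journeys


t0 = datetime(2020, 1, 6, 8, 0)
labelled = [
    (t0, "out"),
    (t0 + timedelta(minutes=5), "out"),
    (t0 + timedelta(minutes=10), "in"),
    (t0 + timedelta(minutes=15), "in"),
    (t0 + timedelta(minutes=90), "in"),
]
journeys = assign_journeys(labelled, window=timedelta(minutes=20))
```

In this example the third transaction starts journey 2 because the direction flips, and the last one starts journey 3 because of the 75-minute gap, even though the direction is unchanged.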