Data Pipelining and Random Number Sequencing Design - ActivitySim/activitysim GitHub Wiki

Scope of Work

Use Cases

Restartable model runs. For example, we have a model with three sub-models A, B, and C. Yesterday we ran sub-models A, B, and C and today we want to just run sub-model C with different settings. We need to keep track of the state of the system after sub-models A and B are run since they are presumably inputs to sub-model C.
We need to run a sub-model for a household, person, tour, trip, etc. and get the same random number draw regardless of which computer is used, whether other households are being simulated at the same time, or whether the inputs have changed (i.e. it is a different scenario). Examples include household vehicle ownership, person work location, and tour mode choice.
We need as-stable-as-possible random number sequencing across scenarios and sample rates for households, persons, tours, trips, sub-models, etc.
We need stable random numbers that keep working with restartable data pipelining.

Design Ideas

Create our own framework so it can work with orca
We are reducing our dependency on orca, but not abandoning it since that would be too expensive
Orca tables are saved to the datastore as pandas DataFrames and then wrapped as orca tables again on I/O
Each household, person, tour, trip, and sub-model has a random number stream and offset. For example, when the model runs sub-model A it uses the first offset, sub-model B uses the second offset, and sub-model C uses the third offset. If we restart the model run at sub-model C, it sees in the datastore that sub-models A and B were run and that offsets 1 and 2 have already been used, so sub-model C still uses the third offset.
The offsets are by sub-model run order, not sub-model name; this is more flexible and avoids requiring an a priori dictionary
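
The run-order offset idea above can be sketched in a few lines of Python. This is a minimal illustration, not ActivitySim's implementation: `OffsetTracker`, `begin_step`, and the `completed_steps` argument are hypothetical names, and the tuple of completed steps stands in for whatever the datastore records.

```python
class OffsetTracker:
    """Hands out random-number offsets by sub-model run order (hypothetical)."""

    def __init__(self, completed_steps=()):
        # On a restart, completed_steps would be read back from the datastore;
        # replaying them reserves the offsets they already used.
        self.offsets = {}
        for name in completed_steps:
            self.begin_step(name)

    def begin_step(self, name):
        """Reserve and return the next unused offset (1-based, as in the text)."""
        if name in self.offsets:
            raise ValueError(f"sub-model {name!r} already has an offset")
        self.offsets[name] = len(self.offsets) + 1
        return self.offsets[name]


# Full run: A, B, and C take offsets 1, 2, and 3 in run order.
full_run = OffsetTracker()
full_offsets = [full_run.begin_step(name) for name in ("A", "B", "C")]

# Restart at C: A and B are already recorded, so C still gets offset 3.
restarted = OffsetTracker(completed_steps=("A", "B"))
restart_offset = restarted.begin_step("C")
```

Because the offset depends only on how many sub-models ran before, no dictionary of sub-model names has to be defined in advance.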

Quick Overview of the In-Development Implementation

Pipeline

The revised model run setup looks like this:

_MODELS = [
    'compute_accessibility',
    'school_location_simulate',
    'workplace_location_simulate',
    'auto_ownership_simulate',
    'cdap_simulate',
    'mandatory_tour_frequency',
    'mandatory_scheduling',
    'non_mandatory_tour_frequency',
    'destination_choice',
    'non_mandatory_scheduling',
    'tour_mode_choice_simulate',
    # 'trip_mode_choice_simulate'
]

# resume_after = 'mandatory_scheduling'  # uncomment to restart after this sub-model
resume_after = None  # None runs every sub-model from the start

pipeline.get_rn_generator().set_base_seed(0)  # global seed

pipeline.run(models=_MODELS, resume_after=resume_after)
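
As a rough illustration of what `resume_after` implies for the run loop, the sketch below picks the sub-models that still need to run. It assumes `resume_after` names the last sub-model whose results are kept; `models_to_run` is a hypothetical helper, not part of the `pipeline` module.

```python
def models_to_run(models, resume_after=None):
    """Return the sub-models that remain after resume_after (hypothetical helper)."""
    if resume_after is None:
        return list(models)  # fresh run: execute every sub-model
    if resume_after not in models:
        raise ValueError(f"unknown sub-model: {resume_after!r}")
    # Keep only the sub-models listed after the one we resume from.
    return list(models[models.index(resume_after) + 1:])


example_models = [
    'compute_accessibility',
    'school_location_simulate',
    'workplace_location_simulate',
    'auto_ownership_simulate',
]
remaining = models_to_run(example_models, resume_after='school_location_simulate')
```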

Here are the contents of the data pipeline HDF5 file, which stores the state of each pandas DataFrame after every sub-model that revises it. Each shape is [rows, columns]; you can see that the number of columns grows as the sub-models are run.

<class 'pandas.io.pytables.HDFStore'> File path: pipeline.h5

/accessibility/compute_accessibility              (shape->[25,21])

/checkpoints                                      (shape->[12,11])

/households/compute_accessibility                 (shape->[1000,64])
/households/auto_ownership_simulate               (shape->[1000,67])
/households/cdap_simulate                         (shape->[1000,68])

/land_use/compute_accessibility                   (shape->[25,49])

/mandatory_tours/mandatory_tour_frequency         (shape->[766,4])
/mandatory_tours/mandatory_scheduling             (shape->[766,5])

/non_mandatory_tours/non_mandatory_tour_frequency (shape->[1256,4])
/non_mandatory_tours/destination_choice           (shape->[1256,5])
/non_mandatory_tours/non_mandatory_scheduling     (shape->[1256,6])

/persons/compute_accessibility                    (shape->[1549,50])
/persons/school_location_simulate                 (shape->[1549,54])
/persons/workplace_location_simulate              (shape->[1549,59])
/persons/cdap_simulate                            (shape->[1549,64])
/persons/mandatory_tour_frequency                 (shape->[1549,69])
/persons/non_mandatory_tour_frequency             (shape->[1549,72])

/tours/tour_mode_choice_simulate                  (shape->[2022,37])
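
One way to read the listing above: to restore the state after a given sub-model, take the most recent version of each table written at or before that sub-model. The sketch below works on the key names only, under that assumption; `tables_as_of` is a hypothetical helper, the key list is a subset of the listing, and the actual implementation may rely on the `/checkpoints` table instead.

```python
def tables_as_of(store_keys, models, resume_after):
    """Map each table to its latest stored version at or before resume_after."""
    last = models.index(resume_after)
    run_order = {name: i for i, name in enumerate(models[:last + 1])}
    latest = {}  # table name -> (run order of the writing step, key)
    for key in store_keys:
        parts = key.strip('/').split('/')
        if len(parts) != 2 or parts[1] not in run_order:
            continue  # skip /checkpoints and versions written after resume_after
        table, step = parts
        if table not in latest or run_order[step] > latest[table][0]:
            latest[table] = (run_order[step], key)
    return {table: key for table, (_, key) in latest.items()}


models = [
    'compute_accessibility', 'school_location_simulate',
    'workplace_location_simulate', 'auto_ownership_simulate', 'cdap_simulate',
    'mandatory_tour_frequency', 'mandatory_scheduling',
    'non_mandatory_tour_frequency',
]
keys = [
    '/checkpoints',
    '/households/compute_accessibility',
    '/households/auto_ownership_simulate',
    '/households/cdap_simulate',
    '/mandatory_tours/mandatory_tour_frequency',
    '/mandatory_tours/mandatory_scheduling',
    '/non_mandatory_tours/non_mandatory_tour_frequency',
    '/persons/compute_accessibility',
    '/persons/mandatory_tour_frequency',
    '/persons/non_mandatory_tour_frequency',
]
state = tables_as_of(keys, models, resume_after='mandatory_scheduling')
```

Here `state` picks `/households/cdap_simulate` and `/persons/mandatory_tour_frequency`, and leaves out `non_mandatory_tours`, which did not exist yet at that point in the run.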

Random number sequencing

Random number generation is done using numpy's Mersenne Twister PRNG
ActivitySim uses a stream of random numbers for each household id, person id, tour id, (soon trip id), and model step offset
The seed (offset/starting point) is based on the global seed, household id, person id, tour id, (soon trip id), and model step offset. The equation looks something like this:

seed = chooser.index * (number of models for chooser) + (chooser model offset) + (global seed offset)

For example:
  household.id * 2 + 0 + 1
where:
  household.id = household table index
  2 = number of household level models - auto ownership and cdap
  0 = first household model - auto ownership
  1 = global seed offset for testing the same model under different random global seeds

Tour id is segmented by tour type
The sequencing is thread/process safe for eventual multiprocessor support
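
The seed formula above can be tried out with numpy's `RandomState`, which is its Mersenne Twister generator. This is only a sketch under the approximate formula given in the text: `chooser_seed` and `draw` are hypothetical names, and building a fresh generator for every draw is done for clarity, not speed.

```python
import numpy as np


def chooser_seed(chooser_id, num_models, model_offset, global_seed=0):
    """Seed per the approximate formula in the text (hypothetical helper)."""
    return chooser_id * num_models + model_offset + global_seed


def draw(chooser_id, num_models, model_offset, global_seed=0):
    """One uniform draw in [0, 1) from the chooser's own stream."""
    seed = chooser_seed(chooser_id, num_models, model_offset, global_seed)
    return np.random.RandomState(seed).random_sample()


household_ids = [3, 7, 11]

# Auto ownership is household model 0 of 2; the global seed offset is 1.
forward = {hh: draw(hh, 2, 0, 1) for hh in household_ids}
backward = {hh: draw(hh, 2, 0, 1) for hh in reversed(household_ids)}

# Each household gets the same draw no matter the processing order.
same_draws = forward == backward
```

Since each draw depends only on the chooser's id, the model offset, and the global seed, households can be processed in any order, or in separate processes, without changing their draws.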