Data Pipelining and Random Number Sequencing Design - ActivitySim/activitysim GitHub Wiki

Scope of Work

Use Cases

  • Restartable model runs. For example, we have a model with three sub-models A, B, and C. Yesterday we ran sub-models A, B, and C and today we want to just run sub-model C with different settings. We need to keep track of the state of the system after sub-models A and B are run since they are presumably inputs to sub-model C.
  • We need to run a sub-model for a household, person, tour, trip, etc. and get the same random number draw regardless of which computer is used, whether other households are being simulated at the same time, or whether the inputs changed (i.e. it is a different scenario). Examples include household vehicle ownership, person work location, and tour mode choice.
  • We need as-stable-as-possible random number sequencing across scenarios and sample rates for households, persons, tours, trips, sub-models, etc.
  • We need stable random numbers with restartable data pipelining

Design Ideas

  • Create our own framework so it can work with orca
  • We are reducing our dependency on orca, but not abandoning it since that would be too expensive
  • Orca tables are being saved to the datastore as pandas data frames and then being wrapped as orca tables on I/O
  • Each household, person, tour, trip, sub-model has a random number stream and offset. For example, when the model runs sub-model A it uses the first offset, sub-model B uses the second offset, and sub-model C uses the third offset. If we restart the model run at sub-model C, it sees in the datastore that sub-models A and B were run and that offsets 1 and 2 have already been used as well.
  • The offsets are by sub-model run order, not sub-model name; this is more flexible and avoids requiring an a priori dictionary
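The offset idea above can be sketched in a few lines. This is a minimal illustration only, not the ActivitySim implementation: the function names `assign_offsets` and `next_offset` are hypothetical, and the list of completed steps stands in for whatever the datastore records.

```python
def assign_offsets(run_order):
    """Map each sub-model to an offset given by its position in the run order.

    Hypothetical helper: offsets start at 1 to match the "first, second,
    third offset" wording in the design notes.
    """
    return {name: offset for offset, name in enumerate(run_order, start=1)}


def next_offset(completed_steps):
    """Offset for the next sub-model, given the steps already recorded as run."""
    return len(completed_steps) + 1


run_order = ["A", "B", "C"]
offsets = assign_offsets(run_order)

# Restarting at C: the datastore shows A and B already ran, so offsets 1 and 2
# are taken and C resumes with offset 3, the same value it gets in a full run.
completed = ["A", "B"]
resume_offset = next_offset(completed)
```

Because the offset depends only on position in the run order, no dictionary of sub-model names has to be maintained ahead of time.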

Quick Overview of the In-Development Implementation

Pipeline

The revised model run setup looks like this:

_MODELS = [
    'compute_accessibility',
    'school_location_simulate',
    'workplace_location_simulate',
    'auto_ownership_simulate',
    'cdap_simulate',
    'mandatory_tour_frequency',
    'mandatory_scheduling',
    'non_mandatory_tour_frequency',
    'destination_choice',
    'non_mandatory_scheduling',
    'tour_mode_choice_simulate',
    # 'trip_mode_choice_simulate'
]

# resume_after = 'mandatory_scheduling'
resume_after = None

pipeline.get_rn_generator().set_base_seed(0)  # global seed

pipeline.run(models=_MODELS, resume_after=resume_after)
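To make the effect of `resume_after` concrete, here is a rough sketch of how the list of models to run might be derived from it. The helper `models_to_run` is hypothetical and is not part of the `pipeline` module; it only illustrates the intended behavior, assuming a resumed run skips every model up to and including the named one.

```python
def models_to_run(models, resume_after=None):
    """Return the models still to be run.

    Hypothetical helper: with resume_after=None the full list runs; otherwise
    everything up to and including resume_after is skipped.
    """
    if resume_after is None:
        return list(models)
    if resume_after not in models:
        raise ValueError("unknown model step: %s" % resume_after)
    return list(models[models.index(resume_after) + 1:])


# Shortened model list, for illustration only.
example_models = [
    'compute_accessibility',
    'auto_ownership_simulate',
    'mandatory_scheduling',
    'tour_mode_choice_simulate',
]

full_run = models_to_run(example_models)
resumed_run = models_to_run(example_models, resume_after='mandatory_scheduling')
```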

Here are the contents of the data pipeline HDF5 file, which stores the state of each pandas DataFrame after every sub-model that revises it. You can see that the number of columns grows as the sub-models are run.

<class 'pandas.io.pytables.HDFStore'> File path: pipeline.h5

/accessibility/compute_accessibility              (shape->[25,21])

/checkpoints                                      (shape->[12,11])

/households/compute_accessibility                 (shape->[1000,64])
/households/auto_ownership_simulate               (shape->[1000,67])
/households/cdap_simulate                         (shape->[1000,68])

/land_use/compute_accessibility                   (shape->[25,49])

/mandatory_tours/mandatory_tour_frequency         (shape->[766,4])
/mandatory_tours/mandatory_scheduling             (shape->[766,5])

/non_mandatory_tours/non_mandatory_tour_frequency (shape->[1256,4])
/non_mandatory_tours/destination_choice           (shape->[1256,5])
/non_mandatory_tours/non_mandatory_scheduling     (shape->[1256,6])

/persons/compute_accessibility                    (shape->[1549,50])
/persons/school_location_simulate                 (shape->[1549,54])
/persons/workplace_location_simulate              (shape->[1549,59])
/persons/cdap_simulate                            (shape->[1549,64])
/persons/mandatory_tour_frequency                 (shape->[1549,69])
/persons/non_mandatory_tour_frequency             (shape->[1549,72])

/tours/tour_mode_choice_simulate                  (shape->[2022,37])
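The listing suggests a key layout of `/<table>/<sub-model>`. Assuming that layout holds, the sketch below shows one way to work out which stored version of each table is current at a given restart point. The function `latest_checkpoint_per_table` is hypothetical, and it operates on a plain list of key strings rather than an open HDF5 store.

```python
def latest_checkpoint_per_table(keys, run_order, resume_after=None):
    """For each table, find the most recent sub-model that saved it.

    Hypothetical helper. Assumes keys look like '/<table>/<sub-model>';
    other keys (such as '/checkpoints') are ignored. If resume_after is
    given, versions saved by later sub-models are not considered.
    """
    position = {name: i for i, name in enumerate(run_order)}
    limit = position[resume_after] if resume_after is not None else len(run_order)
    latest = {}
    for key in keys:
        parts = key.strip('/').split('/')
        if len(parts) != 2:
            continue
        table, step = parts
        if step not in position or position[step] > limit:
            continue
        if table not in latest or position[step] > position[latest[table]]:
            latest[table] = step
    return latest


run_order = [
    'compute_accessibility', 'school_location_simulate',
    'workplace_location_simulate', 'auto_ownership_simulate',
    'cdap_simulate', 'mandatory_tour_frequency', 'mandatory_scheduling',
    'non_mandatory_tour_frequency', 'destination_choice',
    'non_mandatory_scheduling', 'tour_mode_choice_simulate',
]

# A subset of the keys from the listing above.
keys = [
    '/checkpoints',
    '/households/compute_accessibility',
    '/households/auto_ownership_simulate',
    '/households/cdap_simulate',
    '/persons/compute_accessibility',
    '/persons/cdap_simulate',
    '/persons/mandatory_tour_frequency',
    '/persons/non_mandatory_tour_frequency',
]

at_end = latest_checkpoint_per_table(keys, run_order)
at_resume = latest_checkpoint_per_table(
    keys, run_order, resume_after='mandatory_scheduling')
```

When resuming after `mandatory_scheduling`, the persons table saved by `non_mandatory_tour_frequency` is skipped in favor of the earlier `mandatory_tour_frequency` version.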

Random number sequencing

  • Random number generation is done using numpy's Mersenne Twister PRNG
  • ActivitySim uses a stream of random numbers for each household id, person id, tour id, (soon trip id), and model step offset
  • The seed (offset/starting point) is based on the global seed, household id, person id, tour id, (soon trip id), and model step offset. The equation looks something like this:
    chooser.index * number of models for chooser + chooser model offset + global seed offset

    For example:
      household.id * 2 + 0 + 1
    where:
      household.id = household table index
      2 = number of household level models (auto ownership and cdap)
      0 = first household model (auto ownership)
      1 = global seed offset for testing the same model under different random global seeds
  • Tour id is segmented by tour type
  • The sequencing is thread/process safe for eventual multiprocessor support
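The seed equation above can be written out as a small function. This is a sketch only: `chooser_seed` and `draw` are hypothetical names, and Python's standard-library `random.Random` (also a Mersenne Twister generator) is used as a dependency-free stand-in for numpy's generator, so the values drawn here will not match numpy's.

```python
import random


def chooser_seed(chooser_id, num_models, model_offset, global_seed_offset=0):
    """Seed from the equation in the text:

    chooser.index * number of models for chooser
        + chooser model offset + global seed offset
    """
    return chooser_id * num_models + model_offset + global_seed_offset


def draw(chooser_id, num_models, model_offset, global_seed_offset=0):
    """One uniform draw from a generator seeded for this chooser and model.

    Stand-in generator: random.Random instead of numpy's Mersenne Twister.
    """
    seed = chooser_seed(chooser_id, num_models, model_offset, global_seed_offset)
    return random.Random(seed).random()


# The worked example from the text, with a hypothetical household id of 5:
# 5 * 2 + 0 + 1
example_seed = chooser_seed(chooser_id=5, num_models=2, model_offset=0,
                            global_seed_offset=1)

# The draw for a household depends only on its own seed, not on which other
# households are simulated alongside it or in what order.
first = [draw(hh, 2, 0) for hh in (5, 6, 7)]
second = [draw(hh, 2, 0) for hh in (7, 5)]
```

For a fixed global seed offset, each (household, model) pair maps to a distinct seed, which is what lets each chooser keep its own stream across scenarios and sample rates.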