Multi Processing - ActivitySim/activitysim GitHub Wiki

This is all just prototype stuff now, but the general idea is to implement multiprocessing in as non-invasive a way as possible, at least at first.

So we keep the run list, but we annotate it using a second structure (multiprocess_steps) that indicates how to intervene.

multiprocess_steps is an array of dicts. Each multiprocess_step consists of one or more models that can be run in sequence, either as a single process or multiprocessed, with each process handling a subset of the model data.

label is a string used for logging and tagging output files.

Each step represents a set of model steps identified by the 'begin' key, which names the first model step. To avoid redundancy, the last model in the set is implicit: the set runs up to, but does not include, the first model of the next multiprocess_step (or through the rest of the models, for the last step).
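The implicit end of each step can be worked out from the 'begin' keys alone. The following is a minimal sketch of that rule, not the actual ActivitySim implementation: the function name resolve_step_ranges is hypothetical, the run list is illustrative (only the three 'begin' names come from the example config), and it assumes the 'begin' models appear in run list order.

```python
def resolve_step_ranges(run_list, multiprocess_steps):
    """Map each step label to the models it covers.

    A step runs from its 'begin' model up to, but not including, the
    'begin' model of the next step; the last step takes the remainder.
    """
    # list.index raises ValueError if a 'begin' model is not in the run list.
    starts = [run_list.index(step["begin"]) for step in multiprocess_steps]
    # Assumption: 'begin' models must follow run list order, without repeats.
    if any(a >= b for a, b in zip(starts, starts[1:])):
        raise ValueError("'begin' models must follow run list order")
    ends = starts[1:] + [len(run_list)]
    return {
        step["label"]: run_list[start:end]
        for step, start, end in zip(multiprocess_steps, starts, ends)
    }


# Hypothetical run list; only the three 'begin' names come from the example.
run_list = [
    "initialize_landuse",
    "compute_accessibility",
    "_school_location_sample",
    "workplace_location",
    "write_data_dictionary",
    "write_tables",
]
multiprocess_steps = [
    {"label": "mp_initialize", "begin": "initialize_landuse"},
    {"label": "mp_households", "begin": "_school_location_sample"},
    {"label": "mp_summarize", "begin": "write_data_dictionary"},
]

ranges = resolve_step_ranges(run_list, multiprocess_steps)
for label, models in ranges.items():
    print(label, models)
```

Note that this sketch silently drops any models that precede the first 'begin'; a real implementation would probably want to require that the first step begins at the first model.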

slice implicitly identifies a step as multiprocess. It contains instructions on how to slice the model data so that the different segments can be processed independently. Usually, this would be segmentation by household (all persons in a household must appear in the same segment because of intra-household dependencies). Other segmentations are possible, however, the most obvious being segmentation by zone for the accessibility calculation. Since the mtctm1 accessibility calculation is fast, we don't segment it in the example. The slice.tables entry contains a list of the tables to slice: the first entry is the primary table, followed by additional cascading dependencies (e.g. persons segmentation depends on households), following standard activitysim index_name/referring_column conventions. There is also the option of specifying a slice.except list to exclude tables from segmentation (e.g. to avoid slicing the land_use table when calculating accessibility).
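To make the cascading idea concrete, here is a small sketch in plain Python rather than the pandas tables ActivitySim actually uses. The function names, the round-robin assignment of households to processes, and the sample data are all assumptions for illustration; the source does not say how households are distributed. The point is only that persons follow their household into the same segment, while a table that is not sliced is passed whole to every process.

```python
def slice_primary(household_ids, num_processes):
    """Split the primary table's index into one segment per process.

    Assumption: round-robin assignment; the source does not specify
    how households are distributed across processes.
    """
    return [household_ids[i::num_processes] for i in range(num_processes)]


def slice_dependent(persons, household_segment):
    """Keep the persons whose referring column points into the segment."""
    keep = set(household_segment)
    return {pid: hh_id for pid, hh_id in persons.items() if hh_id in keep}


# Illustrative data: persons maps person_id -> household_id (referring column).
household_ids = [101, 102, 103, 104, 105]
persons = {1: 101, 2: 101, 3: 102, 4: 103, 5: 104, 6: 104, 7: 105}
land_use = {"zone_1": 500, "zone_2": 750}  # not sliced, shared by every process

segments = []
for hh_segment in slice_primary(household_ids, num_processes=2):
    segments.append(
        {
            "households": hh_segment,
            "persons": slice_dependent(persons, hh_segment),
            "land_use": land_use,
        }
    )

for i, segment in enumerate(segments):
    print(i, segment["households"], sorted(segment["persons"]))
```

Every person lands in exactly one segment, and always in the segment that holds their household, which is what the intra-household dependencies require.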

num_processes indicates the number of processors to devote to the step. It is an error for a single-process step to specify more than 1 processor, or for a multi-process step to specify fewer than 2. If not specified, the default value is 1 for single-process steps, and cpu_count for multi-process steps.
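These rules are simple enough to express as a small validation function. The sketch below is one possible reading, not the actual implementation: resolve_num_processes is a hypothetical name, a step is treated as multi-process exactly when it has a slice key, and cpu_count is taken to mean multiprocessing.cpu_count().

```python
import multiprocessing


def resolve_num_processes(step):
    """Return the number of processes for a step, applying the defaults."""
    # Assumption: the presence of 'slice' is what marks a step as multi-process.
    is_multiprocess = "slice" in step
    num_processes = step.get("num_processes")

    if num_processes is None:
        return multiprocessing.cpu_count() if is_multiprocess else 1
    if is_multiprocess and num_processes < 2:
        raise ValueError(f"{step['label']}: multi-process step needs at least 2")
    if not is_multiprocess and num_processes > 1:
        raise ValueError(f"{step['label']}: single-process step allows at most 1")
    return num_processes


single = {"label": "mp_initialize", "begin": "initialize_landuse"}
multi = {
    "label": "mp_households",
    "begin": "_school_location_sample",
    "num_processes": 3,
    "slice": {"tables": ["households", "persons"]},
}

print(resolve_num_processes(single))
print(resolve_num_processes(multi))
```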

chunk_size specifies a custom chunk size for the step. If not specified, the global chunk size is used; for multiprocess steps, it is divided by the number of processes so that the chunk sizes across processes sum to the global chunk_size.
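As a rough illustration of the arithmetic, the sketch below resolves the chunk size for one step. The function name is hypothetical, and two details are assumptions the text does not settle: a step-level chunk_size is used as given, and the global value is split with integer division.

```python
def resolve_chunk_size(step, global_chunk_size, num_processes):
    """Return the chunk size each process of a step should use."""
    # Assumption: a custom chunk_size is taken as-is, not divided further.
    if "chunk_size" in step:
        return step["chunk_size"]
    # With num_processes == 1 this is simply the global chunk size.
    # Assumption: integer division, so the total never exceeds the global value.
    return global_chunk_size // num_processes


global_chunk_size = 4000000000

# Single-process step with no custom value: the global chunk size applies.
print(resolve_chunk_size({"label": "mp_initialize"}, global_chunk_size, 1))

# Multiprocess step with no custom value: the global size is shared out.
print(resolve_chunk_size({"label": "mp_households"}, global_chunk_size, 4))

# Step with a custom value: used as specified.
print(
    resolve_chunk_size(
        {"label": "mp_households", "chunk_size": 1000000000}, global_chunk_size, 3
    )
)
```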

chunk_size: 4000000000

multiprocess_steps:
  - label: mp_initialize
    begin: initialize_landuse
  - label: mp_households
    begin: _school_location_sample
    num_processes: 3
    chunk_size: 1000000000
    slice:
      tables:
        - households
        - persons
  - label: mp_summarize
    begin: write_data_dictionary