Large Scale Performance Test #61
@DavidOry will post all the MTC TM1 matrices for RSG; I will convert them to OMX.
Also, I do not believe Fletcher fully built out the model spec files for tour mode choice. For reference, our UECs are here: https://github.com/MetropolitanTransportationCommission/travel-model-one/tree/master/model-files/model
Just to quantify the performance issues with 3D skims: with the current hack that just reuses the same in-memory matrix over and over, versus with the hack removed and the matrices read from disk (timings in the follow-up comment below).
In the 3/25/2016 project meeting, @toliwaga indicated he made revisions to skim.py to test re-reading the distance matrix from disk as opposed to just getting it from memory. Is @toliwaga's comment above the outcome of those tests, or is this something different? Does any code need to be checked in? What indicates that this task is complete?
With the in-memory hack removed, and reading the matrices from disk, time to execute step 'mode_choice_simulate' went from 3.24 s to 96.51 s. But that is just the tip of the iceberg. There are a number of other cascading shortcuts that make the actual time for a full scale run a whole lot longer than that. There were more such shortcuts than I expected, and I have been exploring them, hoping to reach closure in time for tomorrow's meeting, but I am happy to update you on my progress so far. Let me just write them up...
Thanks, I would appreciate it. As someone who is not intimately familiar with the code, I think it will help me be more prepared for our discussion.
In the current (master) version of activitysim, neither Skims3D nor mode_choice_simulate is fully implemented. The current version runs nice and fast, but there are a lot of compromises in the implementation. Skims3D provides the framework for two different strategies of skim handling: preloading skims up front, or loading them on demand.
Neither approach is fully implemented in practice. I have been exploring the on-demand approach, which does something that the internal documentation refers to as "being sneaky to make it go faster". This really has several parts. They are:
Although mode_choice_simulate references many different skims, with implicit key indexing of different time of day layers, the example only ever accesses one skim file (DISTANCE). Furthermore, as things stand, this file is only read three times (once for each time period defined in settings.yaml) instead of hundreds of times, because the stacked skim data is cached and subsequent requests for (different) skims are 'satisfied' by returning the same old stacked DISTANCE skim from memory. Needless to say, this is a lot faster than re-reading it from disk. The first thing I did was disable the caching, forcing the skim file (still DISTANCE) to be re-read from the omx file. That accounted for the increase in runtime from 3.24 seconds to 96.51 seconds that I mentioned above. But it was still reading the same wrong data over and over again. So next, I built an omnibus skim omx file containing all the mtc_tm1 skims for all time periods (using keys like DRIVEALONEFREE__PM) and read them in on demand. In other words, I made the following change:
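The actual diff is not reproduced above. Purely as a hedged sketch of the behavior being described (reading each requested skim/TOD layer from the omnibus OMX file on demand, instead of handing back the cached DISTANCE matrix), with the file path and function name made up; only the key format comes from the comment above:

```python
import numpy as np
import openmatrix as omx

skim_file = omx.open_file('mtc_tm1_skims.omx', 'r')   # hypothetical path
_skim_cache = {}

def get_skim(name, tod=None):
    # keys in the omnibus file look like 'DRIVEALONEFREE__PM'
    key = '%s__%s' % (name, tod) if tod else name
    if key not in _skim_cache:
        # read the requested matrix from disk on first use, rather than
        # returning the same stacked DISTANCE skim for every request
        _skim_cache[key] = np.asarray(skim_file[key])
    return _skim_cache[key]
```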
This increased the runtime to 320.71 seconds. This is now at least using the correct skims, but there are a few other issues...
mode_choice_simulate only runs _mode_choice_simulate for tours of type 'eatout', and runs them with dest TAZ 'workplace_taz' instead of the 'destination' taz that the destination.py step assigns to non mandatory tours. So I fixed mode.py to iterate over the various tour types, using the correct dest taz depending on tour type: workplace_taz for work tours, school_taz for school tours, and destination for the non mandatory tour types ('shopping', 'othdiscr', 'eatout', 'social', 'othmaint', 'escort'). Unfortunately the processor fails for work and school type tours, which is what I have been looking into today. Meanwhile, extending it to run all the non mandatory tours (skipping school and work) increased the runtime to 1864.96 seconds (31 minutes).
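A minimal sketch of that fix, with illustrative names; the commented call stands in for the real _mode_choice_simulate invocation:

```python
# map tour type to the column holding its destination TAZ
DEST_TAZ_COLUMN = {'work': 'workplace_taz', 'school': 'school_taz'}
NON_MANDATORY_TYPES = ('shopping', 'othdiscr', 'eatout',
                       'social', 'othmaint', 'escort')

def dest_taz_column(tour_type):
    # non mandatory tours use the 'destination' column assigned by destination.py
    return DEST_TAZ_COLUMN.get(tour_type, 'destination')

for tour_type in ('work', 'school') + NON_MANDATORY_TYPES:
    dest_col = dest_taz_column(tour_type)
    # _mode_choice_simulate(tours[tours.tour_type == tour_type], dest_col, ...)
```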
All these times are for runs with households_sample_size set to 10000 in settings.yaml. Increasing it may make things slower, but hopefully not by much if pandas is pulling its weight.
Presumably the walk and bike rows were commented out because the DISTWALK and DISTBIKE skims were lacking. Adding them back in will presumably lead to a minor performance hit. I have a couple of ideas in mind for what to do next, which I would be happy to discuss at the meeting.
But I am very open to any other ideas!
Ran out of memory in the first step (school_location_simulate) on my 16 gig macbook pro trying to run with household sampling turned off...
Let's try it on our San Diego server - sdmdlppw01; it has 160 GB of RAM.
Adding in work and school mode choice brings runtime to 2383 secs (39.7 minutes). That's with 10K household sample size and walk and bike rows commented out in tour_mode_choice.csv.
Since reading OMX matrices from disk into Python objects is relatively slow, we may try to first pickle all the skims so they are closer to Python native format when being re-read from disk.
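A sketch of that idea, assuming the skims become plain numpy arrays once read out of the OMX file; numpy's own binary format re-reads much faster than parsing the HDF5-backed OMX tables. Paths and key names below are illustrative:

```python
import numpy as np
import openmatrix as omx

def pickle_skims(omx_path, npz_path):
    f = omx.open_file(omx_path, 'r')
    try:
        skims = {name: np.asarray(f[name]) for name in f.listMatrices()}
    finally:
        f.close()
    np.savez(npz_path, **skims)   # raw numpy binary, fast to re-read

# re-reading: arrays in the .npz archive are loaded lazily, per key
skims = np.load('skims.npz')
dist_am = skims['DISTANCE__AM']   # hypothetical key
```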
I have a version working (in the large-scale-test branch) that lazy-loads and retains the skims so they are only read in once, which certainly improves performance. I am working on a version that forces the preload, since it is hard to benchmark and instrument the lazy-load version. It looks like there is room for significant improvement in both performance and memory footprint.
Preloading all (full scale) mtc skims used by the current model (which just uses tods 'AM', 'MD', 'PM') takes 1 min 31 secs and 8 GB of memory headroom, without running any models or loading any h5 (person/hh/landuse) data. So obviously the raw skim load is not what is slowing things down. I suspect the Skims3D stacking is the primary culprit. As things stand, _mode_choice_simulate is currently the only user of the Skims3D class, but this presumably can and will change. Skims3D.__init__ preloads (copying and stacking the matrices) any skims with tuple keys in the skims parameter, which is apparently expected ordinarily (though not required) to be the skims injectable. _mode_choice_simulate creates two Skims3D objects, which means it potentially consumes twice as much memory as needed when preloading skims passed in the skims parameter. On the other hand, when it loads skims lazily on demand from the omx file, it only loads the TOD layers that are actually used in the context of the dataframe it is applied to, which, depending on the chronological distribution of trips, may be a subset of the available layers. Also, of course, lazy loading only loads the skims that are actually used in the eval expressions.
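A rough sketch of the kind of stacking being described, not the actual Skims3D implementation: per-TOD layers of one skim are copied into a single 3D array so an (orig, dest, tod) triple can be looked up in one vectorized shot. The layer data here is synthetic:

```python
import numpy as np

tods = ['AM', 'MD', 'PM']
n_zones = 1454
rng = np.random.default_rng(0)
# stand-in for the per-TOD matrices read from the omx file
layers = {'DRIVEALONEFREE__%s' % t: rng.random((n_zones, n_zones)) for t in tods}

# copy and stack the TOD layers into shape (zones, zones, tods)
stacked = np.dstack([layers['DRIVEALONEFREE__%s' % t] for t in tods])

# vectorized lookup for parallel arrays of origin, destination, and TOD index
orig_idx = np.array([10, 20, 30])
dest_idx = np.array([5, 6, 7])
tod_idx = np.array([0, 2, 1])     # AM, PM, MD
values = stacked[orig_idx, dest_idx, tod_idx]
```

The copying is why preloading costs memory: each Skims3D object holds its own stacked copies of the matrices, so creating two of them doubles the footprint.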
Benchmarks (memory high water mark and runtime on a 16G Macbook Air) for running mtc_tm1_sf_test, the 20 TAZ mini dataset for test suite runs on travis. Showing times with skim preload vs. lazy load:

mtc_tm1_sf_test (20 TAZ skims, 1000 hh sample)
- preload: households_sample_size = 1000, max memory footprint = 0.83 GB
- lazy load: households_sample_size = 1000, max memory footprint = 0.8 GB

mtc_tm1_sf (190 TAZ skims, 1000 hh sample)
- preload: households_sample_size = 1000, max memory footprint = 0.79 GB
- lazy load: households_sample_size = 1000, max memory footprint = 0.61 GB
This is great! You said in an earlier comment that the memory footprint had been 8 GB, but this report indicates a max footprint of less than a gig. Is this the difference between the full skims and the subset, or something else?
Benchmarks (memory high water mark and runtime on a 16G Macbook Air) for running mtc_tm1, the full 1454 TAZ example dataset. Showing memory and runtimes with skim preload vs. lazy load, running with HH sample size of 1000 vs. 10,000. The model level runtime breakdown for the 10K runs suggests that some models are scaling better than others; these seem to be potential problems: workplace_location_simulate. I am running a 20K sample now to try to get some trend lines...

mtc_tm1 (1454 TAZ skims)
- preload: households_sample_size = 1000, max memory footprint = 10.06 GB
- lazy load: households_sample_size = 1000, max memory footprint = 1.62 GB
- preload: households_sample_size = 10000, max memory footprint = 9.62 GB ('school_location_simulate': 18.18 s)
- lazy load: households_sample_size = 10000, max memory footprint = 8.7 GB ('school_location_simulate': 18.26 s)
@danielsclint - yes, those are small skims; I just posted data for larger skims. The low memory footprint for the 1K sample with the full 1454 TAZ skims shows that lazy load does save memory compared to full skim preload (1.6 GB vs. 10 GB), but once the HH sample gets bigger, this advantage declines. Also, while the full skim preload has a memory high water mark of 10 GB, the actual size of the loaded skims is less than 3 GB, so a lot of the apparent overhead of preloading is probably due to the garbage collector not running while we cycle through the skims.
Interestingly, one of the major causes of the high initial runtime figures (from two weeks ago) was not actual skim handling, but an innocent-looking code fragment in the Skims3D __getitem__ function, which was being called for every skim reference for every tour type (i.e. thousands of times). Changing that code eliminated most of the per-call overhead.
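The original before/after snippets are not reproduced above. Purely as a hypothetical illustration of the pattern described, repeated per-call work in a hot __getitem__ that can be cached; none of these names are the actual activitysim code:

```python
class Skims3DSketch:
    """Illustrative only -- not the actual Skims3D implementation."""

    def __init__(self, stacked_skims):
        self.stacked_skims = stacked_skims
        self._cache = {}

    def __getitem__(self, key):
        # Building the returned object fresh on every call is cheap once,
        # but ruinous when __getitem__ runs for every skim reference for
        # every tour type; cache it and reuse it instead.
        if key not in self._cache:
            self._cache[key] = self.stacked_skims[key]
        return self._cache[key]
```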
... and those are skims presumably from an ABM integrated with static traffic assignment using conventional level-of-service skims. Over time, some of our ABMs will be integrated with dynamic traffic assignment using individual LOS through deep integration of individual trajectories. Just something to keep in mind as this project evolves. |
Second on Guy's comment. We will be looking to decide in the Fall whether we integrate our DTA with CT-RAMP or, hopefully, ActivitySim. |
As we suspected, the full example (all zones (1450) and all HHs (2.7m)) appears to have gotten stuck in work location choice. I first set up and ran the 20 zone example on our 20 core 160 GB San Diego Windows Server, and it ran within a couple of minutes. Next, I swapped in the full example inputs, and it ran through pre-loading of skims and school location choice in around 15 to 30 minutes, but has now been in work location choice for a couple of hours. The max memory usage during school location was ~125 GB. It is probably stuck in the roundtrip_auto_time_to_work virtual column issue that @toliwaga mentioned on the call. I'm going to let it run over the weekend just for fun.
I reopened #72 because it seems to be related to the "gotten stuck" @bstabler refers to above (or at least a similar phenomenon I encountered). I get some very peculiar behavior at the end of workplace_location_simulate, after adding the workplace_taz column and calling add_dependent_columns.
This ends up calling distance_to_work, which never returns. Maybe the apparently redundant workplace_taz column in persons.py was serving some arcane function that we didn't grok... I don't fully understand what is happening yet, but I notice that distance_to_work depends on workplace_taz, and in fact it is not returning from that call, stuck in a reindex call trying to create a pandas series.
The hang was caused by the skims.py injectables, which caused some very squirrelly behavior in which pandas spent a lot of time needlessly reshaping the requested objects, sometimes indefinitely.
This was fixed by 6f3ce30, but we are still running out of memory in workplace_location_simulate.
The out-of-memory error in workplace_location_simulate is a design issue. workplace_location_simulate calls interaction_simulate, which calls interaction_dataset to build a dataframe of choosers joined with a sample of alternative TAZ destinations, with all columns from each table: specifically, sample_size rows for each person in persons, merged with 182 columns from persons_merged and 19 columns from destination_choice_size_terms. This fully normalized table of persons_merged X alts_sample has 199 columns and is pretty big: (7,000,000 persons_merged) * (50 alternatives) * (199 columns) * (8 bytes per column) = 7000000 * 50 * 199 * 8 / (1024*1024*1024) ≈ 518.93 GB. There is no way the merge will succeed, even with 170 GB of RAM.
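Spelling that arithmetic out:

```python
rows = 7_000_000 * 50           # persons_merged x 50 sampled alternatives each
cells = rows * 199              # 199 columns in the merged table
size_gb = cells * 8 / 1024**3   # 8 bytes per float64 cell
print(round(size_gb, 2))        # 518.93
```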
One way to handle this, at least in the near term, is to chunk the computation (say into tranches of 500K persons) so it fits in memory. There is a lot to be said for retaining the simplicity of a fully normalized interaction_dataset. On the other hand, this already (even if we chunk the computation to fit in memory) comes at the cost of using a random sample of 50 candidate destinations.
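A minimal sketch of that chunking, assuming a user-supplied utility function over the merged chooser/alternative table. All names are illustrative (this is not the activitysim API), and the argmax stands in for a real MNL probability draw:

```python
import numpy as np
import pandas as pd

def chunked_choices(choosers, alternatives, utility_fn,
                    sample_size=50, chunk_size=500_000, seed=0):
    """Run chooser-x-sampled-alternatives choices one tranche at a time."""
    rng = np.random.default_rng(seed)
    out = []
    for start in range(0, len(choosers), chunk_size):
        chunk = choosers.iloc[start:start + chunk_size]
        # sample alternative rows independently for each chooser in the chunk
        alt_pos = rng.integers(0, len(alternatives), (len(chunk), sample_size))
        interaction = alternatives.iloc[alt_pos.ravel()].reset_index(drop=True)
        interaction['chooser_id'] = np.repeat(chunk.index.values, sample_size)
        utils = np.asarray(utility_fn(interaction)).reshape(len(chunk), sample_size)
        # placeholder: take the best sampled alternative instead of an MNL draw
        best = alt_pos[np.arange(len(chunk)), utils.argmax(axis=1)]
        out.append(pd.Series(alternatives.index.values[best], index=chunk.index))
    return pd.concat(out)
```

Only one tranche's interaction table is alive at a time, so peak memory scales with chunk_size rather than with the full population.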
Eventually, when we start distributing choices among multiple machines (a future task?), the population will need to be chunked anyway.
I think you're hinting at this in your comment, but in UrbanSim we make the alternative compromise: we chunk people into discrete types (the compromise being the loss of continuous variables). Say we have 20 person "types" that we interact with the whole choice set, which might be 20k zones in this case: (20 discretized person types) * (20,000 alternatives) * (199 columns) * (8 bytes per column) = 20 * 20000 * 199 * 8 bytes ≈ 636 MB, if I'm not mistaken. That leaves you a lot of room to add detail in how you discretize the persons data. This is my pet approach because it also lets you view (map) the PDF for each of the discretized person types, which gives a ton of transparency to the model that I don't think comes through by looking at coefficients. Anyway, I know you know this, but I just thought I'd voice the thought. And of course you sample from the PDF for all the persons of each type, which is trivially fast. Also, is there any reason to use doubles (large choice sets and small probabilities?)? It seems like using floats might halve the memory use right there...
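A minimal sketch of that discretize-then-sample approach, with illustrative names; zone_utilities stands in for whatever computes one utility vector per person type:

```python
import numpy as np
import pandas as pd

def choices_by_type(persons, type_cols, zone_utilities, seed=0):
    """Draw a destination per person from one PDF per discretized type."""
    rng = np.random.default_rng(seed)
    pieces = []
    for type_key, group in persons.groupby(type_cols):
        u = np.asarray(zone_utilities(type_key), dtype=float)  # one utility per zone
        p = np.exp(u - u.max())
        p /= p.sum()                                           # MNL PDF over all zones
        draws = rng.choice(len(p), size=len(group), p=p)       # trivially fast sampling
        pieces.append(pd.Series(draws, index=group.index))
    return pd.concat(pieces).reindex(persons.index)
```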
I guess the O-D problem kills you here too, as you have 20 types x 20k origin zones to the 20k destination zones, and then you're almost exactly back where you started. Does that sound right?
@fscottfoti yes, I agree it is a nice strategy, but the OD thing may mess it up. I am wondering if it might be possible to tweak the algorithm so that it is not necessary to actually do this join, but to use some kind of indirection instead, maybe in an apply.
It's certainly appealing. Are all 199 columns used in the utility calculation, or maybe you can force the user to ask for the subset that they will need? (We also wrote something in UrbanSim that looks for column names in the spec strings and then only fetches those columns, which worked well.) I don't think this will get you all the way there, but it could help.
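A minimal sketch of that column-scanning trick; the real UrbanSim helper is more involved, and this version only handles bare identifiers in the expressions:

```python
import ast

def columns_in_spec(expressions, available_columns):
    """Return only the available columns actually referenced in the spec."""
    used = set()
    for expr in expressions:
        for node in ast.walk(ast.parse(expr, mode='eval')):
            if isinstance(node, ast.Name):
                used.add(node.id)
    return used & set(available_columns)

cols = columns_in_spec(['income > 50000', 'dist_to_cbd * 0.5'],
                       ['income', 'dist_to_cbd', 'age'])
# -> {'income', 'dist_to_cbd'}
```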
We probably need to batch (or chunk) the choosers in some way in order to manage our expectations about memory requirements. Let's look at a few cases for how the overall model could operate.
In addition, we don't want to require batching by HHs, since some models operate on persons, tours, trips, etc. If we add it to the MNL class, then we'll need to add it to a number of other models as well. If we want to restrict the design to batch-process HHs, then we could add it higher up in the stack. I think we need to think about this some more, and probably document the data tables, their sizes, etc. for each sub-model.
The CDAP model needs to be chunked as well
Chunking CDAP will be a bit trickier since it has to chunk at HH granularity. I had to take some time off last week for another project, but I am now working on this.
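A sketch of chunking at household granularity, so the persons of one household are never split across two chunks. The helper below is hypothetical and assumes the persons table is sorted by household id:

```python
def hh_chunks(persons, hh_id_col='household_id', chunk_size=500_000):
    """Yield person-table slices whose boundaries fall on household edges."""
    hh_ids = persons[hh_id_col].to_numpy()
    start = 0
    while start < len(persons):
        end = min(start + chunk_size, len(persons))
        # push the cut forward until the household id changes
        while end < len(persons) and hh_ids[end] == hh_ids[end - 1]:
            end += 1
        yield persons.iloc[start:end]
        start = end
```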
It looks like we also need to chunk the calls to interaction_simulate in vectorize_tour_scheduling.
I requisitioned an underutilized machine and have been working on setting up an ubuntu large scale test server with a terabyte of disk and 128 GB of RAM for running full scale tests. I am running the full scale model on it to see if I can replicate a somewhat confusing error Ben encountered running on the windows server. It may just be a fencepost error in my hh_chunking code. But in any case it will be nice to have full scale test servers running under both windows and linux as we move forward...
This is stuck on #116 right now.
Full run completed successfully in 9 hours and 25 minutes on the modelling server with a chunk size of 100K (2,732,722 households, 7,053,334 persons). The rewrite of interaction_simulate reduced the runtime of non_mandatory_tour_frequency from 404 minutes to 109 minutes.
Single-thread, right? |
Yes, single-thread.
That's encouraging. Here is a quick comparison with SANDAG 2012 performance. It is not an apples-to-apples comparison though, given the unimplemented features in the current ActivitySim (logsums etc.) as well as SANDAG-specific features and complexities. However, your test has a much larger population.
I'm going to close this issue now since we have a complete run. I'm sure we'll run into additional issues, but we'll make those individual issues instead. |
This task needs to be fleshed out more in terms of what is expected and what types of benchmarks will be assigned to it. Looking for @bstabler to provide more insight on this over the next couple of weeks.