
Large Scale Performance Test #61

Closed
danielsclint opened this issue Feb 29, 2016 · 44 comments

@danielsclint

This task needs to be fleshed out more in terms of what is expected and what types of benchmarks will be assigned to it. Looking for @bstabler to provide more insight on this over the next couple of weeks.

@danielsclint danielsclint changed the title Large Scale Performance Test (Task 2 Deliverable) Large Scale Performance Test Feb 29, 2016
@bstabler
Contributor

@DavidOry will post all the MTC TM1 matrices for RSG; I will convert to OMX

  • We will start by un-commenting a couple of Fletcher's skims API edits to use just one matrix
  • Will run tour mode choice and see what performance looks like
  • Creating a benchmark at this point
  • Won't refactor code at this point; focused more on review, design ideas, benchmarks, etc.
  • Travis doesn't support all the matrices, so this is something to consider
  • Once up and running, will work on some additional testing for the various skim queries:
  1. one OD pair
  2. one O to all Ds
  3. one O to a sample of Ds
  4. all Os to Ds
  5. by mode, time-of-day, etc.
  6. etc.
  • @DavidOry started on an accessibilities calculator for travel model one that is housed on his github page; this will be useful to review

@DavidOry

Also, I do not believe Fletcher fully built out the model spec files for tour mode choice. For reference, our UECs are here: https://github.com/MetropolitanTransportationCommission/travel-model-one/tree/master/model-files/model.

@danielsclint
Author

Please be aware of #32 as it relates to this issue. They are closely related, but I think that #32 will be directly handled in the Skim Handling task later this Spring.

@toliwaga
Contributor

Just to quantify the performance issues with 3D skims.

With the current hack that just uses the same in-memory matrix over and over:
Time to execute step 'mode_choice_simulate': 3.24 s

With the hack removed, and reading the matrices from disk:
Time to execute step 'mode_choice_simulate': 96.51 s

@danielsclint
Author

In the 3/25/2016 project meeting, @toliwaga indicated he made revisions to skim.py to test re-reading the distance matrix from disk as opposed to just getting it from memory.

Is @toliwaga's comment above the outcome of those tests, or is this something different? Does any code need to be checked in?

What indicates this task as complete?

@toliwaga
Contributor

toliwaga commented Apr 7, 2016

With the in-memory hack removed, and reading the matrices from disk, time to execute step 'mode_choice_simulate' went from 3.24 s to 96.51 s

But that is just the tip of the iceberg. There are a number of other cascading shortcuts that make the actual time for a full scale run a whole lot longer than that. There were more such shortcuts than I expected, and I have been exploring them, hoping to reach closure in time for tomorrow's meeting, but I am happy to update you on my progress so far. Let me just write them up...

@danielsclint
Author

Thanks, I would appreciate it. As someone that is not intimately familiar with the code, I think it will help me be more prepared for our discussion.

@toliwaga
Contributor

toliwaga commented Apr 7, 2016

In the current (master) version of activitysim, neither Skims3D nor mode_choice_simulate are fully implemented. The current version runs nice and fast, but there are a lot of compromises in the implementation.

Skims3D provides the framework for two different strategies of skim handling.

  • preload: the whole set of skims can be preloaded into a dict and passed to the Skims3D initializer
  • demand: the skims can be loaded on demand whenever an expression references them.

Neither of these approaches is fully implemented in practice.

I have been exploring the demand approach, which does something that the internal documentation refers to as "being sneaky to make it go faster". This really has several parts. They are:

# being sneaky to make it go faster

Although mode_choice_simulate references many different skims, with implicit key indexing of different time-of-day layers, the example only ever accesses one skim file (DISTANCE).

Furthermore, as things stand, this file is only read three times (once for each time period defined in settings.yaml) instead of hundreds of times, because the stacked skim data is cached and subsequent requests for (different) skims are 'satisfied' by returning the same old stacked DISTANCE skim from memory. Needless to say, this is a lot faster than re-reading it from disk.

The first thing I did was disable the caching, forcing the skim file (still DISTANCE) to be re-read from the omx file. That accounted for the increase in runtime from 3.24 seconds to 96.51 seconds that I mentioned above.

But it was still reading the same wrong data over and over again. So next, I built an omnibus skim omx file containing all the mtc_tm1 skims for all time periods (using keys like DRIVEALONEFREE__PM) and read them in on demand.

In other words, I made the following change:

Before:

def get_from_omx(self, key, v):
    return self.omx['DIST']

After:

def get_from_omx(self, key, v):
    return self.omx[key + '__' + v]

This increased the runtime to 320.71 seconds.

This is now at least using the correct skims, but there are a few other issues...

# FIXME this only runs eatout

mode_choice_simulate only runs _mode_choice_simulate for tours of type 'eatout', and runs them with dest TAZ 'workplace_taz' instead of the 'destination' TAZ that the destination.py step assigns to non-mandatory tours.

So I fixed mode.py to iterate over the various tour types using the correct dest TAZ depending on tour type (workplace_taz for work tours, school_taz for school tours, and destination for the non-mandatory tour types: 'shopping', 'othdiscr', 'eatout', 'social', 'othmaint', 'escort').

Unfortunately the processor fails for work and school type tours, which is what I have been looking into today.

Meanwhile, extending it to run all the non mandatory tours (skipping school and work) increased the runtime to 1864.96 seconds (31 minutes).

households_sample_size: 10000

All these times are for runs with households_sample_size set to 10000 in settings.yaml. Increasing it may make things slower, but hopefully not by much if pandas is pulling its weight.

# walk and bike rows commented out in tour_mode_choice.csv

Presumably they were commented out because the DISTWALK and DISTBIKE skims were lacking. Adding them back in will presumably lead to a minor performance hit.

I have a couple of ideas in mind for what to do next, that I would be happy to discuss at the meeting.

  1. Fix work and school tour modes. Then at least we will have a worst case runtime that we can start chipping away at.
  2. We could easily cut the runtime way down by not tearing down and reloading skims when we access the same skim twice in succession. This happens a lot and is a no-brainer, but probably won't be adequate.
  3. Put together a version that reads everything into memory up front. (there is about 1 G of skim data)
  4. Figure out why building the 3DSkims is so slow (it is NOT I/O time; that is a matter of seconds). Solving this would be a big win for either the preload or demand strategy.

But I am very open to any other ideas!
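Idea 2 above could be sketched as a one-entry cache; the names here are illustrative, not the actual skim.py API:

```python
class SkimCache:
    """One-entry cache: reuse the last matrix when the same
    (skim, time-of-day) key is requested twice in succession."""

    def __init__(self, loader):
        self._loader = loader   # function mapping key -> matrix
        self._key = None
        self._matrix = None
        self.loads = 0          # instrumentation: count of real loads

    def get(self, key):
        if key != self._key:
            self._matrix = self._loader(key)  # e.g. read from the omx file
            self._key = key
            self.loads += 1
        return self._matrix

# repeated requests for the same key hit the cache
cache = SkimCache(loader=lambda key: [[0.0]])
cache.get('DRIVEALONEFREE__PM')
cache.get('DRIVEALONEFREE__PM')  # same key, no reload
cache.get('DIST__AM')            # different key, real load
```

A dict keyed by skim name would generalize this toward idea 3 (reading everything into memory), at the corresponding memory cost.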

@toliwaga
Contributor

toliwaga commented Apr 7, 2016

Ran out of memory in the first step (school_location_simulate) on my 16 GB MacBook Pro trying to run with household sampling turned off...

@bstabler
Contributor

bstabler commented Apr 7, 2016

Let's try it on our San Diego server - sdmdlppw01, it has 160 GB of RAM

@toliwaga
Contributor

toliwaga commented Apr 8, 2016

Adding in work and school mode choice brings runtime to 2383 secs (39.7 minutes)

That's with 10K household sample size and walk and bike rows commented out in tour_mode_choice.csv

@bstabler
Contributor

Since reading OMX matrices from disk into Python objects is relatively slow, we may try first pickling all the skims so they are closer to Python-native format when re-read from disk.
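A minimal sketch of the pickling idea, assuming the skims have already been read into a dict of numpy arrays (the helper names are hypothetical):

```python
import os
import pickle
import tempfile

import numpy as np

def pickle_skims(skims, path):
    """Dump a dict of {skim_key: ndarray} to a single pickle file; numpy
    arrays unpickle as raw buffers, skipping OMX/HDF5 parsing on re-read."""
    with open(path, "wb") as f:
        pickle.dump(skims, f, protocol=pickle.HIGHEST_PROTOCOL)

def unpickle_skims(path):
    with open(path, "rb") as f:
        return pickle.load(f)

# round-trip a toy skim
path = os.path.join(tempfile.mkdtemp(), "skims.pkl")
skims = {"DIST": np.arange(9.0).reshape(3, 3)}
pickle_skims(skims, path)
restored = unpickle_skims(path)
```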

@toliwaga
Contributor

I have a version working (in the large-scale-test branch) that lazy-loads and retains the skims so they are only read once, which certainly improves performance. I am working on a version that forces the preload, since it is hard to benchmark and instrument the lazy-load version.

It looks like there is room for significant improvement in both performance and memory footprint.

@toliwaga
Contributor

Preloading all (full scale) mtc skims used by the current model (which just uses TODs 'AM', 'MD', 'PM') takes 1 min 31 secs and 8 GB of memory headroom, without running any models or loading any h5 (person/hh/landuse) data.

So obviously the raw skim load is not what is slowing things down. I suspect the Skim3D stacking is the primary culprit.

As things stand, _mode_choice_simulate is currently the only user of Skim3D class, but this presumably can and will change.

Skim3D.__init__ preloads (copying and stacking the matrices) any skims with tuple keys in the skims parameter, which is ordinarily expected (though not required) to be the skims injectable.

_mode_choice_simulate creates two Skim3D objects, which means it potentially consumes twice as much memory as needed when preloading skims passed in the skims parameter. On the other hand, when it loads skims lazily on demand from the omx file, it only loads the TOD layers that are actually used in the context of the dataframe it is applied to, which, depending on the chronological distribution of trips may be a subset of the available layers. Also, of course, lazy loading only loads the skims that are actually used in the eval expressions.

bstabler added a commit that referenced this issue Apr 21, 2016
@toliwaga
Contributor

Benchmarks (memory high water mark and runtime on a 16G Macbook Air) for running:

mtc_tm1_sf_test - 20 TAZ mini dataset for test suite runs on travis
mtc_tm1_sf - 190 TAZ example dataset

Showing times with skim preload vs. lazy load

##################################################

mtc_tm1_sf_test (20 TAZ skims, 1000 hh sample)

##################################################

####### households_sample_size = 1000
####### preload_3d_skims = True

max memory footprint = 0.83 GB
runtime = 1m46.133s
runtime = 1m49.275s
runtime = 1m50.426s
runtime = 1m47.434s

####### households_sample_size = 1000
####### preload_3d_skims = False

max memory footprint = 0.8 GB
runtime = 1m48.111s
runtime = 1m56.341s

##################################################

mtc_tm1_sf (190 TAZ skims, 1000 hh sample)

##################################################

####### households_sample_size = 1000
####### preload_3d_skims = True

max memory footprint = 0.79 GB
runtime = 1m57.849s

####### households_sample_size = 1000
####### preload_3d_skims = False

max memory footprint = 0.61 GB
runtime = 2m18.636s

@danielsclint
Author

This is great!

You said in an earlier comment that the memory footprint had been 8 GB, but this report indicates a max footprint of less than a gig. Is this the difference between the full skims and the subset, or something else?

@toliwaga
Contributor

Benchmarks (memory high water mark and runtime on a 16G Macbook Air) for running:

mtc_tm1 - full 1454 TAZ example dataset

Showing memory and runtimes with skim preload vs. lazy load running with HH sample size of 1000 vs. 10,000

The model-level runtime breakdown for the 10K runs suggests that some models are scaling better than others.

These ones seem to be potential problems:

workplace_location_simulate
tour_mode_choice_simulate
trip_mode_choice_simulate

I am running a 20K sample now to try to get some trend lines...

##################################################

mtc_tm1 (1454 TAZ skims)

##################################################

preload     1K HH            10K HH
------------------------------------
True        10.1 GB          9.6 GB
            14 min           29 min
False       1.62 GB          8.7 GB
            39 min           53 min

####### households_sample_size = 1000
####### preload_3d_skims = True

max memory footprint = 10.06 GB
runtime = 13m48.380s

####### households_sample_size = 1000
####### preload_3d_skims = False

max memory footprint = 1.62 GB
runtime = 39m28.545s

####### households_sample_size = 10000
####### preload_3d_skims = True

max memory footprint = 9.62 GB
runtime = 28m58.243s

'school_location_simulate': 18.18 s
'workplace_location_simulate': 543.69 s
'auto_ownership_simulate': 1.26 s
'cdap_simulate': 233.12 s
'mandatory_tour_frequency': 3.32 s
'mandatory_scheduling': 9.18 s
'non_mandatory_tour_frequency': 50.40 s
'destination_choice': 7.94 s
'non_mandatory_scheduling': 18.82 s
'patch_mandatory_tour_destination': 0.83 s
'tour_mode_choice_simulate': 376.99 s
'trip_mode_choice_simulate': 384.50 s

####### households_sample_size = 10000
####### preload_3d_skims = False

max memory footprint = 8.7 GB
runtime = 52m34.548s

'school_location_simulate': 18.26 s
'workplace_location_simulate': 527.39 s
'auto_ownership_simulate': 1.26 s
'cdap_simulate': 213.49 s
'mandatory_tour_frequency': 3.10 s
'mandatory_scheduling': 8.89 s
'non_mandatory_tour_frequency': 47.67 s
'destination_choice': 7.52 s
'non_mandatory_scheduling': 17.82 s
'patch_mandatory_tour_destination': 0.81 s
'tour_mode_choice_simulate': 1112.44 s
'trip_mode_choice_simulate': 1191.05 s

@toliwaga
Contributor

@danielsclint - yes, those are small skims; I just posted data for the larger skims. The low memory footprint for the 1K-household full 1454 TAZ run shows that lazy load does save memory compared to full skim preload (1.6 GB vs. 10 GB), but once the HH sample gets bigger, this advantage declines.

Also, while the full skim preload has a memory high water mark of 10GB, the actual size of the loaded skims is less than 3GB, so a lot of the apparent overhead of preloading is probably due to the garbage collector not running while we cycle through the skims.

@toliwaga
Contributor

toliwaga commented Apr 22, 2016

Interestingly, one of the major causes of the high initial runtime figures (from two weeks ago) was not actual skim handling, but the innocent-looking code fragments in the Skim3D __getitem__ function, which was being called for every skim reference for every tour type (i.e. thousands of times):

        if self.omx:
            # read off the disk on the fly
            self._build_single_3d_matrix_from_disk(key)
...
        if self.omx:
            # and now destroy
            self._tear_down_single_3d_matrix(key)

self.omx in the above code snippets is an omx File object. Since it doesn't have a __bool__ method, __len__ is called instead, which is surprisingly costly when there are a lot of skims in the file.

Changing the above code to say if self.omx is not None made a huge performance difference. It might be worth considering adding a __bool__ method to omx.File
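The trap is easy to reproduce with a stand-in class (not the real omx.File):

```python
class FakeOmxFile:
    """Stand-in for omx.File: defines __len__ but no __bool__, so the
    truth test `if f:` falls back to calling __len__."""
    len_calls = 0

    def __len__(self):
        FakeOmxFile.len_calls += 1
        return 300  # pretend the file holds 300 skim matrices

f = FakeOmxFile()
if f:                  # truth test implicitly calls f.__len__()
    pass
if f is not None:      # identity test; __len__ is never called
    pass
```

On a real omx.File holding many skims, that implicit __len__ is what made the innocent-looking `if self.omx:` so costly.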

@guyrousseau

... and those are skims presumably from an ABM integrated with static traffic assignment using conventional level-of-service skims. Over time, some of our ABMs will be integrated with dynamic traffic assignment using individual LOS through deep integration of individual trajectories. Just something to keep in mind as this project evolves.

@danielsclint
Author

Second on Guy's comment. We will be looking to decide in the Fall whether we integrate our DTA with CT-RAMP or, hopefully, ActivitySim.

@bstabler
Contributor

As we suspected, the full example (all zones (1450) and all HHs (2.7m)) appears to have gotten stuck in work location choice. I first set up and ran the 20 zone example on our 20 core 160GB San Diego Windows Server, and it ran within a couple of minutes. Next, I swapped in the full example inputs, and it ran through pre-loading of skims and school location choice in around 15 to 30 minutes, but has now been in work location choice for a couple of hours. The max memory usage during school location was ~125GB. It is probably stuck in the roundtrip_auto_time_to_work virtual column issue that @toliwaga mentioned on the call. I'm going to let it run over the weekend just for fun.

@toliwaga
Contributor

toliwaga commented Apr 26, 2016

I reopened #72 because it seems to be related to the "gotten stuck" @bstabler refers to above (or at least a similar phenomenon I encountered).

I get some very peculiar behavior at the end of workplace_location_simulate, after adding the workplace_taz column and calling add_dependent_columns:

    orca.add_column("persons", "workplace_taz", choices)
    add_dependent_columns("persons", "persons_workplace")

This ends up calling distance_to_work, which never returns. Maybe the apparently redundant workplace_taz column in persons.py was serving some arcane function that we didn't grok...

I don't fully understand what is happening yet, but I notice that distance_to_work depends on workplace_taz, and in fact it is not returning from that call, stuck in a reindex call trying to create a pandas Series.

def distance_to_work(persons, distance_skim):
    return pd.Series(distance_skim.get(persons.home_taz,
                                       persons.workplace_taz),
                     index=persons.index)
traceback.extract_stack()

 00 = {tuple} ('/Applications/PyCharm.app/Contents/helpers/pydev/pydevd.py', 2411, '<module>', "globals = debugger.run(setup['file'], None, None, is_module)")
 01 = {tuple} ('/Applications/PyCharm.app/Contents/helpers/pydev/pydevd.py', 1802, 'run', 'launch(file, globals, locals)  # execute the script')
 02 = {tuple} ('/Users/jeff.doyle/work/activitysim/sandbox/simulation.py', 76, '<module>', 'orca.run(["workplace_location_simulate"])')
 03 = {tuple} ('/Users/jeff.doyle/anaconda/envs/asim/lib/python2.7/site-packages/orca/orca.py', 1876, 'run', 'step()')
 04 = {tuple} ('/Users/jeff.doyle/anaconda/envs/asim/lib/python2.7/site-packages/orca/orca.py', 780, '__call__', 'return self._func(**kwargs)')
 05 = {tuple} ('/Users/jeff.doyle/work/activitysim/activitysim/defaults/models/workplace_location.py', 54, 'workplace_location_simulate', 'add_dependent_columns("persons", "persons_workplace")')
 06 = {tuple} ('/Users/jeff.doyle/work/activitysim/activitysim/defaults/models/util/misc.py', 11, 'add_dependent_columns', 'orca.add_column(base_dfname, col, tbl[col])')
 07 = {tuple} ('/Users/jeff.doyle/anaconda/envs/asim/lib/python2.7/site-packages/orca/orca.py', 284, '__getitem__', 'return self.get_column(key)')
 08 = {tuple} ('/Users/jeff.doyle/anaconda/envs/asim/lib/python2.7/site-packages/orca/orca.py', 275, 'get_column', 'column = extra_cols[column_name]()')
 09 = {tuple} ('/Users/jeff.doyle/anaconda/envs/asim/lib/python2.7/site-packages/orca/orca.py', 622, '__call__', 'col = self._func(**kwargs)')
 10 = {tuple} ('/Users/jeff.doyle/work/activitysim/activitysim/defaults/tables/persons.py', 265, 'distance_to_work', 'persons.workplace_taz),')
 11 = {tuple} ('/Users/jeff.doyle/anaconda/envs/asim/lib/python2.7/site-packages/pandas/core/generic.py', 1333, 'get', 'return self[key]')
 12 = {tuple} ('/Users/jeff.doyle/anaconda/envs/asim/lib/python2.7/site-packages/pandas/core/series.py', 601, '__getitem__', 'return self._get_with(key)')
 13 = {tuple} ('/Users/jeff.doyle/anaconda/envs/asim/lib/python2.7/site-packages/pandas/core/series.py', 633, '_get_with', 'return self.reindex(key)')
 14 = {tuple} ('/Users/jeff.doyle/anaconda/envs/asim/lib/python2.7/site-packages/pandas/core/series.py', 2344, 'reindex', 'return super(Series, self).reindex(index=index, **kwargs)')
 15 = {tuple} ('/Users/jeff.doyle/anaconda/envs/asim/lib/python2.7/site-packages/pandas/core/generic.py', 2226, 'reindex', 'fill_value, copy).__finalize__(self)')
 16 = {tuple} ('/Users/jeff.doyle/anaconda/envs/asim/lib/python2.7/site-packages/pandas/core/generic.py', 2239, '_reindex_axes', 'tolerance=tolerance, method=method)')
 17 = {tuple} ('/Users/jeff.doyle/anaconda/envs/asim/lib/python2.7/site-packages/pandas/indexes/base.py', 2259, 'reindex', 'indexer, missing = self.get_indexer_non_unique(target)')
 18 = {tuple} ('/Users/jeff.doyle/anaconda/envs/asim/lib/python2.7/site-packages/pandas/indexes/base.py', 2122, 'get_indexer_non_unique', 'indexer, missing = self._engine.get_indexer_non_unique(tgt_values)')
 19 = {tuple} ('/Users/jeff.doyle/anaconda/envs/asim/lib/python2.7/site-packages/numpy/core/fromnumeric.py', 1150, 'resize', 'if extra > 0:')

@toliwaga
Contributor

The hang was caused by the skims.py injectables distance_skim (and friends) attempting to return the Skim using the square-bracket getter (implemented by Skims.__getitem__), which is NOT symmetrical with Skims.__setitem__ and does NOT return a Skim, but instead calls lookup and returns a Series.

This caused some very squirrelly behavior in which pandas spent a lot of time needlessly reshaping the requested objects, sometimes indefinitely.

@toliwaga
Contributor

This was fixed by 6f3ce30, but it is still running out of memory in workplace_location_simulate:

Running step 'workplace_location_simulate'
Traceback (most recent call last):
  File "simulation.py", line 9, in <module>
    orca.run(["workplace_location_simulate"])
  File "C:\Anaconda2\lib\site-packages\orca\orca.py", line 1876, in run
    step()
  File "C:\Anaconda2\lib\site-packages\orca\orca.py", line 780, in __call__
    return self._func(**kwargs)
  File "e:\projects\asim\activitysim\activitysim\defaults\models\workplace_location.py", line 47, in workplace_location_simulate
    sample_size=50)
  File "e:\projects\asim\activitysim\activitysim\activitysim.py", line 279, in interaction_simulate
    df = interaction_dataset(choosers, alternatives, sample_size)
  File "e:\projects\asim\activitysim\activitysim\mnl.py", line 117, in interaction_dataset
    suffixes=('', '_r'))
  File "C:\Anaconda2\lib\site-packages\pandas\tools\merge.py", line 35, in merge

@toliwaga
Contributor

The out of memory in workplace_location_simulate is a design issue.

workplace_location_simulate calls interaction_simulate

choosers = persons_merged.to_frame()
alternatives = destination_size_terms.to_frame()

choices, _ = asim.interaction_simulate(choosers,
                                       alternatives,
                                       workplace_location_spec,
                                       skims=skims,
                                       locals_d=locals_d,
                                       sample_size=50)

interaction_simulate calls interaction_dataset to build a dataframe of choosers joined with a sample of alternative TAZ destinations, with all columns from each table.

Specifically, sample_size rows for each person in persons_merged, with 182 columns from persons_merged and 19 columns from destination_choice_size_terms.

This fully normalized table of persons_merged X alts_sample has 199 columns and is pretty big:

(7,000,000 persons_merged) * (50 alternatives) * (199 columns) * (8 bytes per columns)

7000000 * 50 * 199 * 8 / (1024*1024*1024) = 518.93 GB

No way the merge will succeed, even with 170 GB of RAM.
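The arithmetic above, checked in code:

```python
persons = 7_000_000      # persons_merged rows
sample_size = 50         # alternatives sampled per person
columns = 199            # merged columns
bytes_per_cell = 8       # float64

gb = persons * sample_size * columns * bytes_per_cell / (1024 ** 3)
print(round(gb, 2))  # 518.93
```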

@toliwaga
Contributor

toliwaga commented Apr 29, 2016

One way to handle this, at least in the near term, is to chunk the computation (say into tranches of 500K persons) so it fits in memory.

There is a lot to be said for retaining the simplicity of a fully normalized interaction_dataset.

On the other hand, this already (even if we chunk the computation to fit in memory) comes at the cost of using a random sample of 50 candidate destinations.
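The chunking idea might look roughly like this (a hypothetical helper, not the eventual ActivitySim implementation; a real version would sample alternatives per chooser and size the tranches from a memory target):

```python
import pandas as pd

def chunked_interaction(choosers, alternatives, sample_size, chunk_size, compute_fn):
    """Process choosers in tranches so the choosers-x-alternatives
    interaction table never holds more than chunk_size choosers at once."""
    results = []
    for start in range(0, len(choosers), chunk_size):
        chunk = choosers.iloc[start:start + chunk_size]
        alt_sample = alternatives.sample(n=sample_size, replace=True)
        # cross join via a constant key (works on older pandas too)
        interaction = (chunk.assign(_k=1)
                            .merge(alt_sample.assign(_k=1), on='_k')
                            .drop(columns='_k'))
        results.append(compute_fn(interaction))
    return pd.concat(results)

# toy example: 10 choosers in tranches of 4, 3 sampled alternatives each
choosers = pd.DataFrame({'income': range(10)})
alts = pd.DataFrame({'size_term': range(5)})
out = chunked_interaction(choosers, alts, sample_size=3, chunk_size=4,
                          compute_fn=lambda df: df)
```

Only one tranche of the interaction table is alive at a time, so peak memory scales with chunk_size rather than with the full 7m-person cross join.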

@wusun2

wusun2 commented Apr 29, 2016

Eventually, when we start distributing choices among multiple machines (a task in the future?), the population needs to be chunked anyway.

@fscottfoti
Contributor

I think you're hinting at this in your comment, but in UrbanSim we make the alternative compromise. We chunk people into discrete types (compromise = losing continuous variables), and we have 20 person "types" which we interact with the whole choice set, which might be 20k zones in this case.

(20 discretized persons) * (20,000 alternatives) * (199 columns) * (8 bytes per column)

20 * 20000 * 199 * 8 bytes = 636 MB, if I'm not mistaken. Which leaves you a lot of room to add detail in how you discretize the persons data.

This is my pet approach because it also allows you to view (map) the PDF for each of the discretized persons, which gives a ton of transparency to the model that I don't think comes through by looking at coefficients. Anyway, I know you know this, but just thought I'd voice the thought.

And of course you sample from the PDF for all the persons of each type which is trivially fast.

Also, is there any reason to use doubles (large choice sets and small probabilities?)? Seems like using floats could halve the memory use right there...
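The float suggestion is easy to quantify; for one full-size 1454-zone skim:

```python
import numpy as np

skim64 = np.zeros((1454, 1454), dtype=np.float64)  # as read from OMX
skim32 = skim64.astype(np.float32)                  # half the per-cell storage

print(skim64.nbytes)                    # 16912928
print(skim64.nbytes // skim32.nbytes)   # 2
```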

@fscottfoti
Contributor

I guess the O-D problem kills you here too, as you have 20 types x 20k origin zones to the 20k destination zones, and then you're almost exactly back where you started. Does that sound right?

@toliwaga
Contributor

@fscottfoti yes, I agree it is a nice strategy, but that the OD thing may mess it up.

I am wondering if it might be possible to tweak the algorithm so that it is not necessary to actually do this join, but to use some kind of indirection, maybe in an apply, instead.

@fscottfoti
Contributor

It's certainly appealing, but I don't think the apply approach will work - this gets back into just running Python loops and will probably kill the performance (which sounds bad already). But it's worth trying, to know what you're working with.

Are all 199 columns used in the utility calculation, or could you force the user to ask for just the subset they will need? (We also wrote something in UrbanSim that looks for column names in the spec strings and then only fetches those columns, which worked well.) I don't think this will get you all the way there, but it could help.
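The spec-scanning trick mentioned above might look something like this sketch (the regex scan is an assumption for illustration, not UrbanSim's actual code):

```python
import re

import pandas as pd

def columns_used(spec_expressions, table):
    """Collect identifiers appearing in the spec expression strings and
    keep only those that are actual columns of the table."""
    names = set()
    for expr in spec_expressions:
        names.update(re.findall(r"[A-Za-z_]\w*", expr))
    return [col for col in table.columns if col in names]

table = pd.DataFrame(columns=['income', 'auto_time', 'dist', 'unused_col'])
spec = ['income * 0.5', '(auto_time + dist) > 30']
cols = columns_used(spec, table)   # ['income', 'auto_time', 'dist']
```

Fetching only those columns before the interaction join shrinks the 199-column table to whatever the spec actually touches.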

@bstabler
Contributor

bstabler commented May 5, 2016

We probably need to batch (or chunk) the choosers in some way in order to manage our expectations about memory requirements. Let's look at a few cases for how the overall model could operate:

  1. run just a small sample of HHs, say 1000, or about 2500 persons, in which case 2500 work locations by 50 alternatives is not too much data
  2. run 2.7m HHs on one machine, in which case work location requires a massive data table
  3. run 2.7m HHs on one machine, but in separate batches in order to not use too much RAM. One key question then is where to create the batches:
  • within the MNL class
  • one step up in the work location model
  • further up in the overall run process
  • or even further up by creating separate simulation runs, each with a set of HHs for example
  4. run 2.7m HHs across a cluster of machines, in which case each machine is processing some fixed amount of choosers and all the models can be run within the RAM available

In addition, we don't want to require batching by HHs since some models operate on persons, tours, trips, etc. If we add it to the MNL class, then we'll need to add it to a number of other models as well. If we want to restrict the design to batch process HHs, then we could add it higher up in the stack. I think we need to think about this some more and probably document the data tables, their sizes, etc for each sub-model.

@bstabler
Contributor

bstabler commented May 8, 2016

The CDAP model needs to be chunked as well

Running step 'cdap_simulate'
Traceback (most recent call last):
  File "simulation.py", line 12, in <module>
    orca.run(["cdap_simulate"])
  File "C:\Anaconda2\lib\site-packages\orca\orca.py", line 1876, in run
    step()
  File "C:\Anaconda2\lib\site-packages\orca\orca.py", line 780, in __call__
    return self._func(**kwargs)
  File "e:\projects\asim\activitysim\activitysim\defaults\models\cdap.py", line 64, in cdap_simulate
    cdap_all_people)
  File "e:\projects\asim\activitysim\activitysim\cdap\cdap.py", line 430, in run_cdap
    hh_utils = initial_household_utilities(ind_utils, people, hh_id_col)
  File "e:\projects\asim\activitysim\activitysim\cdap\cdap.py", line 197, in initial_household_utilities
    tz.concat(itertools.product(range(len(alts)), repeat=hh_size)))
MemoryError

@toliwaga
Contributor

Chunking will be a bit trickier since it has to chunk at HH granularity. I had to take some time off last week for another project but am now working on this.
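Chunking at household granularity could be sketched like this (an illustrative helper, not the actual hh_chunking code):

```python
import pandas as pd

def hh_chunk_ids(persons, hh_id_col, chunk_size):
    """Yield batches of household ids whose combined person counts stay
    under chunk_size, never splitting a household across chunks."""
    hh_sizes = persons.groupby(hh_id_col).size()
    batch, count = [], 0
    for hh_id, n in hh_sizes.items():
        if batch and count + n > chunk_size:
            yield batch
            batch, count = [], 0
        batch.append(hh_id)
        count += n
    if batch:
        yield batch

# households of size 2, 3, 2, 1 batched under a 4-person limit
persons = pd.DataFrame({'hh_id': [1, 1, 2, 2, 2, 3, 3, 4]})
batches = list(hh_chunk_ids(persons, 'hh_id', chunk_size=4))  # [[1], [2], [3, 4]]
```

Because CDAP enumerates activity patterns jointly within a household, every person in a household must land in the same chunk.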

@toliwaga
Contributor

It looks like we also need to chunk the calls to interaction_simulate in vectorize_tour_scheduling.

@toliwaga
Contributor

I requisitioned an underutilized machine and have been working on setting up an Ubuntu large scale test server with a terabyte of disk and 128 GB of RAM for running full scale tests. I am running the full scale model on it to see if I can replicate a somewhat confusing error Ben encountered running on the Windows server. It may just be a fencepost error in my hh_chunking code. But in any case it will be nice to have full scale test servers running under both Windows and Linux as we move forward...

@bstabler
Contributor

bstabler commented Sep 9, 2016

this is stuck on #116 right now

@toliwaga
Contributor

toliwaga commented Oct 19, 2016

Full run completed successfully in 9 hours and 25 minutes on modelling server with chunk size of 100K. (2,732,722 households, 7,053,334 persons)

Rewrite of interaction_simulate reduced runtime of non_mandatory_tour_frequency from 404 minutes to 109 minutes.

                                 seconds       minutes
compute_accessibility                216           3.6
school_location_simulate            1020          17.0
workplace_location_simulate         2710          45.2
auto_ownership_simulate              177           3.0
cdap_simulate                       3190          53.2
mandatory_tour_frequency             497           8.3
mandatory_scheduling                1590          26.5
non_mandatory_tour_frequency        6563         109.4
destination_choice                  2411          40.2
non_mandatory_scheduling            3066          51.1
patch_mandatory_tour_destination     127           2.1
tour_mode_choice_simulate           6387         106.5
trip_mode_choice_simulate           5653          94.2
------------------------------------------------------
all models                         33967         566.1  (9 hours and 25 minutes)

@danielsclint
Author

danielsclint commented Oct 19, 2016

Single-thread, right?

@toliwaga
Contributor

Yes, Single-thread.

@wusun2

wusun2 commented Oct 19, 2016

That's encouraging. Here is a quick comparison with SANDAG 2012 performance:
Households: 1,086,628; Pop: 3,143,418; CT-RAMP runtime for ONE iteration of a 100% sample (yours is also one iteration?): 15 hrs and 6 mins.

This is not an apples-to-apples comparison though, given the unimplemented features in the current ActivitySim (logsums etc.) as well as SANDAG-specific features and complexities. However, your test has a much larger population.

@bstabler
Contributor

I'm going to close this issue now since we have a complete run. I'm sure we'll run into additional issues, but we'll make those individual issues instead.
