Large Scale Performance Test #61
@DavidOry will post all the MTC TM1 matrices for RSG; I will convert them to OMX.
Also, I do not believe Fletcher fully built out the model spec files for tour mode choice. For reference, our UECs are here: https://github.com/MetropolitanTransportationCommission/travel-model-one/tree/master/model-files/model
Just to quantify the performance issues with 3D skims: with the current hack that just reuses the same in-memory matrix over and over, versus with the hack removed and the matrices read from disk (timings in the follow-up comment below).
In the 3/25/2016 project meeting, @toliwaga indicated he made revisions to skim.py to test re-reading the distance matrix from disk as opposed to just getting it from memory. Is @toliwaga's comment above the outcome of those tests, or is this something different? Does any code need to be checked in? What indicates that this task is complete?
With the in-memory hack removed, and reading the matrices from disk, time to execute step 'mode_choice_simulate' went from 3.24 s to 96.51 s. But that is just the tip of the iceberg. There are a number of other cascading shortcuts that make the actual time for a full scale run a whole lot longer than that. There were more such shortcuts than I expected, and I have been exploring them, hoping to reach closure in time for tomorrow's meeting, but I am happy to update you on my progress so far. Let me just write them up...
Thanks, I would appreciate it. As someone who is not intimately familiar with the code, I think it will help me be more prepared for our discussion.
In the current (master) version of activitysim, neither Skims3D nor mode_choice_simulate is fully implemented. The current version runs nice and fast, but there are a lot of compromises in the implementation. Skims3D provides the framework for two different strategies of skim handling: preloading skims up front, or loading them on demand.
Neither approach is fully implemented in practice. I have been exploring the on-demand approach, which does something that the internal documentation refers to as "being sneaky to make it go faster". This really has several parts. They are:
Although mode_choice_simulate references many different skims, with implicit key indexing of different time of day layers, the example only ever accesses one skim file (DISTANCE). Furthermore, as things stand, this file is only read three times (once for each time period defined in settings.yaml) instead of hundreds of times, because the stacked skim data is cached and subsequent requests for (different) skims are 'satisfied' by returning the same old stacked DISTANCE skim from memory. Needless to say, this is a lot faster than re-reading it from disk. The first thing I did was disable the caching, forcing the skim file (still DISTANCE) to be re-read from the omx file. That accounted for the increase in runtime from 3.24 seconds to 96.51 seconds that I mentioned above. But it was still reading the same wrong data over and over again. So next, I built an omnibus skim omx file containing all the mtc_tm1 skims for all time periods (using keys like DRIVEALONEFREE__PM) and read them in on demand. In other words, I made the following change:
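The actual diff is not reproduced above. Purely as a hedged sketch of the behavior being described (reading each requested skim/TOD layer from the omnibus OMX file on demand, instead of handing back the cached DISTANCE matrix), with the file path and function name made up; only the key format comes from the comment above:

```python
import numpy as np
import openmatrix as omx

skim_file = omx.open_file('mtc_tm1_skims.omx', 'r')   # hypothetical path
_skim_cache = {}

def get_skim(name, tod=None):
    # keys in the omnibus file look like 'DRIVEALONEFREE__PM'
    key = '%s__%s' % (name, tod) if tod else name
    if key not in _skim_cache:
        # read the requested matrix from disk on first use, rather than
        # returning the same stacked DISTANCE skim for every request
        _skim_cache[key] = np.asarray(skim_file[key])
    return _skim_cache[key]
```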
This increased the runtime to 320.71 seconds. This is now at least using the correct skims, but there are a few other issues...
mode_choice_simulate only runs _mode_choice_simulate for tours of type 'eatout', and runs them with dest TAZ 'workplace_taz' instead of the 'destination' taz that the destination.py step assigns to non mandatory tours. So I fixed mode.py to iterate over the various tour types, using the correct dest taz depending on tour type: workplace_taz for work tours, school_taz for school tours, and destination for the non mandatory tour types ('shopping', 'othdiscr', 'eatout', 'social', 'othmaint', 'escort'). Unfortunately the processor fails for work and school type tours, which is what I have been looking into today. Meanwhile, extending it to run all the non mandatory tours (skipping school and work) increased the runtime to 1864.96 seconds (31 minutes).
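A minimal sketch of that fix, with illustrative names; the commented call stands in for the real _mode_choice_simulate invocation:

```python
# map tour type to the column holding its destination TAZ
DEST_TAZ_COLUMN = {'work': 'workplace_taz', 'school': 'school_taz'}
NON_MANDATORY_TYPES = ('shopping', 'othdiscr', 'eatout',
                       'social', 'othmaint', 'escort')

def dest_taz_column(tour_type):
    # non mandatory tours use the 'destination' column assigned by destination.py
    return DEST_TAZ_COLUMN.get(tour_type, 'destination')

for tour_type in ('work', 'school') + NON_MANDATORY_TYPES:
    dest_col = dest_taz_column(tour_type)
    # _mode_choice_simulate(tours[tours.tour_type == tour_type], dest_col, ...)
```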
All these times are for runs with households_sample_size set to 10000 in settings.yaml. Increasing it may make things slower, but hopefully not by much if pandas is pulling its weight.
Presumably the walk and bike rows were commented out because the DISTWALK and DISTBIKE skims were lacking. Adding them back in will presumably lead to a minor performance hit. I have a couple of ideas in mind for what to do next, which I would be happy to discuss at the meeting.
But I am very open to any other ideas!
Ran out of memory in the first step (school_location_simulate) on my 16 gig macbook pro trying to run with household sampling turned off...
Let's try it on our San Diego server - sdmdlppw01; it has 160 GB of RAM.
Adding in work and school mode choice brings runtime to 2383 secs (39.7 minutes). That's with 10K household sample size and walk and bike rows commented out in tour_mode_choice.csv.
Since reading OMX matrices from disk into Python objects is relatively slow, we may try to first pickle all the skims so they are closer to Python native format when being re-read from disk.
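A sketch of that idea, assuming the skims become plain numpy arrays once read out of the OMX file; numpy's own binary format re-reads much faster than parsing the HDF5-backed OMX tables. Paths and key names below are illustrative:

```python
import numpy as np
import openmatrix as omx

def pickle_skims(omx_path, npz_path):
    f = omx.open_file(omx_path, 'r')
    try:
        skims = {name: np.asarray(f[name]) for name in f.listMatrices()}
    finally:
        f.close()
    np.savez(npz_path, **skims)   # raw numpy binary, fast to re-read

# re-reading: arrays in the .npz archive are loaded lazily, per key
skims = np.load('skims.npz')
dist_am = skims['DISTANCE__AM']   # hypothetical key
```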
I have a version working (in the large-scale-test branch) that lazy-loads and retains the skims so they are only read in once, which certainly improves performance. I am working on a version that forces the preload, since it is hard to benchmark and instrument the lazy-load version. It looks like there is room for significant improvement in both performance and memory footprint.
Preloading all (full scale) mtc skims used by the current model (which just uses tods 'AM', 'MD', 'PM') takes 1 min 31 secs and 8 GB of memory headroom, without running any models or loading any h5 (person/hh/landuse) data. So obviously the raw skim load is not what is slowing things down. I suspect the Skims3D stacking is the primary culprit. As things stand, _mode_choice_simulate is currently the only user of the Skims3D class, but this presumably can and will change. Skims3D.__init__ preloads (copying and stacking the matrices) any skims with tuple keys in the skims parameter, which is apparently expected ordinarily (though not required) to be the skims injectable. _mode_choice_simulate creates two Skims3D objects, which means it potentially consumes twice as much memory as needed when preloading skims passed in the skims parameter. On the other hand, when it loads skims lazily on demand from the omx file, it only loads the TOD layers that are actually used in the context of the dataframe it is applied to, which, depending on the chronological distribution of trips, may be a subset of the available layers. Also, of course, lazy loading only loads the skims that are actually used in the eval expressions.
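A rough sketch of the kind of stacking being described, not the actual Skims3D implementation: per-TOD layers of one skim are copied into a single 3D array so an (orig, dest, tod) triple can be looked up in one vectorized shot. The layer data here is synthetic:

```python
import numpy as np

tods = ['AM', 'MD', 'PM']
n_zones = 1454
rng = np.random.default_rng(0)
# stand-in for the per-TOD matrices read from the omx file
layers = {'DRIVEALONEFREE__%s' % t: rng.random((n_zones, n_zones)) for t in tods}

# copy and stack the TOD layers into shape (zones, zones, tods)
stacked = np.dstack([layers['DRIVEALONEFREE__%s' % t] for t in tods])

# vectorized lookup for parallel arrays of origin, destination, and TOD index
orig_idx = np.array([10, 20, 30])
dest_idx = np.array([5, 6, 7])
tod_idx = np.array([0, 2, 1])     # AM, PM, MD
values = stacked[orig_idx, dest_idx, tod_idx]
```

The copying is why preloading costs memory: each Skims3D object holds its own stacked copies of the matrices, so creating two of them doubles the footprint.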
Benchmarks (memory high water mark and runtime on a 16G Macbook Air) for running mtc_tm1_sf_test, the 20 TAZ mini dataset for test suite runs on travis. Showing times with skim preload vs. lazy load:

mtc_tm1_sf_test (20 TAZ skims, 1000 hh sample)
- preload: households_sample_size = 1000, max memory footprint = 0.83 GB
- lazy load: households_sample_size = 1000, max memory footprint = 0.8 GB

mtc_tm1_sf (190 TAZ skims, 1000 hh sample)
- preload: households_sample_size = 1000, max memory footprint = 0.79 GB
- lazy load: households_sample_size = 1000, max memory footprint = 0.61 GB
This is great! You said in an earlier comment that the memory footprint had been 8 GB, but this report indicates a max footprint of less than a gig. Is this the difference between the full skims and the subset, or something else?
Benchmarks (memory high water mark and runtime on a 16G Macbook Air) for running mtc_tm1, the full 1454 TAZ example dataset. Showing memory and runtimes with skim preload vs. lazy load, running with HH sample size of 1000 vs. 10,000. The model level runtime breakdown for the 10K runs suggests that some models are scaling better than others; these seem to be potential problems: workplace_location_simulate. I am running a 20K sample now to try to get some trend lines...

mtc_tm1 (1454 TAZ skims)
- preload: households_sample_size = 1000, max memory footprint = 10.06 GB
- lazy load: households_sample_size = 1000, max memory footprint = 1.62 GB
- preload: households_sample_size = 10000, max memory footprint = 9.62 GB ('school_location_simulate': 18.18 s)
- lazy load: households_sample_size = 10000, max memory footprint = 8.7 GB ('school_location_simulate': 18.26 s)
@danielsclint - yes, those are small skims; I just posted data for larger skims. The low memory footprint for the 1K sample with the full 1454 TAZ skims shows that lazy load does save memory compared to full skim preload (1.6 GB vs. 10 GB), but once the HH sample gets bigger, this advantage declines. Also, while the full skim preload has a memory high water mark of 10 GB, the actual size of the loaded skims is less than 3 GB, so a lot of the apparent overhead of preloading is probably due to the garbage collector not running while we cycle through the skims.
Interestingly, one of the major causes of the high initial runtime figures (from two weeks ago) was not actual skim handling, but an innocent-looking code fragment in the Skims3D __getitem__ function, which was being called for every skim reference for every tour type (i.e. thousands of times). Changing that code eliminated most of the per-call overhead.
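The original before/after snippets are not reproduced above. Purely as a hypothetical illustration of the pattern described, repeated per-call work in a hot __getitem__ that can be cached; none of these names are the actual activitysim code:

```python
class Skims3DSketch:
    """Illustrative only -- not the actual Skims3D implementation."""

    def __init__(self, stacked_skims):
        self.stacked_skims = stacked_skims
        self._cache = {}

    def __getitem__(self, key):
        # Building the returned object fresh on every call is cheap once,
        # but ruinous when __getitem__ runs for every skim reference for
        # every tour type; cache it and reuse it instead.
        if key not in self._cache:
            self._cache[key] = self.stacked_skims[key]
        return self._cache[key]
```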
... and those are skims presumably from an ABM integrated with static traffic assignment using conventional level-of-service skims. Over time, some of our ABMs will be integrated with dynamic traffic assignment using individual LOS through deep integration of individual trajectories. Just something to keep in mind as this project evolves. |
Second on Guy's comment. We will be looking to decide in the Fall whether we integrate our DTA with CT-RAMP or, hopefully, ActivitySim. |
As we suspected, the full example (all zones (1450) and all HHs (2.7m)) appears to have gotten stuck in work location choice. I first set up and ran the 20 zone example on our 20 core 160 GB San Diego Windows Server, and it ran within a couple of minutes. Next, I swapped in the full example inputs, and it ran through pre-loading of skims and school location choice in around 15 to 30 minutes, but has now been in work location choice for a couple of hours. The max memory usage during school location was ~125 GB. It is probably stuck in the roundtrip_auto_time_to_work virtual column issue that @toliwaga mentioned on the call. I'm going to let it run over the weekend just for fun.
I reopened #72 because it seems to be related to the "gotten stuck" @bstabler refers to above (or at least a similar phenomenon I encountered). I get some very peculiar behavior at the end of workplace_location_simulate, after adding the workplace_taz column and calling add_dependent_columns.
This ends up calling distance_to_work, which never returns. Maybe the apparently redundant workplace_taz column in persons.py was serving some arcane function that we didn't grok... I don't fully understand what is happening yet, but I notice that distance_to_work depends on workplace_taz, and in fact it is not returning from that call, stuck in a reindex call trying to create a pandas series.
The hang was caused by the skims.py injectables, which caused some very squirrelly behavior in which pandas spent a lot of time needlessly reshaping the requested objects, sometimes indefinitely.
This was fixed by 6f3ce30, but we are still running out of memory in workplace_location_simulate.
The out-of-memory error in workplace_location_simulate is a design issue. workplace_location_simulate calls interaction_simulate, which calls interaction_dataset to build a dataframe of choosers joined with a sample of alternative TAZ destinations, with all columns from each table: specifically, sample_size rows for each person in persons, merged with 182 columns from persons_merged and 19 columns from destination_choice_size_terms. This fully normalized table of persons_merged X alts_sample has 199 columns and is pretty big: (7,000,000 persons_merged) * (50 alternatives) * (199 columns) * (8 bytes per column) = 7000000 * 50 * 199 * 8 / (1024*1024*1024) ≈ 518.93 GB. There is no way the merge will succeed, even with 170 GB of RAM.
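Spelling that arithmetic out:

```python
rows = 7_000_000 * 50           # persons_merged x 50 sampled alternatives each
cells = rows * 199              # 199 columns in the merged table
size_gb = cells * 8 / 1024**3   # 8 bytes per float64 cell
print(round(size_gb, 2))        # 518.93
```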
One way to handle this, at least in the near term, is to chunk the computation (say into tranches of 500K persons) so it fits in memory. There is a lot to be said for retaining the simplicity of a fully normalized interaction_dataset. On the other hand, this already (even if we chunk the computation to fit in memory) comes at the cost of using a random sample of 50 candidate destinations.
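A minimal sketch of that chunking, assuming a user-supplied utility function over the merged chooser/alternative table. All names are illustrative (this is not the activitysim API), and the argmax stands in for a real MNL probability draw:

```python
import numpy as np
import pandas as pd

def chunked_choices(choosers, alternatives, utility_fn,
                    sample_size=50, chunk_size=500_000, seed=0):
    """Run chooser-x-sampled-alternatives choices one tranche at a time."""
    rng = np.random.default_rng(seed)
    out = []
    for start in range(0, len(choosers), chunk_size):
        chunk = choosers.iloc[start:start + chunk_size]
        # sample alternative rows independently for each chooser in the chunk
        alt_pos = rng.integers(0, len(alternatives), (len(chunk), sample_size))
        interaction = alternatives.iloc[alt_pos.ravel()].reset_index(drop=True)
        interaction['chooser_id'] = np.repeat(chunk.index.values, sample_size)
        utils = np.asarray(utility_fn(interaction)).reshape(len(chunk), sample_size)
        # placeholder: take the best sampled alternative instead of an MNL draw
        best = alt_pos[np.arange(len(chunk)), utils.argmax(axis=1)]
        out.append(pd.Series(alternatives.index.values[best], index=chunk.index))
    return pd.concat(out)
```

Only one tranche's interaction table is alive at a time, so peak memory scales with chunk_size rather than with the full population.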
Eventually, when we start distributing choices among multiple machines (a future task?), the population will need to be chunked anyway.
I think you're hinting at this in your comment, but in UrbanSim we make the alternative compromise: we chunk people into discrete types (the compromise being the loss of continuous variables). Say we have 20 person "types" that we interact with the whole choice set, which might be 20k zones in this case: (20 discretized person types) * (20,000 alternatives) * (199 columns) * (8 bytes per column) = 20 * 20000 * 199 * 8 bytes ≈ 636 MB, if I'm not mistaken. That leaves you a lot of room to add detail in how you discretize the persons data. This is my pet approach because it also lets you view (map) the PDF for each of the discretized person types, which gives a ton of transparency to the model that I don't think comes through by looking at coefficients. Anyway, I know you know this, but I just thought I'd voice the thought. And of course you sample from the PDF for all the persons of each type, which is trivially fast. Also, is there any reason to use doubles (large choice sets and small probabilities?)? It seems like using floats might halve the memory use right there...
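A minimal sketch of that discretize-then-sample approach, with illustrative names; zone_utilities stands in for whatever computes one utility vector per person type:

```python
import numpy as np
import pandas as pd

def choices_by_type(persons, type_cols, zone_utilities, seed=0):
    """Draw a destination per person from one PDF per discretized type."""
    rng = np.random.default_rng(seed)
    pieces = []
    for type_key, group in persons.groupby(type_cols):
        u = np.asarray(zone_utilities(type_key), dtype=float)  # one utility per zone
        p = np.exp(u - u.max())
        p /= p.sum()                                           # MNL PDF over all zones
        draws = rng.choice(len(p), size=len(group), p=p)       # trivially fast sampling
        pieces.append(pd.Series(draws, index=group.index))
    return pd.concat(pieces).reindex(persons.index)
```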
I guess the O-D problem kills you here too, as you have 20 types x 20k origin zones to the 20k destination zones, and then you're almost exactly back where you started. Does that sound right?
@fscottfoti yes, I agree it is a nice strategy, but the OD thing may mess it up. I am wondering if it might be possible to tweak the algorithm so that it is not necessary to actually do this join, but to use some kind of indirection instead, maybe in an apply.
It's certainly appealing. Are all 199 columns used in the utility calculation, or maybe you can force the user to ask for the subset that they will need? (We also wrote something in UrbanSim that looks for column names in the spec strings and then only fetches those columns, which worked well.) I don't think this will get you all the way there, but it could help.
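A minimal sketch of that column-scanning trick; the real UrbanSim helper is more involved, and this version only handles bare identifiers in the expressions:

```python
import ast

def columns_in_spec(expressions, available_columns):
    """Return only the available columns actually referenced in the spec."""
    used = set()
    for expr in expressions:
        for node in ast.walk(ast.parse(expr, mode='eval')):
            if isinstance(node, ast.Name):
                used.add(node.id)
    return used & set(available_columns)

cols = columns_in_spec(['income > 50000', 'dist_to_cbd * 0.5'],
                       ['income', 'dist_to_cbd', 'age'])
# -> {'income', 'dist_to_cbd'}
```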
We probably need to batch (or chunk) the choosers in some way in order to manage our expectations about memory requirements. Let's look at a few cases for how the overall model could operate.
In addition, we don't want to require batching by HHs, since some models operate on persons, tours, trips, etc. If we add it to the MNL class, then we'll need to add it to a number of other models as well. If we want to restrict the design to batch-process HHs, then we could add it higher up in the stack. I think we need to think about this some more, and probably document the data tables, their sizes, etc. for each sub-model.
The CDAP model needs to be chunked as well
Chunking CDAP will be a bit trickier since it has to chunk at HH granularity. I had to take some time off last week for another project, but I am now working on this.
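A sketch of chunking at household granularity, so the persons of one household are never split across two chunks. The helper below is hypothetical and assumes the persons table is sorted by household id:

```python
def hh_chunks(persons, hh_id_col='household_id', chunk_size=500_000):
    """Yield person-table slices whose boundaries fall on household edges."""
    hh_ids = persons[hh_id_col].to_numpy()
    start = 0
    while start < len(persons):
        end = min(start + chunk_size, len(persons))
        # push the cut forward until the household id changes
        while end < len(persons) and hh_ids[end] == hh_ids[end - 1]:
            end += 1
        yield persons.iloc[start:end]
        start = end
```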
It looks like we also need to chunk the calls to interaction_simulate in vectorize_tour_scheduling.
I requisitioned an underutilized machine and have been working on setting up an ubuntu large scale test server with a terabyte of disk and 128 GB of RAM for running full scale tests. I am running the full scale model on it to see if I can replicate a somewhat confusing error Ben encountered running on the windows server. It may just be a fencepost error in my hh_chunking code. But in any case it will be nice to have full scale test servers running under both windows and linux as we move forward...
This is stuck on #116 right now.
Full run completed successfully in 9 hours and 25 minutes on the modelling server with a chunk size of 100K (2,732,722 households, 7,053,334 persons). The rewrite of interaction_simulate reduced the runtime of non_mandatory_tour_frequency from 404 minutes to 109 minutes.
Single-thread, right? |
Yes, single-thread.
That's encouraging. Here is a quick comparison with SANDAG 2012 performance. It is not an apples-to-apples comparison though, given the unimplemented features in the current ActivitySim (logsums etc.) as well as SANDAG-specific features and complexities. However, your test has a much larger population.
I'm going to close this issue now since we have a complete run. I'm sure we'll run into additional issues, but we'll make those individual issues instead. |
This task needs to be fleshed out more in terms of what is expected and what types of benchmarks will be assigned to it. Looking for @bstabler to provide more insight on this over the next couple of weeks.