Project Meeting 2020.10.27

Technical Call

  • Continue discussion on TVPB implementation
    • Bill and Joel talked; the memory needs for TVPB are vectors for maz-tap and matrices for tap-tap, so the memory implications are less than discussed last week
    • A primary motivation for TVPB is to use the more refined spatial system for better modeling of non-motorized travel
    • MTC has a liberal maz-tap ratio that has not really been optimized on the network side
    • So the maximum walk of 3 miles is generous, but it is trimmed to 1.2 miles in CT-RAMP
    • MTC has the same TVPB code as SANDAG
    • Important to include the tapLines file, which lists the transit lines served by each TAP, in order to prune the maz-tap possibilities list; TAPs farther away that add no new service are dropped from consideration
    • The TM2 transit network is quite disaggregate since it was built from GTFS; it therefore has route variations (skipped stops), so many routes are retained
    • The transit routes are currently being rebuilt to make them more planning-like (i.e., abstract)
    • For efficiency, maz-tap and tap-tap utilities are calculated just once, on demand (and could be pre-calculated)
    • The code then loops across possible access and egress tap pairs, adds the already-calculated path utility components, ranks the tap pairs, and selects the best N
    • If transit is selected, a choice is then made from among the best N
    • It uses generic utilities for ranking and then re-calculates person-specific utilities in mode choice for just the N best (see the sketch below)
    • We might have a formal write-up on the design from when we spec'd this out with Dave
    • Follow-up: here are the TM2 design papers; I don't see a really useful doc
    • In terms of pruning the possible paths, the tapLines idea and also skipping tap-tap pairs with IVT==0 can be done
    • The pre-exponentiation of utilities was done in the old SANDAG version but not in the current versions for TM2, SANDAG, and CMAP, since we introduced the person-specific calculations
    • If we pre-define market segments (which, by the way, we have done for asim) then we could pre-exponentiate
    • What we need is a speedy ranking procedure, good logic to skip non-relevant tap pairs, and pre-calculating utility components
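    • A minimal sketch of those pieces (skipping non-relevant tap pairs, summing pre-calculated utility components, and ranking to keep the best N), using hypothetical names and plain dict inputs rather than the actual ActivitySim data structures:
```python
def best_n_tap_pairs(access_utils, egress_utils, tap_tap_utils, tap_tap_ivt, n=3):
    """Rank candidate (access TAP, egress TAP) pairs and keep the best n.

    access_utils:  {tap: utility} for getting from the origin MAZ to each nearby TAP
    egress_utils:  {tap: utility} for getting from each nearby TAP to the destination MAZ
    tap_tap_utils: {(atap, btap): utility} generic tap-to-tap utilities, calculated once
    tap_tap_ivt:   {(atap, btap): in-vehicle time} used to skip pairs with no service
    """
    candidates = []
    for atap, access_util in access_utils.items():
        for btap, egress_util in egress_utils.items():
            # skip non-relevant tap pairs, e.g. IVT == 0 means no transit service
            if tap_tap_ivt.get((atap, btap), 0) <= 0:
                continue
            # total path utility is the sum of the already-calculated components
            total = access_util + tap_tap_utils[(atap, btap)] + egress_util
            candidates.append((atap, btap, total))

    # rank the tap pairs by generic utility and keep only the best n;
    # person-specific utilities are then recalculated in mode choice for just these n
    candidates.sort(key=lambda c: c[2], reverse=True)
    return candidates[:n]
```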
    • Doyle's understanding of the problem is in line with Joel's
    • We have implemented the tapLines functionality
    • There is no current max walk distance, but we could add this as a setting (say 1.2 miles, like Marin)
    • We're not currently caching maz-tap calculations since they are very small and already super fast
    • But if the maz-tap calculations get more complex, then we may want to cache
    • tap-tap utility is being computed on demand and saved to a cache for future use
    • The asim code works in chunks, so we need to de-duplicate calculations within chunks; this is now working
    • You can also retain your cache for a subsequent run if desired
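    • A minimal sketch of that compute-on-demand / cache pattern with feather, using hypothetical column names (atap, btap, util) and a stand-in compute function rather than the actual implementation:
```python
import os
import pandas as pd

def get_tap_tap_utils(tap_pairs, compute_tap_tap_utils, cache_path="tap_tap_cache.feather"):
    """Return tap-tap utilities, computing only the pairs not already cached."""
    if os.path.exists(cache_path):
        cache = pd.read_feather(cache_path)
    else:
        cache = pd.DataFrame(columns=["atap", "btap", "util"])

    cached_pairs = set(zip(cache["atap"], cache["btap"]))
    missing = [pair for pair in tap_pairs if pair not in cached_pairs]

    if missing:
        # compute_tap_tap_utils stands in for evaluating the tap-tap utility expressions
        new_rows = compute_tap_tap_utils(missing)  # DataFrame with atap, btap, util columns
        cache = pd.concat([cache, new_rows], ignore_index=True)
        # persist so a subsequent run (or another process) can reuse the results
        cache.to_feather(cache_path)

    return cache
```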
    • We're not doing person specific utilities, just using pre-defined market (demographic) segments, as spec'd
    • We've not implemented the optimization of skipping tap pairs with no IVT
    • CT-RAMP had a UEC feature where it skipped alternatives 2+ if an expression returned NA for alternative 1 and the expression applied to all alternatives
    • Maybe we add a tap-tap utility filter expression file; like a constraint matrix in EMME
    • This would be a good generic improvement that applies to all activitysim expression solving
    • TM2 has a pre-processor to turn off duplicate tap pairs across skim sets as well
    • Testing on both Marin and SF county since they have different maz-tap ratios
    • Currently implemented optimizations with runtimes
      • Test example - 18 minutes
      • plus Remove redundant calcs within chunks - 7 minutes
      • plus tap-tap caching - 4 minutes
      • plus tap line pruning - 45 seconds
      • All together - 23 seconds
    • Arrow/feather being used for the caching
    • Saves a lot of memory and runs super fast
    • The question now is how to implement dynamic and growing cache across multiprocesses
    • Key concerns are memory requirements, performance, and synchronization across processes, plus the need to avoid blocking
    • Tables can be stored memory-mapped on disk to free up RAM
    • May work for other shared data - skims and shadow prices
    • Basically use arrow for shared memory objects
    • Arrow will likely be the replacement for the pandas backend
    • pandas uses numpy as a backend today
    • Here's the key article from the pandas creator that spawned arrow; thanks Stefan
    • arrow includes things like native support for null values, better support for columns of different types, etc.; it's basically a better pandas backend
    • arrow is in-memory like pandas, for tables, but without the helper functions
    • feather is the file format; it is super smart and uses memory mapping
    • arrow stores 1D arrays, so to wrap them with numpy you reshape on demand
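    • A minimal sketch of the memory-mapping and reshape-on-demand idea with pyarrow, using a made-up one-column skim layout rather than Jeff's prototype:
```python
import numpy as np
import pyarrow as pa
import pyarrow.feather as feather

zones = 100
skim = np.random.rand(zones * zones)  # a zone-to-zone skim flattened to a 1D column

# feather must be uncompressed to be memory-mapped, so the file on disk
# is the same size as the data in memory
feather.write_feather(pa.table({"DIST": skim}), "skims.feather", compression="uncompressed")

# re-open memory-mapped: the OS pages data in on demand instead of holding
# a private copy in RAM, and multiple processes can share the same pages
table = feather.read_table("skims.feather", memory_map=True)

# arrow columns are 1D, so wrap with numpy and reshape on demand
dist = table.column("DIST").to_numpy().reshape(zones, zones)
```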
    • This would add new dependencies to activitysim, which we need to be mindful of
    • Next steps
      • ct-ramp behavior has been replicated
      • figure out how to multiprocess and share / update data
      • maybe arrow/feather
      • maybe replace string operations with factors for more efficient data storage in numpy
        • factors are not supported in pandas HDF5 storage, so we would currently need to wrap I/O
        • need to know the universe of factors when creating them
        • factors have better support in arrow
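        • A minimal sketch of the factor idea with pandas categoricals and arrow dictionary encoding, using made-up mode labels:
```python
import pandas as pd
import pyarrow as pa

# the full universe of factor levels has to be known when the categorical is created
modes = ["WALK", "DRIVE", "WALK_TRANSIT", "DRIVE_TRANSIT"]
trip_mode = pd.Categorical(["WALK", "DRIVE", "WALK", "WALK_TRANSIT"], categories=modes)

# stored as small integer codes plus one lookup table instead of repeated strings
print(trip_mode.codes)        # [0 1 0 2]

# arrow has first-class dictionary (factor) support
arr = pa.array(trip_mode)     # pyarrow DictionaryArray
```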
    • We're at the point where we're trying the few possible good ideas based on the abstract architecture design
    • Jeff Newman says using arrow for skims really works
    • He treated them as a column and reshaped them on I/O
    • Can't compress on disk, so files are the same size as in memory
    • Here's Newman's prototype; thanks Jeff
    • Going with an on-demand approach since geographic organization doesn't really work
    • TM2 disaggregate accessibilities will eventually use this code as well
    • Basic idea - create a small carefully controlled synpop that covers the markets, run the models to get the destination choice logsums, and then use these instead of the aggregate accessibilities
    • This is beyond this exercise, but it's a good idea since it means consistent mode choice models for accessibility and actual mode choice; it is planned for SEMCOG
    • Jeff will soon share an example with me so I can start comparing results to Marin TM2 while Jeff continues with performance tuning
  • Discuss CDAP larch integration progress
    • EDB larch reader working and notebook drafted
    • Doyle to update the cdap coefficient files and code so we have named coefficients as opposed to just values
    • Do this after TVPB is in a good place
    • Will do our best to transform duplicate values into one coefficient so estimation is more stable
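    • A minimal sketch of that de-duplication, assuming a simple coefficient_name/value layout (the actual file format is up to Doyle's update):
```python
import pandas as pd

# made-up coefficient file contents; two coefficients happen to share the same value
coefs = pd.DataFrame({
    "coefficient_name": ["coef_escort_asc", "coef_shop_asc", "coef_maint_asc"],
    "value": [-1.25, -0.75, -1.25],
})

# point every coefficient that shares a value at one canonical name, so estimation
# fits a single parameter instead of several identical ones
coefs["coefficient_name"] = coefs.groupby("value")["coefficient_name"].transform("first")
print(coefs)
```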
    • Joel shared an example of the CDAP model since it's complicated; thanks Joel for sharing the slides via email
    • Still need to write out the updated coefficient file; waiting on Doyle to update the format and then will implement
    • Now turn to the non-mandatory tour frequency model, which is the only model that implements the interaction_simulate EDB
  • Discuss ARC progress, questions, etc.
    • Everything is stood up, including the new trip scheduling choice submodel, trip departure choice submodel, and CBD parking location submodel
    • ARC model is running from start to finish!
    • The trip departure choice model is very slow at this point since it builds many alternatives; still working on performance
    • Need to create test cases for all three models for contribution
    • Have code and docs done
    • PSRC's RAM issues were actually chunk size related
    • ARC is running slower than expected; maybe due to chunk size?
    • The adaptive chunker should help here; it's in the multi-zone branch
  • Joel to join next week to discuss telecommuting