SEMCOG got stuck running the example according to the website instructions
We'll pull and release a new version today
Probably better not to lock down dependency versions, but to let the package take advantage of dependency updates and fix issues as they arise
Pre-computing / caching to support the TVPB
Jeff making progress
He's working on understanding the tradeoffs of pre-computing versus on-demand
He implemented skipping re-calculation of duplicate tap-to-tap utilities, which sped up the Marin example by 4x
He also implemented caching of tap-to-tap utilities, but this was less advantageous
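The de-duplication idea above can be sketched as follows — a minimal illustration, not the actual implementation; `compute_utility` and the pair layout are hypothetical:

```python
import numpy as np

def tap_tap_utilities(tap_pairs, compute_utility):
    """Compute utilities for an array of (btap, atap) pairs, evaluating
    each unique pair only once (a sketch of the de-duplication idea)."""
    pairs = np.asarray(tap_pairs)
    # find the unique rows, plus an inverse index that maps every
    # requested row back to its unique representative
    unique_pairs, inverse = np.unique(pairs, axis=0, return_inverse=True)
    # evaluate the (expensive) utility only for unique pairs
    unique_utils = np.array([compute_utility(b, a) for b, a in unique_pairs])
    # broadcast the unique results back out to the full request
    return unique_utils[inverse.ravel()]
```

Since many omaz/dmaz pairs share the same candidate tap pairs, evaluating only the unique ones can cut the work substantially, consistent with the 4x speedup observed on Marin.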
If we can, we may want to cache the n-best-paths list for each (omaz, dmaz, tod, demographic_segment) combination
Maybe we cache it using a fast and multiprocess-friendly technology such as arrow/feather
And then we either pre-compute it or possibly update it on demand
For a full sample, pre-computing might be better, but for a 100-HH sample, on-demand might be better
It depends on how sparse the data is
Discussion
I tried my best to explain things, but I think Doyle needs to explain next time
What's the dimensionality of the problem?
Marin TM2: 6000 mazs and 6200 taps; the average TAP-to-MAZ ratio is 114 for walk access and 7 for drive. Note that this does not include the tap_serves_new_lines function (aka tapLines), which trims MAZ-to-TAP pairs when farther-away TAPs do not serve any new lines. If we crop to the 1.2 miles used by the tap_serves_new_lines function, we get 63 taps per maz.
Marin has a collapsed set of MTC TM2 mazs, which is 30k mazs
We think it makes sense to pre-compute the path components, but we're not sure about the N-best tap pairs since it's very big and sparse
Pre-computing seems like a reasonable, understandable, and simple solution: just compute the components (in parallel by omaz), save them, and look them up later. It may not be completely optimal, but it might be easier for code maintenance and developer use than something slightly better but more complex
Does pre-computing create too big a file? How sparse is the data set, and does that make the tradeoff not worth it?
This depends a bit on the settings we spec'd that are consistent with TM2:
max_paths_across_tap_sets: 3, which is the total number of N-best tap pairs to keep for each (omaz, dmaz, tod, demographic_segment)
max_paths_per_tap_set: 1, which is the number of N-best tap pairs to keep within each skim set (premium, local, etc.)
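The two settings compose like this — a sketch with made-up candidate paths and column names, not the actual path-builder code:

```python
import pandas as pd

# hypothetical candidate paths for one (omaz, dmaz, tod, segment)
candidates = pd.DataFrame({
    "tap_set": ["local", "local", "premium", "premium", "ferry"],
    "btap":    [101,     102,     103,       104,       105],
    "atap":    [201,     202,     203,       204,       205],
    "utility": [-2.0,    -1.5,    -0.5,      -1.0,      -3.0],
})

max_paths_per_tap_set = 1
max_paths_across_tap_sets = 3

# first keep the single best path within each skim set...
best_per_set = (candidates.sort_values("utility", ascending=False)
                          .groupby("tap_set", sort=False)
                          .head(max_paths_per_tap_set))

# ...then keep the N best of those across all sets
n_best = best_per_set.nlargest(max_paths_across_tap_sets, "utility")
```

With these TM2-consistent values, at most one path per skim set survives, and at most three paths in total per (omaz, dmaz, tod, demographic_segment).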
Marin TM2 has 6000 mazs * 63 taps * 63 taps * 6000 mazs * 5 time periods * 3 demographic segments = 2,143,260,000,000 (2 trillion) potential paths, but not all are evaluated, since many are not needed and some tap-tap pairs are not available (in all time periods, etc.)
This is a big number so we want to implement a solution that considers tradeoffs of runtime, ram, disk space, behavioral design, code maintenance, developer burden, etc.
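The 2 trillion figure is just the product of the dimensions above:

```python
# back-of-the-envelope count of potential Marin TM2 path evaluations
mazs = 6_000          # origin mazs (and again for destination mazs)
taps_per_maz = 63     # taps within the 1.2-mile tapLines crop
time_periods = 5
segments = 3

potential_paths = (mazs * taps_per_maz) * (taps_per_maz * mazs) \
    * time_periods * segments
print(f"{potential_paths:,}")  # 2,143,260,000,000
```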
Jeff to give us an update next week
Profiling of memory usage:
MTC skims: 6.7 GB in memory; 826 skims for 5 time periods * 1475 zones
SEMCOG skims: 47 GB; 1480 skims for 5 time periods * 2900 zones
@Stefan to add PSRC numbers - would be 60 GB for 870 skims for 12 time periods * 3900 zones
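The MTC and SEMCOG figures are roughly what you'd expect for stacks of square float32 zone-to-zone matrices — a rough check, assuming 4-byte cells and GiB-style reporting:

```python
def skim_gib(num_skims, zones, bytes_per_cell=4):
    """Approximate in-memory size (GiB) of a stack of square
    zone-to-zone skim matrices, assuming float32 (4-byte) cells."""
    return num_skims * zones * zones * bytes_per_cell / 2**30

print(round(skim_gib(826, 1475), 1))   # ~6.7 for MTC
print(round(skim_gib(1480, 2900), 1))  # ~46.4 for SEMCOG (reported 47)
```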
Multiprocessing creates lots of tables when using chunksize, so this uses a lot of memory as well
Next time discuss stats on pipeline table sizes
Discuss estimation feature completion progress and #354 next time