Project Meeting 2023.07.18 - ActivitySim/activitysim GitHub Wiki

Agenda

  • Data Type Update (WSP)
  • Status Update on Technical Tasks
    • Unit Testing: Replace Orca (CS)
    • New Data Pipeline File Format (CS) -> done
    • Configuration Documentation / Pydantic implementation (CS)
    • Input Checker (RSG)
    • Component Documentation (CS)
    • User's Guide (CS)
    • Memory leak check (WSP T&M task, TBD after high priority work is finished)

Meeting Notes

Data Type Update (WSP)

Presentation: ActivitySim Data Types Task - Progress Update 07-18-2023.pptx

  • WSP provided an update on potentially using Pandas Categoricals, as compared to the previously presented IntEnum approach
  • Advantages of Pandas Categoricals
    • Works seamlessly with current pandas operations
    • We can keep most UECs or annotations as is (with the enum option, we would have to replace those values in the UECs)
    • Better compatibility with tracing: trace output will look the same as it does today, whereas the IntEnum approach would show integer values that aren't helpful
  • Disadvantages / caveats
    • There are some instances where we dynamically create string values that overwrite existing values; these need to be redefined as categories or they will throw an error.
    • There are places in the source code with hard-coded values, so these need to be handled carefully.
    • Need to pre-define all possible categories
    • Pandas Categoricals can be fragile; operations such as merge and groupby need particular care
      • Found 3 groupbys. Addressing those would get the code to run (and results were replicated) but haven’t looked through all the UECs and fully implemented this approach.
    • Can’t perform numeric operations – so only convert string values and optimize numeric values later
    • Presentation includes a comparison between the two approaches: Pandas Categoricals and IntEnum
  • Compatibility with sharrow
    • Sharrow can’t optimize with the Categoricals.
    • Still passing an array of string values and doing string comparisons
    • If there are only a few categorical values, they could possibly be encoded as integers for comparison, and this could be built into the compiler. However, sharrow does not currently have this feature and adding it would be a considerable effort. You get the memory benefits of the data type update but no additional run-time benefits.
  • Recommendation: go with Pandas Categoricals
    • Level of effort is lower and it is backwards compatible
    • When incorporating a data model, then we could smartly incorporate IntEnum.
  • Next steps
    • Implement Categoricals
    • Rerun ARC and SANDAG memory profiling
    • Then work on numeric data type optimizations
  • Other questions/comments
    • Would string operations in the vehicle type model cause issues? Sijia to look into this more.
    • Run with full sample/full scale tests for the two implementations to understand all the possible string combinations
    • For categoricals, 90% of the changes would be under the hood and users wouldn't notice most of them. There may be instances where users of other implementations make their own changes and cause it to break. They need to make sure that any expressions they add work with categoricals, as well as with pandas, sharrow, etc. Sijia can document:
      • Best practices/guidelines for writing expressions that would be compatible
      • Where she changed things
  • Consensus: no objections to moving forward with Categorical implementation.
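
The Categorical caveats discussed above (pre-defining categories, errors when writing an undeclared value, and groupby behavior) can be illustrated with a minimal pandas sketch. The column name, mode labels, and data below are hypothetical and are not taken from any ActivitySim model or from the WSP presentation; this is only an illustration of general pandas behavior.

```python
import pandas as pd

# Hypothetical data: labels are illustrative, not from an ActivitySim model.
modes = pd.Series(["WALK", "DRIVE", "WALK", "TRANSIT", "WALK", "DRIVE"] * 1000)

# Pre-define every possible category, including ones absent from this sample.
mode_dtype = pd.CategoricalDtype(categories=["WALK", "DRIVE", "TRANSIT", "BIKE"])
modes_cat = modes.astype(mode_dtype)

# Memory: a categorical stores small integer codes plus one copy of each label.
bytes_as_str = modes.memory_usage(deep=True)
bytes_as_cat = modes_cat.memory_usage(deep=True)

# String comparisons in expressions keep working unchanged.
n_walk = int((modes_cat == "WALK").sum())

# Caveat: writing a value that is not a declared category raises an error.
try:
    modes_cat.iloc[0] = "SCOOTER"
    assign_failed = False
except (TypeError, ValueError):  # exception type has varied across pandas versions
    assign_failed = True

# Declaring the new category first makes the same assignment legal.
modes_cat = modes_cat.cat.add_categories(["SCOOTER"])
modes_cat.iloc[0] = "SCOOTER"

# Caveat: groupby on a categorical key can emit rows for unobserved categories.
df = pd.DataFrame({"mode": modes_cat, "trips": 1})
counts_all = df.groupby("mode", observed=False)["trips"].sum()
counts_observed = df.groupby("mode", observed=True)["trips"].sum()
```

Passing `observed` explicitly avoids depending on the pandas default, which has differed between versions; with `observed=False` the unused `BIKE` category appears in the result with a count of zero.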

Status Update on Technical Tasks

  • Unit Testing: Replace Orca (CS)
    • RSG and WSP reviewed.
    • WSP is building the data type update off this code.
    • Jeff fixed a number of things and added documentation. However, one outstanding item remains and should be fixed this week.
      • Sharrow requires an input file with list of TAZs.
      • Without this input file, if external stations are at the end, sharrow will not have issues because the positions of the internal stations are still correct. If external stations are mixed in, then there will be problems.
  • New Data Pipeline File Format (CS) -> done
  • Configuration Documentation / Pydantic implementation (CS)
    • Waiting for orca update to be completed.
  • Input Checker (RSG)
    • Draft revised scope to be discussed at the partners-only meeting.
  • Component Documentation (CS)
    • Waiting for orca update to be completed.
  • User's Guide (CS)
    • Waiting for orca update to be completed.
  • Memory leak check (WSP T&M task, TBD after high priority work is finished)
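
The zone-ordering issue noted under "Unit Testing: Replace Orca" can be illustrated with a toy sketch. It assumes that, when no explicit zone list is supplied, zone positions are inferred by enumerating only the internal zones; that mechanism is an assumption made for illustration and is not a description of sharrow's actual internals. The zone IDs and function names are hypothetical.

```python
def zone_positions(zone_ids):
    """Map each zone ID to its zero-based position in the given ordering."""
    return {zone: pos for pos, zone in enumerate(zone_ids)}


def internal_positions_match(full_order, internal_zones):
    """Check whether positions inferred from internal zones alone agree
    with the true positions in the full (internal + external) ordering."""
    inferred = zone_positions(internal_zones)  # assumed fallback without a zone list
    actual = zone_positions(full_order)
    return all(inferred[z] == actual[z] for z in internal_zones)


# Hypothetical IDs: 1-4 are internal zones, 901 and 902 are external stations.
internal = [1, 2, 3, 4]
externals_last = [1, 2, 3, 4, 901, 902]
externals_mixed = [1, 901, 2, 3, 902, 4]

ok_last = internal_positions_match(externals_last, internal)
ok_mixed = internal_positions_match(externals_mixed, internal)
```

Under this assumption, the inferred positions stay valid only when the external stations come after all internal zones, which is why an explicit list of TAZs removes the ambiguity.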