03-29-2024 Weekly Tag Up - HIRO-group/marl-experiments GitHub Wiki
## Attendees
## Status
- Bug identified and squashed
- Our updated "double threshold" single-objective policy performs very similarly to the "queue" policy (according to g1) in some cases
- In other cases, the "queue" policy performs BETTER than the single-objective policy (according to g1)
- This could mean that our new reward definition induces speed behavior similar to that of the queue policy
- However, the queue policy always performs MUCH BETTER than the double threshold policy according to g2
- We may have just needed to train the double threshold policies longer (though this seems unlikely, since training appeared to converge)
- Reran the batch offline learning experiment with the "queue" policy and the "threshold 1.0/13.89" policy
## Next Steps
- Take out the square-root part of the reward definition and regenerate the table from experiment 20
- We need the queue policy to perform much worse than the excess-speed policy
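For reference, the reward change discussed above could look something like the sketch below. This is a hypothetical reconstruction, not the repo's actual reward: the band `[1.0, 13.89]` m/s is taken from the "threshold 1.0/13.89" policy name, and the square-root penalty shape, the function name, and the zero-reward-inside-the-band behavior are all assumptions for illustration.

```python
import math

# Assumed speed band, from the "threshold 1.0/13.89" policy name (m/s)
LOWER_SPEED_THRESHOLD = 1.0
UPPER_SPEED_THRESHOLD = 13.89  # roughly 50 km/h

def double_threshold_reward(speed: float, use_sqrt: bool = True) -> float:
    """Hypothetical 'double threshold' speed reward.

    Zero penalty inside [LOWER, UPPER]; outside the band, penalize the
    excess either with a square root (current form) or linearly
    (the proposed change: drop the square-root part).
    """
    if speed < LOWER_SPEED_THRESHOLD:
        excess = LOWER_SPEED_THRESHOLD - speed
    elif speed > UPPER_SPEED_THRESHOLD:
        excess = speed - UPPER_SPEED_THRESHOLD
    else:
        return 0.0  # inside the allowed band: no penalty
    # Removing the sqrt makes the penalty grow linearly with excess speed,
    # punishing large violations more heavily than the sqrt form does.
    return -math.sqrt(excess) if use_sqrt else -excess
```

The intuition behind the change: a square root flattens the penalty for large violations, so a linear (or steeper) penalty should separate the queue policy's g1 performance from the excess-speed policy's more clearly.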