03-29-2024 Weekly Tag Up - HIRO-group/marl-experiments GitHub Wiki
## Attendees
## Status
- Bug identified and squashed
- Our updated "double threshold" single-objective policy performs very similarly to the "queue" policy (according to g1) in some cases
- In other cases, the "queue" policy performs BETTER than the single-objective policy (according to g1)
- This could mean that our new reward definition induces speed behavior similar to that of the queue policy
- However, the queue policy always performs MUCH BETTER than the double threshold policy according to g2
- We may have just needed to train the double threshold policies longer (though this seems unlikely, since training appeared to converge)
- Reran the batch offline learning experiment with the "queue" policy and the "threshold 1.0/13.89" policy
## Next Steps
- Take out the square-root part of the reward definition and regenerate the table from experiment 20
- We need the queue policy to perform much worse than the excess-speed policy
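For reference, the reward change discussed above could look something like the sketch below. This is a hypothetical reconstruction, not the repo's actual reward: the band `[1.0, 13.89]` m/s is taken from the "threshold 1.0/13.89" policy name, and the square-root penalty shape, the function name, and the zero-reward-inside-the-band behavior are all assumptions for illustration.

```python
import math

# Assumed speed band, from the "threshold 1.0/13.89" policy name (m/s)
LOWER_SPEED_THRESHOLD = 1.0
UPPER_SPEED_THRESHOLD = 13.89  # roughly 50 km/h

def double_threshold_reward(speed: float, use_sqrt: bool = True) -> float:
    """Hypothetical 'double threshold' speed reward.

    Zero penalty inside [LOWER, UPPER]; outside the band, penalize the
    excess either with a square root (current form) or linearly
    (the proposed change: drop the square-root part).
    """
    if speed < LOWER_SPEED_THRESHOLD:
        excess = LOWER_SPEED_THRESHOLD - speed
    elif speed > UPPER_SPEED_THRESHOLD:
        excess = speed - UPPER_SPEED_THRESHOLD
    else:
        return 0.0  # inside the allowed band: no penalty
    # Removing the sqrt makes the penalty grow linearly with excess speed,
    # punishing large violations more heavily than the sqrt form does.
    return -math.sqrt(excess) if use_sqrt else -excess
```

The intuition behind the change: a square root flattens the penalty for large violations, so a linear (or steeper) penalty should separate the queue policy's g1 performance from the excess-speed policy's more clearly.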