Backlog - TUM-DAML/seml GitHub Wiki
Possible and planned features:
- Proper documentation via readthedocs
- Tests
- define multiple (redundant) MongoDB instances to leverage our MongoDB replica set (https://pymongo.readthedocs.io/en/stable/examples/high_availability.html)
- Raise error when setting both a value and a sub-value, e.g. `a` and `a.b`, except when one of them is from a sub-config, then show the usual "special overwrites general" warning.
- Warn and confirm if user is deleting or resetting a submitted job (before cancelling it)
- Pass "Batch job submission failed" errors to user
- detect Slurm state instead of only whether the job got killed, reflect in database (raw (last seen), seml-equivalent); potentially remove KILLED state, add "reason" field instead; remove detect-killed, make `seml status` the primary way of detecting Slurm states
- `seml pause` command. Detecting paused experiments requires parsing the REASON field. We could also print the REASON for pending experiments with `seml status`.
- SEML portable mode for publishing source code: Start local experiment directly from config (no MongoDB and Slurm, only Sacred)
- `include` for including SEML base configs (which are merged into other configs)
- integrate with Tensorboard HParams for nicer evaluation
- Pausing experiments (hold/stop/suspend)
- Override config parameters via command line argument
- Ability to manually select a `mongodb.config` in the command
- Recommend using separate DB for each user. Maybe provide installation instructions?
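
The check for setting both a value and a sub-value could look roughly like the sketch below. It assumes the configuration has already been flattened into dotted keys; the function name `find_prefix_conflicts` is hypothetical and not part of SEML, and the sub-config exception mentioned in the list is not handled here.

```python
def find_prefix_conflicts(keys):
    """Return (parent, child) pairs where a key and one of its dotted sub-keys are both set."""
    key_set = set(keys)
    conflicts = []
    for key in sorted(key_set):
        parts = key.split(".")
        # Check every proper prefix of the dotted key, e.g. "a" and "a.b" for "a.b.c".
        for i in range(1, len(parts)):
            parent = ".".join(parts[:i])
            if parent in key_set:
                conflicts.append((parent, key))
    return conflicts


if __name__ == "__main__":
    for parent, child in find_prefix_conflicts(["a", "a.b", "c.d"]):
        print(f"Conflict: both '{parent}' and '{child}' are set")
```

A caller could raise an error if the returned list is non-empty, or downgrade it to a warning when one of the two keys originates from a sub-config.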
Low priority:
- Job chaining via sbatch `--dependency`
- suspend (and then restart) experiments
- Integrate with PyTorch Lightning (what would this even mean? Some convenience functions?)
- Automatic hyperparameter optimization (via Sherpa, hyperopt, Optuna?) -> parallel, on Cluster
- Make Sacred optional (makes SEML easier for beginners, and Sacred might be discontinued at some point)
- detect local experiments that failed outside Python by using the heartbeat
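
As an illustration of the job chaining item, the following sketch builds an `sbatch` command line that only starts a job after a previous one has finished successfully. The helper name `sbatch_command` and the script name are hypothetical; `--dependency=afterok:<job_id>` and `--parsable` are standard `sbatch` options. This is one possible approach, not how SEML currently submits jobs.

```python
def sbatch_command(script, after_job_id=None, dependency_type="afterok"):
    """Build an sbatch argument list, optionally chained to a previous job."""
    # --parsable makes sbatch print the job ID in a machine-readable form.
    cmd = ["sbatch", "--parsable"]
    if after_job_id is not None:
        # The job only becomes eligible once the dependency is satisfied.
        cmd.append(f"--dependency={dependency_type}:{after_job_id}")
    cmd.append(script)
    return cmd


if __name__ == "__main__":
    # "train.sh" and the job ID 123 are placeholder values.
    print(" ".join(sbatch_command("train.sh", after_job_id=123)))
```

The returned list can be passed to `subprocess.run`; the job ID printed by the first submission would then be fed into the next call as `after_job_id`.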