Backlog - TUM-DAML/seml GitHub Wiki

Possible and planned features:

Proper documentation via readthedocs
Tests
Define multiple (redundant) MongoDB instances to leverage our MongoDB replica set (https://pymongo.readthedocs.io/en/stable/examples/high_availability.html)
Raise an error when setting both a value and a sub-value, e.g. a and a.b, except when one of them comes from a sub-config; in that case, show the usual "special overwrites general" warning.
Warn and ask for confirmation if the user is deleting or resetting a submitted job (before cancelling it)
Pass "Batch job submission failed" errors on to the user
Detect the Slurm state instead of only whether the job got killed, and reflect it in the database (raw (last seen) and seml-equivalent). Potentially remove the KILLED state and add a "reason" field instead. Remove detect-killed and make seml status the primary way of detecting Slurm states.
seml pause command. Detecting paused experiments requires parsing the REASON field. We could also print the REASON for pending experiments in seml status.
SEML portable mode for publishing source code: Start local experiment directly from config (no MongoDB and Slurm, only Sacred)
An include mechanism for pulling in SEML base configs (which are merged into other configs)
Integrate with TensorBoard HParams for nicer evaluation
Pausing experiments (hold/stop/suspend)
Override config parameters via command line argument
Ability to manually select a mongodb.config in the command
Recommend using a separate DB for each user. Maybe provide installation instructions?
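
The two config-related items above (command line overrides and the a vs. a.b conflict check) could fit together roughly as follows. This is only a sketch, not existing SEML code: the function names parse_overrides, find_conflicts and unflatten are hypothetical, the key=value syntax is an assumption, and the sub-config exception ("special overwrites general") is not handled.

```python
import ast


def parse_overrides(args):
    """Parse 'key=value' strings into a flat dict with dotted keys."""
    flat = {}
    for arg in args:
        key, sep, raw = arg.partition("=")
        if not sep or not key:
            raise ValueError(f"Expected key=value, got {arg!r}")
        try:
            # Interpret numbers, booleans, lists, etc. as Python literals.
            value = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            value = raw  # Anything else is kept as a plain string.
        flat[key] = value
    return flat


def find_conflicts(keys):
    """Return (parent, child) pairs where a key and one of its sub-keys are both set."""
    keys = set(keys)
    conflicts = []
    for key in sorted(keys):
        parts = key.split(".")
        for i in range(1, len(parts)):
            parent = ".".join(parts[:i])
            if parent in keys:
                conflicts.append((parent, key))
    return conflicts


def unflatten(flat):
    """Turn a flat dotted-key dict into a nested dict, refusing ambiguous input."""
    conflicts = find_conflicts(flat)
    if conflicts:
        raise ValueError(f"Both a value and a sub-value were set: {conflicts}")
    nested = {}
    for key, value in flat.items():
        *parents, leaf = key.split(".")
        node = nested
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested
```

For example, unflatten(parse_overrides(["model.lr=0.01", "seed=3"])) yields a nested dict, while setting both a and a.b raises a ValueError instead of silently picking one.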

Low priority:

Job chaining via sbatch --dependency
Suspend (and then restart) experiments
Integrate with PyTorch Lightning (what would this even mean? Some convenience functions?)
Automatic hyperparameter optimization (via Sherpa, hyperopt, Optuna?) -> in parallel, on the cluster
Make Sacred optional (makes SEML easier for beginners, and Sacred might be discontinued at some point)
Detect local experiments that failed outside Python by using the heartbeat
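
The heartbeat idea in the last item could look roughly like the sketch below. It is an illustration under assumptions, not existing SEML code: the function name find_stale_runs and the two-minute threshold are hypothetical, and the field names status, heartbeat and _id are assumed to match the run entries stored in MongoDB.

```python
from datetime import datetime, timedelta


def find_stale_runs(runs, now, max_silence=timedelta(minutes=2)):
    """Return the IDs of RUNNING entries whose heartbeat is missing or too old."""
    stale = []
    for run in runs:
        if run.get("status") != "RUNNING":
            continue  # Only running experiments are expected to send heartbeats.
        heartbeat = run.get("heartbeat")
        if heartbeat is None or now - heartbeat > max_silence:
            stale.append(run["_id"])
    return stale


# Example entries (made-up values for illustration only).
now = datetime(2024, 1, 1, 12, 0, 0)
runs = [
    {"_id": 1, "status": "RUNNING", "heartbeat": now - timedelta(seconds=10)},
    {"_id": 2, "status": "RUNNING", "heartbeat": now - timedelta(minutes=30)},
    {"_id": 3, "status": "COMPLETED", "heartbeat": now - timedelta(hours=5)},
]
stale_ids = find_stale_runs(runs, now)
```

Here only the second entry is reported, since it still claims to be running but has not sent a heartbeat for longer than the threshold. A process killed outside Python (e.g. by the OS) cannot update its own status, which is why an external check of this kind would be needed.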