Backlog - TUM-DAML/seml GitHub Wiki
Possible and planned features:
- Proper documentation via readthedocs
- Tests
- Define multiple (redundant) MongoDB instances to leverage our MongoDB replica set (https://pymongo.readthedocs.io/en/stable/examples/high_availability.html)
- Raise error when setting both a value and a sub-value, e.g. `a` and `a.b`, except when one of them is from a sub-config, then show the usual "special overwrites general" warning.
- Warn and confirm if user is deleting or resetting a submitted job (before cancelling it)
- Pass "Batch job submission failed" errors to user
- detect Slurm state instead of only whether the job got killed, reflect in database (raw (last seen), seml-equivalent); potentially remove KILLED state, add "reason" field instead; remove detect-killed, make `seml status` the primary way of detecting Slurm states
- `seml pause` command. Detecting paused experiments requires parsing the REASON field. We could print the REASON for pending experiments also with `seml status`.
- SEML portable mode for publishing source code: Start local experiment directly from config (no MongoDB and Slurm, only Sacred)
- `include` for including SEML base configs (which are merged into other configs)
- integrate with Tensorboard HParams for nicer evaluation
- Pausing experiments (hold/stop/suspend)
- Override config parameters via command line argument
- Ability to manually select a `mongodb.config` in the command
- Recommend using separate DB for each user. Maybe provide installation instructions?
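
The value/sub-value check from the list above (setting both `a` and `a.b`) could look roughly like the following. This is only a sketch, not SEML's implementation: the function names `leaf_paths`, `find_prefix_conflicts` and `check_config` are hypothetical, the config is assumed to be a plain nested dict that may also contain dotted keys, and the sub-config exception ("special overwrites general") is not handled.

```python
from typing import Any, Dict, List, Tuple


def leaf_paths(config: Dict[str, Any], prefix: str = "") -> List[str]:
    """Collect the dotted path of every non-dict (or empty-dict) value."""
    paths = []
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict) and value:
            paths.extend(leaf_paths(value, path))
        else:
            paths.append(path)
    return paths


def find_prefix_conflicts(paths: List[str]) -> List[Tuple[str, str]]:
    """Return (parent, child) pairs where a value and one of its sub-values are both set."""
    path_set = set(paths)
    conflicts = []
    for path in sorted(path_set):
        parts = path.split(".")
        # Check every proper dotted prefix of this path, e.g. "a" for "a.b".
        for i in range(1, len(parts)):
            parent = ".".join(parts[:i])
            if parent in path_set:
                conflicts.append((parent, path))
    return conflicts


def check_config(config: Dict[str, Any]) -> None:
    """Raise ValueError if the config sets both a value and a sub-value."""
    conflicts = find_prefix_conflicts(leaf_paths(config))
    if conflicts:
        details = ", ".join(f"'{p}' and '{c}'" for p, c in conflicts)
        raise ValueError(f"Conflicting parameter definitions: {details}")
```

Calling `check_config({"a": 1, "a.b": 2})` raises a `ValueError`, while `check_config({"a": {"c": 1}, "a.b": 2})` passes because `a` is only a container for sub-values there.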
Low priority:
- Job chaining via sbatch `--dependency`
- suspend (and then restart) experiments
- Integrate with PyTorch Lightning (what would this even mean? Some convenience functions?)
- Automatic hyperparameter optimization (via Sherpa, hyperopt, Optuna?) -> parallel, on Cluster
- Make Sacred optional (makes SEML easier for beginners, and Sacred might be discontinued at some point)
- detect local experiments that failed outside Python by using the heartbeat
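
For the job chaining item above, a minimal sketch of how an `sbatch` call with `--dependency` might be assembled. The helper `build_sbatch_command` and the script name `train.sh` are hypothetical and not part of SEML; the sketch only builds the argument list and assumes the job IDs of the preceding jobs are already known.

```python
from typing import List, Optional


def build_sbatch_command(
    script: str,
    depends_on: Optional[List[str]] = None,
    dependency_type: str = "afterok",
) -> List[str]:
    """Build an sbatch argument list, optionally chained to earlier jobs.

    `--parsable` makes sbatch print the job ID in a machine-readable form,
    so the ID can be fed into the next job's `--dependency`.
    """
    cmd = ["sbatch", "--parsable"]
    if depends_on:
        # Slurm syntax: --dependency=<type>:<jobid>[:<jobid>...]
        cmd.append(f"--dependency={dependency_type}:{':'.join(depends_on)}")
    cmd.append(script)
    return cmd


# Hypothetical example: start train.sh only after jobs 123 and 456 succeeded.
print(build_sbatch_command("train.sh", depends_on=["123", "456"]))
```

The resulting list can be passed to `subprocess.run(cmd, capture_output=True, text=True, check=True)`; with `--parsable`, the first `;`-separated field of the output is the new job ID.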