Backlog - TUM-DAML/seml GitHub Wiki

Possible and planned features:

  • Proper documentation via readthedocs
  • Tests
  • Define multiple (redundant) MongoDB instances to leverage our MongoDB replica set (https://pymongo.readthedocs.io/en/stable/examples/high_availability.html)
  • Raise an error when setting both a value and a sub-value, e.g. a and a.b, except when one of them comes from a sub-config; in that case, show the usual "special overwrites general" warning.
  • Warn and confirm if user is deleting or resetting a submitted job (before cancelling it)
  • Pass "Batch job submission failed" errors to user
  • Detect the Slurm state instead of only whether the job got killed, and reflect it in the database (raw (last seen) state and SEML equivalent). Potentially remove the KILLED state and add a "reason" field instead. Remove detect-killed and make seml status the primary way of detecting Slurm states.
  • seml pause command. Detecting paused experiments requires parsing the REASON field. seml status could also print the REASON for pending experiments.
  • SEML portable mode for publishing source code: Start local experiment directly from config (no MongoDB and Slurm, only Sacred)
  • include: a way to include SEML base configs (which are merged into other configs)
  • Integrate with TensorBoard HParams for nicer evaluation
  • Pausing experiments (hold/stop/suspend)
  • Override config parameters via command line argument
  • Ability to manually select a mongodb.config in the command
  • Recommend using a separate DB for each user. Maybe provide installation instructions?
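
To make the "value and sub-value" item above concrete, here is a minimal sketch of how such a conflict could be detected. It assumes the config has already been flattened into dotted keys; the function name find_conflicting_keys is hypothetical and not part of SEML. The sub-config exception mentioned in the item is not handled here.

```python
def find_conflicting_keys(flat_config):
    """Return (parent, child) pairs where both a value and a sub-value are set.

    flat_config is assumed to map dotted keys (e.g. "a.b") to values.
    This helper is hypothetical and only illustrates the check.
    """
    keys = set(flat_config)
    conflicts = []
    for key in sorted(keys):
        parts = key.split(".")
        # Check every proper prefix of the key, e.g. "a" for "a.b".
        for i in range(1, len(parts)):
            parent = ".".join(parts[:i])
            if parent in keys:
                conflicts.append((parent, key))
    return conflicts


if __name__ == "__main__":
    config = {"a": 1, "a.b": 2, "c.d": 3}
    for parent, child in find_conflicting_keys(config):
        print(f"Conflict: both '{parent}' and '{child}' are set.")
```

A real implementation would additionally need to know which keys originate from a sub-config, so that it can downgrade the error to the usual warning in that case.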

Low priority:

  • Job chaining via sbatch --dependency
  • Suspend (and then restart) experiments
  • Integrate with PyTorch Lightning (what would this even mean? Some convenience functions?)
  • Automatic hyperparameter optimization (via Sherpa, hyperopt, Optuna?), run in parallel on the cluster
  • Make Sacred optional (makes SEML easier for beginners, and Sacred might be discontinued at some point)
  • Detect local experiments that failed outside Python by using the heartbeat
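
As an illustration of the heartbeat item above, the following sketch flags a run as presumably dead when its last heartbeat is older than a timeout. It assumes each running experiment records a last-heartbeat timestamp; the function name is_stale and the five-minute default are hypothetical choices, not existing SEML behavior.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_heartbeat, now, timeout=timedelta(minutes=5)):
    """Return True if the last heartbeat is older than the timeout.

    The timeout default is an arbitrary assumption for illustration.
    """
    return now - last_heartbeat > timeout


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    recent = now - timedelta(minutes=1)
    old = now - timedelta(minutes=30)
    print(is_stale(recent, now))  # heartbeat within the timeout
    print(is_stale(old, now))     # heartbeat older than the timeout
```

The timeout should be several times larger than the heartbeat interval, so that a briefly delayed heartbeat does not mark a healthy experiment as failed.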