# MoMA tutorial: batch processing
Note
This is part 3 of the MoMA tutorial. It explains how to analyse large datasets by running several instances of MoMA in batches (either sequentially or in parallel). Please read the tutorial introduction for a general overview of the analysis workflow.
Batch processing is most useful when performed on an HPC server with the Slurm job scheduler. Please note that using MoMA with Slurm requires setting up a Gurobi license server.
This page describes how to process large datasets with MoMA using the `moma_batch_run` script.
Different approaches have been implemented to improve performance when analyzing large datasets:
1. Running U-Net on GPU. Unfortunately, this can be cumbersome to set up on HPC (at least on ours). The improvement is more modest for prediction than for training, and this computation can easily be deferred by running U-Net on CPU in parallel and saving the output to disk (see approach 3).
2. Running sequential batches (no user interaction required, but time-consuming).
3. Running batches in parallel with Slurm.

This part of the tutorial explains approaches 2 and 3, which rely on saving the data computed by U-Net to disk and reloading it at MoMA startup when it is available.
The program that facilitates the batch processing is called `moma_batch_run`.
## `moma_batch_run` options
You can print the usage help with `moma_batch_run -help`:
```
$ moma_batch_run -help
usage: moma_batch_run [-h]
                      (-help | -version | -delete_gl_analysis yaml_config_file | -track yaml_config_file | -curate yaml_config_file | -export yaml_config_file)
                      [-l LOG] [-select SELECT] [-f] [-ff]

optional arguments:
  -h, --help            show this help message and exit
  -l LOG, --log LOG     path to the log-file for this batch-run; derived from
                        'yaml_config_file' and stored next to it, if not
                        specified
  -select SELECT, --select SELECT
                        run on selection of GLs specified in Python
                        dictionary-format; GLs must be defined in
                        'yaml_config_file'; example: "{0:{1,2}, 3:{4,5}}",
                        where 0, 3 are position indices and 1, 2, 4, 5 are GL
                        indices
  -f, --force           force the operation
  -ff, --fforce         force operation when deleting data; e.g. with option
                        '-delete-analysis'

required (mutually exclusive) arguments:
  -help
  -version              show program's version number and exit
  -delete_gl_analysis yaml_config_file
                        delete analysis files of specified GLs; WARNING: this
                        will remove ALL analysis-files for the GLs
  -track yaml_config_file
                        run batch-tracking of GLs
  -curate yaml_config_file
                        run interactive curation of GLs
  -export yaml_config_file
                        run batch-export of tracking results
```
The options `-track`, `-curate`, and `-export` allow you to split the analysis into a three-step process, where only step 2 requires user interaction.
This workflow hides the waiting times for:
- running U-Net and the first optimization
- exporting the results
Note
About GPU and performance
The U-Net model is only run when calling `moma_batch_run -track ...`. This step generates all necessary data and stores it to disk for the following steps. Only this step is accelerated significantly when run on a GPU, but in practice this is not needed when analysing growth channels in parallel with Slurm. The `-curate` and `-export` steps do not benefit from a GPU, because the data is read from disk.
To use this workflow, you must define a YAML configuration file like the one below. Note that in YAML the indentation that defines a block is critical:
```yaml
file_version: 0.3.0
slurm: False
preprocessing_path: /media/micha/T7/data_michael_mell/moma_test_data/000_development/feature/20220121-fix-loading-of-curated-datasets/Lis/20211026/20211026__lis__process_positions__output__20211101-013152
default_moma_arg:
  analysis: 'test_analysis'
  p: '/home/micha/Documents/01_work/15_moma_notes/02_moma_development/feature/20220801-implement-python-batch-processing-script/mm.properties'
  tmax: 10
pos:
  0:
    gl: {
      12:,
      23:
    }
  15:
    gl: {4,
         18
    }
    moma_arg: {
      p: '/home/micha/Documents/01_work/15_moma_notes/02_moma_development/feature/20220801-implement-python-batch-processing-script/mm.properties',
      tmax: 100
    }
```
It defines:
- `preprocessing_path`: The path to the preprocessed experiment (i.e. the folder containing the `PosXX` folders).
- `slurm`: Controls whether the script will dispatch the processing of individual GLs to Slurm during tracking and export. See the Slurm section below for more details. Possible values: [`False`, `True`, `path_slurm_head`]
- `default_moma_arg`: A list of arguments that will be passed to MoMA for processing each GL. These values are turned into command-line parameters when MoMA is called for a given TIFF-image of a GL (see the sketch after this list). The parameters are:
  - `analysis` (required): The name of the analysis that is being performed. This name should be unique. A folder with this name will be created inside each processed GL in the directory tree specified in `preprocessing_path`.
  - `p` (optional): Path to the `mm.properties` file. I recommend making a dedicated copy of your `mm.properties` for each batch-processing analysis. You can copy it from your home folder on Scicore here: `~/.moma/mm.properties`. It makes sense to keep this dedicated copy next to your YAML file.
  - `tmax` (optional): This is an example of an optional argument. It would be passed as `-tmax 10` to the `moma` command during execution.
- `pos` (required): The index numbers of the positions that should be processed. For each index value under `pos` you then specify the GL indices that should be processed using `gl`.
- `gl` (required): The list of GL indices that will be processed for a given position.
- `moma_arg` (optional): This can optionally be added to either a `pos` or a `gl` field to overwrite the settings in `default_moma_arg`, in case you need to process some GLs with different settings. The restrictions for this setting are:
  - It is not allowed to overwrite the `analysis` setting (to ensure the same name for every analysis that is run from this script).
  - It must overwrite all other settings that are defined in `default_moma_arg`.
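As a rough illustration of how `default_moma_arg` maps onto the MoMA call (this is a sketch, not the script's exact invocation; the shortened properties path and the way the GL TIFF image is passed are assumptions):

```
# hypothetical call for a single GL, built from default_moma_arg above;
# <path-to-GL-TIFF> and /path/to/mm.properties are placeholders
moma -analysis test_analysis -p /path/to/mm.properties -tmax 10 <path-to-GL-TIFF>
```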
With the batch-processing correctly configured (see step 1) and the YAML file in place, you can then start your workflow:
1. Run the batch tracking: `moma_batch_run -track /path/to/config_file.yaml`
2. Run the batch curation: `moma_batch_run -curate /path/to/config_file.yaml`
3. Run the batch export: `moma_batch_run -export /path/to/config_file.yaml`
If you want to re-run the analysis for a particular GL, you need to pass the argument `-f` to enable overwriting previous data (a backup of the previous state will be created; but this still needs more testing, so tread with care...). Furthermore, you can use the option `-select` to run on a subset of the GLs defined in the YAML file. So to re-curate a subset of the GLs you would do something like this:

```
moma_batch_run -f -curate /path/to/config_file.yaml -select "{0:{1,2}, 3:{4,5}}"  # "{0:{1,2}, 3:{4,5}}" must be positions and GLs that are defined in the YAML file
```
Running the batch processing will create a number of analysis files inside each GL folder.
Additional notes:
- For the moment, and until more in-depth testing is done, it makes sense to run only subsections of the YAML file piece by piece. You can achieve this by commenting out the parts that you do not want to run. E.g. this only processes GL 12 of position 0:

```yaml
file_version: 0.3.0
slurm: True
preprocessing_path: /media/micha/T7/data_michael_mell/moma_test_data/000_development/feature/20220121-fix-loading-of-curated-datasets/Lis/20211026/20211026__lis__process_positions__output__20211101-013152
default_moma_arg:
  analysis: 'test_analysis'
  p: '/home/micha/Documents/01_work/15_moma_notes/02_moma_development/feature/20220801-implement-python-batch-processing-script/mm.properties'
  tmax: 10
pos:
  0:
    gl: {
      12:,
#      23:
    }
#  15:
#    gl: {4,
#         18
#    }
#    moma_arg: {
#      p: '/home/micha/Documents/01_work/15_moma_notes/02_moma_development/feature/20220801-implement-python-batch-processing-script/mm.properties',
#      tmax: 100
#    }
```
When a log file is specified using the option `-l LOG, --log LOG`, all information is logged to this file.
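For example (both paths are placeholders):

```
# run the tracking step and write the log to an explicitly chosen file
moma_batch_run -track /path/to/config_file.yaml -l /path/to/my_batch_run.log
```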
When the user does not specify a log-file, `moma_batch_run` will automatically generate separate log files for each of the arguments `-track`, `-curate`, `-export`, and `-delete`. These files are generated inside a sub-directory next to the YAML file, which has the same name as the YAML file with "_log" appended. I.e. for the YAML file `test_analysis.yaml` the produced folder structure would look like this, once track, curate and export were run:
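A minimal sketch of that layout (the directory name follows the `_log` rule above; the individual log-file names shown here are only an assumption):

```
test_analysis.yaml
test_analysis_log/
├── track.log
├── curate.log
└── export.log
```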
Each log file is created the first time the corresponding run-type is performed.
These steps are somewhat dangerous, because you can accidentally delete files if you make a mistake. Please double-check your actions when following them.

If you need to restart your analysis from scratch, do the following:
- Delete the log-file of the batch run: it is located next to the YAML config-file of the batch run and has the same name, but with the `.log` extension (take care not to accidentally delete the YAML file itself).
- Delete all analysis folders that were generated (this is a bit dangerous):
To find all analysis folders that were generated by the script, you can execute the following from bash:

```
find . -type d -name ANALYSIS_NAME -prune -exec echo {} \;
```

where `ANALYSIS_NAME` is the string that you set in the YAML file with the field `analysis: 'ANALYSIS_NAME'`.
You can delete the folders, after you have confirmed that the found folders are correct (and please TRIPLE CHECK this!), by running:

```
find . -type d -name ANALYSIS_NAME -prune -exec rm -rf {} \;
```
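Alternatively, for GLs that are defined in the YAML file, the script's own `-delete_gl_analysis` option (see the usage output above) can remove the analysis files; a sketch, assuming you only want to delete the analysis of GL 12 of position 0 and have double-checked the selection (`-ff` is the force flag for deletions documented in the usage above):

```
moma_batch_run -delete_gl_analysis /path/to/config_file.yaml -ff -select "{0:{12}}"
```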
The script can dispatch the processing of each GL to separate Slurm jobs when the Slurm workload manager is available. When active, this will be used for the options `-track` and `-export`. The setting `slurm` in the YAML file controls this behavior. It takes the following values:
- `False`: Slurm use is disabled.
- `True`: Slurm will be used with a default header that is located at `$HOME/.moma/batch_run_slurm_header.txt`. The default header file will be created if it does not exist.
- Path: When a path to a Slurm header file is provided, the script will use that header.
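For example, the corresponding excerpt of the batch-run YAML file could look like this (the path is a placeholder):

```yaml
# use a custom Slurm header instead of the default one in ~/.moma/
slurm: /path/to/custom_slurm_header.txt
```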
This is the content of the default Slurm header file:
```
#SBATCH --mem=32G
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --qos=30min
#SBATCH --time=00:29:59
#SBATCH --open-mode=append
```
Warning
Do not remove the setting `#SBATCH --open-mode=append` when editing this file. It ensures continuous log entries in the `moma.log` of each GL.
Note
MoMA runs in single-threaded mode with default settings, so your job does not need to allocate more than 1 CPU core.
The script will output commands to query and cancel the Slurm jobs after dispatching them:
```
FINISHED DISPATCHING SLURM JOBS.
To QUERY slurm jobs run:
squeue -u $USER | grep '[ANALYSIS_NAME]*'
To CANCEL slurm jobs run:
squeue -u $USER | grep '[ANALYSIS_NAME]*' | awk '{print $1}' | xargs scancel
```
where `[ANALYSIS_NAME]` will be replaced with the analysis name that was provided in the YAML file with the setting `analysis:`.
The following files will be created in the track-data folder of each GL belonging to the analysis:
- `moma.log`: The log-file is the same for MoMA runs with and without Slurm. It is continually appended to across runs, which makes it possible to track changes to the tracking/curation/export of a GL.
- `moma_slurm_script_track.sh`: The bash-script that was used to run the Slurm job for tracking.
- `moma_slurm_script_export.sh`: The bash-script that was used to run the Slurm job for exporting.