# moma batch processing
## Introduction
This page describes how to process large datasets with MoMA using the `moma_batch_run` script.
## Usage
The program that facilitates the batch-processing is called `moma_batch_run`. You can print usage help with `moma_batch_run -help`:
```
$ moma_batch_run -help
usage: moma_batch_run [-h]
                      (-help | -version | -delete_gl_analysis yaml_config_file | -track yaml_config_file | -curate yaml_config_file | -export yaml_config_file)
                      [-l LOG] [-select SELECT] [-f] [-ff]

optional arguments:
  -h, --help            show this help message and exit
  -l LOG, --log LOG     path to the log-file for this batch-run; derived from
                        'yaml_config_file' and stored next to it, if not
                        specified
  -select SELECT, --select SELECT
                        run on selection of GLs specified in Python
                        dictionary-format; GLs must be defined in
                        'yaml_config_file'; example: "{0:{1,2}, 3:{4,5}}",
                        where 0, 3 are position indices and 1, 2, 4, 5 are GL
                        indices
  -f, --force           force the operation
  -ff, --fforce         force operation when deleting data; e.g. with option
                        '-delete-analysis'

required (mutually exclusive) arguments:
  -help
  -version              show program's version number and exit
  -delete_gl_analysis yaml_config_file
                        delete analysis files of specified GLs; WARNING: this
                        will remove ALL analysis-files for the GLs
  -track yaml_config_file
                        run batch-tracking of GLs
  -curate yaml_config_file
                        run interactive curation of GLs
  -export yaml_config_file
                        run batch-export of tracking results
```
The options `-track`, `-curate`, and `-export` allow you to split the curation into a three-step process, where only step 2 requires user interaction (see the steps further below).
This workflow hides the waiting times for:

- running U-Net and the first optimization
- exporting the results
## About GPU usage
The U-Net model is only run when calling `moma_batch_run -track ...`. This step generates all necessary data and stores it to disk for the following steps. It is significantly accelerated when run on a GPU.
The `-curate` and `-export` steps do not benefit from a GPU, because the data is read from disk. Hence they can be run on a CPU-only machine without performance drawbacks.
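For example, here is a minimal sketch of how the steps could be split across machines; the host name `gpu-node` is a placeholder, only the `moma_batch_run` calls are taken from this page:

```bash
# Run the GPU-heavy tracking step on a machine with a GPU (host name is hypothetical):
ssh gpu-node "moma_batch_run -track /path/to/config_file.yaml"

# Curation and export read the data generated by the tracking step from disk,
# so a CPU-only machine is sufficient:
moma_batch_run -curate /path/to/config_file.yaml
moma_batch_run -export /path/to/config_file.yaml
```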
To use this workflow, you must define a YAML configuration file like the one below. Note that in YAML the indentation that defines a block is critical:
```yaml
file_version: 0.3.0
slurm: False
preprocessing_path: /media/micha/T7/data_michael_mell/moma_test_data/000_development/feature/20220121-fix-loading-of-curated-datasets/Lis/20211026/20211026__lis__process_positions__output__20211101-013152
default_moma_arg:
    analysis: 'test_analysis'
    p: '/home/micha/Documents/01_work/15_moma_notes/02_moma_development/feature/20220801-implement-python-batch-processing-script/mm.properties'
    tmax: 10
pos:
    0:
        gl: {
            12:,
            23:
        }
    15:
        gl: {4,
            18
        }
        moma_arg: {
            p: '/home/micha/Documents/01_work/15_moma_notes/02_moma_development/feature/20220801-implement-python-batch-processing-script/mm.properties',
            tmax: 100
        }
```
It defines:
- `preprocessing_path`: The path to the preprocessed experiment (i.e. the folder containing the `PosXX` folders).
- `slurm`: Controls whether the script will dispatch the processing of individual GLs to Slurm during tracking and export. See the section "Using Slurm" below for more details. Possible values: [`False`, `True`, `path_slurm_head`]
- `default_moma_arg`: A list of arguments that will be passed to MoMA for processing each GL. These values are turned into command-line parameters when MoMA is called for a given TIFF-image of a GL. Necessary parameters are:
  - `analysis` (required): The name of the analysis that is being performed. This name should be unique. A folder with this name will be created inside each processed GL in the directory-tree specified in `preprocessing_path`.
  - `p` (optional): Path to the `mm.properties` file. I recommend making a dedicated copy of your `mm.properties` for each batch-processing analysis. You can copy it from your home folder on Scicore here: `~/.moma/mm.properties`. It makes sense to keep this dedicated copy next to your YAML file.
  - `tmax` (optional): This is an example of an optional argument. It would be passed as `-tmax 10` to the `moma` command during execution.
- `pos` (required): The index numbers of the positions that should be processed. For each index value under `pos` you then specify the GL indices that should be processed using `gl`.
  - `gl` (required): The list of GL indices that will be processed for a given position.
  - `moma_arg` (optional): This can optionally be added to either a `pos` or `gl` field to overwrite the settings in `default_moma_arg`, in case you need to process some GLs with different settings. The restrictions for this setting are (see the sketch after this list):
    - It is not allowed to overwrite the `analysis` setting (to ensure the same name for every analysis that is run from this script).
    - It must overwrite all other settings that are defined in `default_moma_arg`.
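To make these restrictions concrete, here is a minimal sketch of an override for position 0; the path and the `tmax` value are placeholders: `analysis` is not repeated, while all other settings from `default_moma_arg` (`p`, `tmax`) are restated:

```yaml
pos:
    0:
        gl: {
            12:
        }
        moma_arg: {
            # 'analysis' must NOT be overwritten here;
            # all other settings from default_moma_arg must be restated:
            p: '/path/to/dedicated/mm.properties',  # placeholder path
            tmax: 50  # placeholder value
        }
```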
With the batch-processing correctly configured (see step 1) and the YAML file in place, you can then start your workflow:
1. Run the batch tracking: `moma_batch_run -track /path/to/config_file.yaml`
2. Run the batch curation: `moma_batch_run -curate /path/to/config_file.yaml`
3. Run the batch export: `moma_batch_run -export /path/to/config_file.yaml`
If you want to re-run the analysis for a particular GL, you need to pass the argument `-f` to enable overwriting previous data (a backup of the previous state will be created, but this still needs more testing, so tread with care...). Furthermore, you can use the option `-select` to run on a subset of the GLs defined in the YAML file. So to re-curate a subset of the GLs you would do something like this:
```
moma_batch_run -f -curate /path/to/config_file.yaml -select "{0:{1,2}, 3:{4,5}}"  # the positions and GLs in the selection must be defined in the YAML file
```
Running the batch-processing will create files similar to this in each GL folder:
Additional notes:

- For the moment and until more in-depth testing has been done, it makes sense to run only subsections of the YAML file piece-by-piece. You can achieve this by commenting out the parts that you do not want to run. E.g. this only processes GL 12 of position 0:
```yaml
file_version: 0.3.0
slurm: True
preprocessing_path: /media/micha/T7/data_michael_mell/moma_test_data/000_development/feature/20220121-fix-loading-of-curated-datasets/Lis/20211026/20211026__lis__process_positions__output__20211101-013152
default_moma_arg:
    analysis: 'test_analysis'
    p: '/home/micha/Documents/01_work/15_moma_notes/02_moma_development/feature/20220801-implement-python-batch-processing-script/mm.properties'
    tmax: 10
pos:
    0:
        gl: {
            12:,
#            23:
        }
#    15:
#        gl: {4,
#            18
#        }
#        moma_arg: {
#            p: '/home/micha/Documents/01_work/15_moma_notes/02_moma_development/feature/20220801-implement-python-batch-processing-script/mm.properties',
#            tmax: 100
#        }
```
## About log-file output
When a log file is specified using the option `-l LOG, --log LOG`, all information is logged to this file.
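For example, to write the log of a tracking run to an explicit location (the paths are placeholders):

```
moma_batch_run -track /path/to/config_file.yaml -l /path/to/batch_tracking.log
```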
When the user does not specify a log-file, `moma_batch_run` will automatically generate separate log files for each of the arguments `-track`, `-curate`, `-export`, and `-delete`. These files are generated inside a sub-directory next to the YAML file, which has the same name as the YAML file with "_log" appended.
I.e. for the YAML file `test_analysis.yaml` the produced folder structure would look like this, once track, curate and export were run:

Each of the log files is created when its run-type is performed for the first time.
## Restarting a batch-analysis from scratch
These steps are somewhat dangerous, because you can accidentally delete files if you make a mistake. Please double-check your actions when following them.
If you need to restart your analysis from scratch, you can do the following:
- Delete the log-file of the batch run: It is located next to the YAML config-file of the batch run and has the same name, but with a `.log` extension. (Take care not to accidentally delete the YAML file itself.)
- Delete all analysis folders that were generated (this is a bit dangerous):
You can find all analysis folders that were generated by the script by running this command from bash:

```
find . -type d -name ANALYSIS_NAME -prune -exec echo {} \;
```

where `ANALYSIS_NAME` is the string that you set in the YAML file with the field `analysis: 'ANALYSIS_NAME'`.
You can delete the folders after you have confirmed that the found folders are correct (and please TRIPLE CHECK this!) by running:

```
find . -type d -name ANALYSIS_NAME -prune -exec rm -rf {} \;
```
## Using Slurm
### Configuration
The script can dispatch the processing of each GL to separate Slurm jobs when the Slurm workload manager is available. This will be used for the options `-track` and `-export` when active. The setting `slurm` in the YAML file controls this behavior. It takes the following values:
- `False`: Slurm use is disabled.
- `True`: Slurm will be used with a default header that is located at `$HOME/.moma/batch_run_slurm_header.txt`. The default header file will be created, if it does not exist.
- Path: When a path to a Slurm header file is provided, the script will use that (see the example below).
This is the content of the default Slurm header file:
```
#SBATCH --mem=32G
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --qos=30min
#SBATCH --time=00:29:59
#SBATCH --open-mode=append
```
NOTE 1: Do not remove the setting `#SBATCH --open-mode=append` when editing this file. It ensures continuous log entries in the `moma.log` of each GL.

NOTE 2: MoMA runs in single-threaded mode with default settings, so your job does not need to allocate more than 1 CPU core.
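For example, a customized header for longer-running jobs could look like the sketch below; the QOS name and time limit are placeholders that depend on your cluster, and `--open-mode=append` is kept as required by NOTE 1:

```
#SBATCH --mem=32G
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --qos=6hours          # placeholder QOS; use one available on your cluster
#SBATCH --time=05:59:59       # placeholder time limit
#SBATCH --open-mode=append    # keep this line (see NOTE 1)
```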
### Output
The script will output commands to query and cancel the Slurm jobs after dispatching them:
```
FINISHED DISPATCHING SLURM JOBS.
To QUERY slurm jobs run:
squeue -u $USER | grep '[ANALYSIS_NAME]*'
To CANCEL slurm jobs run:
squeue -u $USER | grep '[ANALYSIS_NAME]*' | awk '{{print $1}}' | xargs scancel
```
where `[ANALYSIS_NAME]` will be replaced with the analysis name that was provided in the YAML file with the setting `analysis:`.
The following files will be created in the track-data folder of each GL belonging to the analysis:
- `moma.log`: The log-file is the same for MoMA runs that use Slurm or not. It is continually appended to across runs, which makes it possible to track changes to the tracking/curation/export of a GL.
- `moma_slurm_script_track.sh`: The bash-script that was used to run the Slurm job for tracking.
- `moma_slurm_script_export.sh`: The bash-script that was used to run the Slurm job for exporting.