# MoMA tutorial: batch processing
Note
This is part 3 of the MoMA tutorial. It explains how to analyse large datasets by running several instances of MoMA in batches (either sequentially or in parallel). Please read the tutorial introduction for a general overview of the analysis workflow.
Batch processing is most useful when performed on an HPC server with the Slurm job scheduler. Please note that using MoMA with Slurm requires setting up a Gurobi license server.
This page describes how to process large datasets with MoMA using the `moma_batch_run` script.
Different approaches have been implemented to improve performance when analyzing large datasets:
1. Running U-Net on GPU. Unfortunately, this can be cumbersome to set up on HPC (at least on ours). The improvement is more modest for prediction than for training, and this computation can easily be deferred by running U-Net on CPU in parallel and saving the output to disk (see approach 3).
2. Running sequential batches (no user interaction required, but time-consuming).
3. Running batches in parallel with Slurm.

This part of the tutorial explains approaches 2 and 3, which rely on saving the data computed by U-Net to disk and reloading it at MoMA startup when it is available.
The program that facilitates the batch processing is called `moma_batch_run`.
## `moma_batch_run` options
You can print the usage help with `moma_batch_run -help`:
```
$ moma_batch_run -help
usage: moma_batch_run [-h]
                      (-help | -version | -delete_gl_analysis yaml_config_file | -track yaml_config_file | -curate yaml_config_file | -export yaml_config_file)
                      [-l LOG] [-select SELECT] [-f] [-ff]

optional arguments:
  -h, --help            show this help message and exit
  -l LOG, --log LOG     path to the log-file for this batch-run; derived from
                        'yaml_config_file' and stored next to it, if not
                        specified
  -select SELECT, --select SELECT
                        run on selection of GLs specified in Python
                        dictionary-format; GLs must be defined in
                        'yaml_config_file'; example: "{0:{1,2}, 3:{4,5}}",
                        where 0, 3 are position indices and 1, 2, 4, 5 are GL
                        indices
  -f, --force           force the operation
  -ff, --fforce         force operation when deleting data; e.g. with option
                        '-delete-analysis'

required (mutually exclusive) arguments:
  -help
  -version              show program's version number and exit
  -delete_gl_analysis yaml_config_file
                        delete analysis files of specified GLs; WARNING: this
                        will remove ALL analysis-files for the GLs
  -track yaml_config_file
                        run batch-tracking of GLs
  -curate yaml_config_file
                        run interactive curation of GLs
  -export yaml_config_file
                        run batch-export of tracking results
```
The options `-track`, `-curate`, and `-export` allow you to split the analysis into a three-step process, where only step 2 requires user interaction.
This workflow hides the waiting times for:
- running U-Net and the first optimization
- exporting the results
Note
About GPU and performance
The U-Net model is only run when calling `moma_batch_run -track ...`. This step generates all necessary data and stores it to disk for the following steps. Only this step is accelerated significantly when run on a GPU, but in practice this is not needed when analysing growth channels in parallel with Slurm. The `-curate` and `-export` steps do not benefit from a GPU, because the data is read from disk.
To use this workflow, you must define a YAML configuration file like the one below. Note that in YAML the indentation that defines a block is critical:
```yaml
file_version: 0.3.0
slurm: False
preprocessing_path: /media/micha/T7/data_michael_mell/moma_test_data/000_development/feature/20220121-fix-loading-of-curated-datasets/Lis/20211026/20211026__lis__process_positions__output__20211101-013152
default_moma_arg:
  analysis: 'test_analysis'
  p: '/home/micha/Documents/01_work/15_moma_notes/02_moma_development/feature/20220801-implement-python-batch-processing-script/mm.properties'
  tmax: 10
pos:
  0:
    gl: {
      12:,
      23:
    }
  15:
    gl: {4,
         18
    }
    moma_arg: {
      p: '/home/micha/Documents/01_work/15_moma_notes/02_moma_development/feature/20220801-implement-python-batch-processing-script/mm.properties',
      tmax: 100
    }
```
It defines:
- `preprocessing_path`: The path to the preprocessed experiment (i.e. the folder containing the `PosXX` folders).
- `slurm`: Controls whether the script will dispatch the processing of individual GLs to Slurm during tracking and export. See the Slurm section below for more details. Possible values: [`False`, `True`, `path_slurm_head`]
- `default_moma_arg`: A list of arguments that will be passed to MoMA for processing each GL. These values are turned into command-line parameters when MoMA is called for a given TIFF-image of a GL (see the sketch after this list). The parameters are:
  - `analysis` (required): The name of the analysis that is being performed. This name should be unique. A folder with this name will be created inside each processed GL in the directory tree specified in `preprocessing_path`.
  - `p` (optional): Path to the `mm.properties` file. I recommend making a dedicated copy of your `mm.properties` for each batch-processing analysis. You can copy it from your home folder on Scicore here: `~/.moma/mm.properties`. It makes sense to keep this dedicated copy next to your YAML file.
  - `tmax` (optional): This is an example of an optional argument. It would be passed as `-tmax 10` to the `moma` command during execution.
- `pos` (required): The index numbers of the positions that should be processed. For each index value under `pos` you then specify the GL indices that should be processed using `gl`.
- `gl` (required): The list of GL indices that will be processed for a given position.
- `moma_arg` (optional): This can optionally be added to either a `pos` or a `gl` field to overwrite the settings in `default_moma_arg`, in case you need to process some GLs with different settings. The restrictions for this setting are:
  - It is not allowed to overwrite the `analysis` setting (to ensure the same name for every analysis that is run from this script).
  - It must overwrite all other settings that are defined in `default_moma_arg`.
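As a rough illustration of how `default_moma_arg` maps onto the MoMA call (this is a sketch, not the script's exact invocation; the shortened properties path and the way the GL TIFF image is passed are assumptions):

```
# hypothetical call for a single GL, built from default_moma_arg above;
# <path-to-GL-TIFF> and /path/to/mm.properties are placeholders
moma -analysis test_analysis -p /path/to/mm.properties -tmax 10 <path-to-GL-TIFF>
```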
With the batch-processing correctly configured (see step 1) and the YAML file in place, you can then start your workflow:
1. Run the batch tracking: `moma_batch_run -track /path/to/config_file.yaml`
2. Run the batch curation: `moma_batch_run -curate /path/to/config_file.yaml`
3. Run the batch export: `moma_batch_run -export /path/to/config_file.yaml`
If you want to re-run the analysis for a particular GL, you need to pass the argument `-f` to enable overwriting previous data (a backup of the previous state will be created; but this still needs more testing, so tread with care...). Furthermore, you can use the option `-select` to run on a subset of the GLs defined in the YAML file. So to re-curate a subset of the GLs you would do something like this:

```
moma_batch_run -f -curate /path/to/config_file.yaml -select "{0:{1,2}, 3:{4,5}}"  # "{0:{1,2}, 3:{4,5}}" must be positions and GLs that are defined in the YAML file
```
Running the batch processing will create a number of analysis files inside each GL folder.
Additional notes:
- For the moment, and until more in-depth testing is done, it makes sense to run only subsections of the YAML file piece by piece. You can achieve this by commenting out the parts that you do not want to run. E.g. this only processes GL 12 of position 0:

```yaml
file_version: 0.3.0
slurm: True
preprocessing_path: /media/micha/T7/data_michael_mell/moma_test_data/000_development/feature/20220121-fix-loading-of-curated-datasets/Lis/20211026/20211026__lis__process_positions__output__20211101-013152
default_moma_arg:
  analysis: 'test_analysis'
  p: '/home/micha/Documents/01_work/15_moma_notes/02_moma_development/feature/20220801-implement-python-batch-processing-script/mm.properties'
  tmax: 10
pos:
  0:
    gl: {
      12:,
#      23:
    }
#  15:
#    gl: {4,
#         18
#    }
#    moma_arg: {
#      p: '/home/micha/Documents/01_work/15_moma_notes/02_moma_development/feature/20220801-implement-python-batch-processing-script/mm.properties',
#      tmax: 100
#    }
```
When a log file is specified using the option `-l LOG, --log LOG`, all information is logged to this file.
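For example (both paths are placeholders):

```
# run the tracking step and write the log to an explicitly chosen file
moma_batch_run -track /path/to/config_file.yaml -l /path/to/my_batch_run.log
```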
When the user does not specify a log-file, `moma_batch_run` will automatically generate separate log files for each of the arguments `-track`, `-curate`, `-export`, and `-delete`. These files are generated inside a sub-directory next to the YAML file, which has the same name as the YAML file with "_log" appended. I.e. for the YAML file `test_analysis.yaml` the produced folder structure would look like this, once track, curate and export were run:
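A minimal sketch of that layout (the directory name follows the `_log` rule above; the individual log-file names shown here are only an assumption):

```
test_analysis.yaml
test_analysis_log/
├── track.log
├── curate.log
└── export.log
```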
Each log file is created the first time the corresponding run-type is performed.
These steps are somewhat dangerous, because you can accidentally delete files if you make a mistake. Please double-check your actions when following them.

If you need to restart your analysis from scratch, do the following:
- Delete the log-file of the batch run: it is located next to the YAML config-file of the batch run and has the same name, but with the `.log` extension (take care not to accidentally delete the YAML file itself).
- Delete all analysis folders that were generated (this is a bit dangerous):
To find all analysis folders that were generated by the script, you can execute the following from bash:

```
find . -type d -name ANALYSIS_NAME -prune -exec echo {} \;
```

where `ANALYSIS_NAME` is the string that you set in the YAML file with the field `analysis: 'ANALYSIS_NAME'`.
You can delete the folders, after you have confirmed that the found folders are correct (and please TRIPLE CHECK this!), by running:

```
find . -type d -name ANALYSIS_NAME -prune -exec rm -rf {} \;
```
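Alternatively, for GLs that are defined in the YAML file, the script's own `-delete_gl_analysis` option (see the usage output above) can remove the analysis files; a sketch, assuming you only want to delete the analysis of GL 12 of position 0 and have double-checked the selection (`-ff` is the force flag for deletions documented in the usage above):

```
moma_batch_run -delete_gl_analysis /path/to/config_file.yaml -ff -select "{0:{12}}"
```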
The script can dispatch the processing of each GL to separate Slurm jobs when the Slurm workload manager is available. When active, this will be used for the options `-track` and `-export`. The setting `slurm` in the YAML file controls this behavior. It takes the following values:
- `False`: Slurm use is disabled.
- `True`: Slurm will be used with a default header that is located at `$HOME/.moma/batch_run_slurm_header.txt`. The default header file will be created if it does not exist.
- Path: When a path to a Slurm header file is provided, the script will use that header.
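For example, the corresponding excerpt of the batch-run YAML file could look like this (the path is a placeholder):

```yaml
# use a custom Slurm header instead of the default one in ~/.moma/
slurm: /path/to/custom_slurm_header.txt
```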
This is the content of the default Slurm header file:
```
#SBATCH --mem=32G
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --qos=30min
#SBATCH --time=00:29:59
#SBATCH --open-mode=append
```
Warning
Do not remove the setting `#SBATCH --open-mode=append` when editing this file. It ensures continuous log entries in the `moma.log` of each GL.
Note
MoMA runs in single-threaded mode with default settings, so your job does not need to allocate more than 1 CPU core.
The script will output commands to query and cancel the Slurm jobs after dispatching them:
```
FINISHED DISPATCHING SLURM JOBS.
To QUERY slurm jobs run:
squeue -u $USER | grep '[ANALYSIS_NAME]*'
To CANCEL slurm jobs run:
squeue -u $USER | grep '[ANALYSIS_NAME]*' | awk '{print $1}' | xargs scancel
```
where `[ANALYSIS_NAME]` will be replaced with the analysis name that was provided in the YAML file with the setting `analysis:`.
The following files will be created in the track-data folder of each GL belonging to the analysis:
- `moma.log`: The log-file is the same for MoMA runs with and without Slurm. It is continually appended to across runs, which makes it possible to track changes to the tracking/curation/export of a GL.
- `moma_slurm_script_track.sh`: The bash-script that was used to run the Slurm job for tracking.
- `moma_slurm_script_export.sh`: The bash-script that was used to run the Slurm job for exporting.