moma batch processing - nimwegenLab/moma GitHub Wiki

Introduction

This page describes how to process large datasets with MoMA using the moma_batch_run script.

Usage

The program that facilitates the batch-processing is called moma_batch_run. You can print usage help with moma_batch_run -help:

$ moma_batch_run -help
usage: moma_batch_run [-h]
                      (-help | -version | -delete_gl_analysis yaml_config_file | -track yaml_config_file | -curate yaml_config_file | -export yaml_config_file)
                      [-l LOG] [-select SELECT] [-f] [-ff]

optional arguments:
  -h, --help            show this help message and exit
  -l LOG, --log LOG     path to the log-file for this batch-run; derived from
                        'yaml_config_file' and stored next to it, if not
                        specified
  -select SELECT, --select SELECT
                        run on selection of GLs specified in Python
                        dictionary-format; GLs must be defined in
                        'yaml_config_file'; example: "{0:{1,2}, 3:{4,5}}",
                        where 0, 3 are position indices and 1, 2, 4, 5 are GL
                        indices
  -f, --force           force the operation
  -ff, --fforce         force operation when deleting data; e.g. with option
                        '-delete-analysis'

required (mutually exclusive) arguments:
  -help
  -version              show program's version number and exit
  -delete_gl_analysis yaml_config_file
                        delete analysis files of specified GLs; WARNING: this
                        will remove ALL analysis-files for the GLs
  -track yaml_config_file
                        run batch-tracking of GLs
  -curate yaml_config_file
                        run interactive curation of GLs
  -export yaml_config_file
                        run batch-export of tracking results

The options -track, -curate and -export allow you to split the processing into a three-step workflow (see the numbered steps further below), where only step 2 requires user interaction.

This workflow hides the waiting times for:

  • running U-Net and the first optimization
  • exporting the results

About GPU usage

The U-Net model is only run when calling `moma_batch_run -track ...`. This step generates all necessary data and stores it on disk for the following steps. It is accelerated significantly when run on a GPU.

The -curate and -export steps do not benefit from a GPU, because their data is read from disk. Hence they can be run on a CPU without performance drawbacks.

To use this workflow, you must define a YAML configuration file like the one below. Note that in YAML the indentation that defines a block is critical:

file_version: 0.3.0
slurm: False
preprocessing_path: /media/micha/T7/data_michael_mell/moma_test_data/000_development/feature/20220121-fix-loading-of-curated-datasets/Lis/20211026/20211026__lis__process_positions__output__20211101-013152
default_moma_arg:
  analysis: 'test_analysis'
  p: '/home/micha/Documents/01_work/15_moma_notes/02_moma_development/feature/20220801-implement-python-batch-processing-script/mm.properties'
  tmax: 10
pos:
  0:
    gl: {
      12:,
      23:
    }
  15:
    gl: {4,
         18
        }
    moma_arg: {
        p: '/home/micha/Documents/01_work/15_moma_notes/02_moma_development/feature/20220801-implement-python-batch-processing-script/mm.properties',
        tmax: 100
      }

It defines:

  • preprocessing_path: The path to the preprocessed experiment (i.e. the folder containing the PosXX folders).
  • slurm: Controls whether the script will dispatch the processing of individual GLs to Slurm during tracking and export. See the section "Using Slurm" below for more details. Possible values: False, True, or a path to a Slurm header file.
  • default_moma_arg: A set of arguments that will be passed to MoMA for processing each GL. These values are turned into command-line parameters when MoMA is called for a given TIFF image of a GL. The parameters are:
    • analysis (required): The name of the analysis that is being performed. This name should be unique. A folder with this name will be created inside each processed GL in the directory-tree specified in preprocessing_path.
    • p (optional): Path to the mm.properties file. I recommend making a dedicated copy of your mm.properties for each batch-processing analysis. You can copy it from your home folder on Scicore here: ~/.moma/mm.properties. It makes sense to keep this dedicated copy next to your YAML file.
    • tmax (optional): This is an example of an optional argument. It would be passed as -tmax 10 to the moma command during execution.
  • pos (required): The index numbers of the positions that should be processed. For each index value under pos you then specify the GL indices that should be processed using gl.
  • gl (required): The list of GL indices that will be processed for a given position.
  • moma_arg (optional): This can optionally be added to either a pos or gl field to overwrite the settings in default_moma_arg, in case you need to process some GLs with different settings (see the sketch after this list). The restrictions for this setting are:
    • It is not allowed to overwrite the analysis setting (to ensure the same name for every analysis that is run from this script).
    • It must overwrite all other settings that are defined in default_moma_arg.
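
To illustrate how default_moma_arg and a per-GL moma_arg override combine, here is a sketch of the MoMA calls that the example configuration above would produce. The flag names -analysis and -p, as well as how the TIFF image is passed, are assumptions derived from the YAML keys (only -tmax is spelled out above):

# Pos 0, GL 12 uses default_moma_arg, so tmax is 10 (hypothetical flag names):
moma -analysis test_analysis -p <path-to-mm.properties> -tmax 10 <tiff-image-of-Pos0-GL12>

# Pos 15, GL 4 uses its moma_arg override, so tmax becomes 100:
moma -analysis test_analysis -p <path-to-mm.properties> -tmax 100 <tiff-image-of-Pos15-GL4>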

With the batch-processing correctly configured and the YAML file in place, you can then start your workflow:

  1. Run the batch tracking: moma_batch_run -track /path/to/config_file.yaml
  2. Run the batch curation: moma_batch_run -curate /path/to/config_file.yaml
  3. Run the batch export: moma_batch_run -export /path/to/config_file.yaml

If you want to re-run the analysis for a particular GL, you need to pass the argument -f to enable overwriting previous data (though a backup of the previous state will be created, this still needs more testing, so tread with care...). Furthermore, you can use the option -select to run on a subset of the GLs defined in the YAML file. So to re-curate a subset of the GLs you would do something like this:

moma_batch_run -f -curate /path/to/config_file.yaml -select "{0:{1,2}, 3:{4,5}}"  # "{0:{1,2}, 3:{4,5}}" must be positions and GLs that are defined in the YAML file
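
The same pattern works for the other run-types. For example, to re-run only the tracking of GL 12 of position 0 from the example configuration above (a single-position, single-GL selection in the same dictionary format):

moma_batch_run -f -track /path/to/config_file.yaml -select "{0:{12}}"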

Running the batch-processing will create files in each GL folder, inside a sub-folder named after the analysis setting described above.

Additional notes:

  • For the moment and until more in-depth testing has been done, it makes sense to run only subsections of the YAML file piece-by-piece. You can achieve this by commenting out the parts that you do not want to run. E.g. this only processes GL 12 of position 0:

    file_version: 0.3.0
    slurm: True
    preprocessing_path: /media/micha/T7/data_michael_mell/moma_test_data/000_development/feature/20220121-fix-loading-of-curated-datasets/Lis/20211026/20211026__lis__process_positions__output__20211101-013152
    default_moma_arg:
      analysis: 'test_analysis'
      p: '/home/micha/Documents/01_work/15_moma_notes/02_moma_development/feature/20220801-implement-python-batch-processing-script/mm.properties'
      tmax: 10
    pos:
      0:
        gl: {
          12:,
    #      23:
        }
    #  15:
    #    gl: {4,
    #         18
    #        }
    #    moma_arg: {
    #        p: '/home/micha/Documents/01_work/15_moma_notes/02_moma_development/feature/20220801-implement-python-batch-processing-script/mm.properties',
    #        tmax: 100
    #      }
    

About log-file output

When a log file is specified using the option `-l LOG, --log LOG`, all information is logged to this file.
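
For example (the log-file path is illustrative):

moma_batch_run -track /path/to/config_file.yaml -l /path/to/my_batch_run.log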

When the user does not specify a log-file, moma_batch_run will automatically generate separate log files for each of the run-types: -track, -curate, -export, and -delete_gl_analysis. These files are generated inside a sub-directory next to the YAML file, which has the same name as the YAML file with "_log" appended.
I.e. for the YAML file test_analysis.yaml, the produced folder structure would look roughly like this once track, curate and export were run (the exact log-file names are generated by the script; the placeholders below only indicate which run-type each file belongs to):
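
test_analysis.yaml
test_analysis_log/
├── <log file of the -track run>
├── <log file of the -curate run>
└── <log file of the -export run>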

Each log file is created when the corresponding run-type is performed for the first time.

Restarting a batch-analysis from scratch

These steps are somewhat dangerous, because you can accidentally delete files if you make a mistake. So: please double-check your actions when following them.

You can do the following, if you need to restart your analysis from scratch:

  1. Delete the log-file of the batch run: It is located next to the YAML config-file of the batch run and has the same name, but with a .log extension. (Take care not to accidentally delete the YAML file itself.)
  2. Delete all analysis folders that were generated (this is a bit dangerous):

First find all analysis folders that were generated by the script. You can do this from bash by running this command:

find . -type d -name ANALYSIS_NAME -prune -exec echo {} \;

where ANALYSIS_NAME is the string that you set in the YAML file with the field analysis: 'ANALYSIS_NAME'.
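
For the example configuration above, where the analysis is named test_analysis, this becomes:

find . -type d -name test_analysis -prune -exec echo {} \;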

You can delete the folders after you have confirmed that the found folders are correct (and please TRIPLE CHECK this!) by running:

find . -type d -name ANALYSIS_NAME -prune -exec rm -rf {} \;

Using Slurm

Configuration

The script can dispatch the processing of each GL to separate Slurm jobs when the Slurm workload manager is available. When active, this is used for the options -track and -export. The setting slurm in the YAML file controls this behavior. It takes the following values:

  • False: Slurm use disabled.
  • True: Slurm will be used with a default header that is located at $HOME/.moma/batch_run_slurm_header.txt. The default header file will be created, if it does not exist.
  • Path: When a path to a Slurm header file is provided, the script will use that file instead of the default.
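
For example, to use a custom header file (the path is illustrative):

slurm: /home/user/my_analysis/batch_run_slurm_header.txt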

This is the content of the default Slurm header file:

#SBATCH --mem=32G
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --qos=30min
#SBATCH --time=00:29:59
#SBATCH --open-mode=append

NOTE 1: Do not remove the setting #SBATCH --open-mode=append when editing this file. It ensures continuous log entries in the moma.log of each GL.

NOTE 2: MoMA runs in single-threaded mode with default settings, so your job does not need to allocate more than 1 CPU core.
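
If your cluster provides GPUs, you may additionally want to request one for the -track step, which benefits from GPU acceleration (see "About GPU usage" above). How a GPU is requested depends on your cluster; the following line assumes the common gres plugin and is not part of the default header:

#SBATCH --gres=gpu:1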

Output

The script will output commands to query and cancel the Slurm jobs after dispatching them:

FINISHED DISPATCHING SLURM JOBS.
To QUERY slurm jobs run:
squeue -u $USER | grep '[ANALYSIS_NAME]*'
To CANCEL slurm jobs run:
squeue -u $USER | grep '[ANALYSIS_NAME]*' | awk '{print $1}' | xargs scancel

where [ANALYSIS_NAME] will be replaced with the analysis name that was provided in the YAML file with the setting analysis:.
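
For the example analysis name test_analysis, the printed cancel command would therefore look roughly like this (the exact grep pattern may differ):

squeue -u $USER | grep 'test_analysis' | awk '{print $1}' | xargs scancel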

The following files will be created in the track-data folder of each GL belonging to the analysis:

  • moma.log: The log-file is the same for MoMA runs whether or not they use Slurm. It is continually appended to across runs, which makes it possible to track changes to the tracking/curation/export of a GL.
  • moma_slurm_script_track.sh: The bash-script that was used to run the Slurm job for tracking.
  • moma_slurm_script_export.sh: The bash-script that was used to run the Slurm job for exporting.
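
To quickly inspect these logs for all GLs of an analysis, you can again use find from bash; the path pattern below assumes that the files live below folders named test_analysis, as in the example configuration:

find . -path '*test_analysis*' -name moma.log -exec tail -n 5 {} \;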