Multi CPU parallelization using MPI and Python scripts

The Align and Merge steps of the stitching pipeline are both computation- and I/O-intensive. To speed up these steps when processing large images, we leverage the ability of TeraStitcher to perform them on a portion of the image, and we implemented ParaStitcher, a driver program written in Python that launches multiple instances of the tool in parallel.

A similar approach has been applied to the conversion task to speed up format conversion. To this aim we implemented ParaConverter, another driver program written in Python, which launches multiple instances of TeraConverter.
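Both drivers are launched through mpirun (or mpiexec on Windows). As a minimal sketch, assuming parastitcher.py forwards the usual TeraStitcher step flags (the step flag, file names and process count here are illustrative, not the authoritative command line; see the Usage section below):

```sh
# Hedged sketch: flags and file names are illustrative.
# Run the Align (pairwise displacement computation) step with 4 MPI processes:
mpirun -np 4 python parastitcher.py --displcompute \
    --projin=xml_import.xml --projout=xml_displcomp.xml
```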

ParaStitcher and ParaConverter can be used with both Python 2 and Python 3. They have been developed in collaboration with CINECA within the HBP project.

Since Align, Merge and conversion are I/O-intensive operations, both tools can approach the ideal speedup (n× faster with n CPUs) only if the I/O subsystem provides enough parallelism in both secondary storage access and data transfer. This is especially true for Merge and conversion. See the Limitations section for more details.

Prerequisites

  1. Python 2 or Python 3.

  2. A working MPI implementation. MPICH and Open MPI are two valid options. For Linux and macOS platforms we suggest Open MPI v1.10 (download and installation instructions). For Windows platforms we suggest Microsoft MPI (download and installation instructions). To check that your MPI works and which implementation is running, type mpirun --version or mpiexec --version (on Windows platforms only mpiexec is available). You should also check that you have a working mpicc compiler: type mpicc and verify that it executes.

If you are using CentOS or RedHat, there is a known bug in the preinstalled MPICH which may conflict with requirement 3 below. Remove any existing MPI installation (e.g. yum remove mpich*) and then install Open MPI with yum install mpi4py-openmpi. To enable Open MPI, run module load openmpi-x86_64 or module load mpi/openmpi-x86_64 (if you don't have the module utility, install it with yum install environment-modules).

  3. MPI for Python (mpi4py, installation instructions)

  4. Working TeraStitcher and TeraConverter executables (download) in the system PATH, i.e. you must be able to call terastitcher and teraconverter from the command line, from any directory (a quick check of these prerequisites is sketched below).
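A quick way to verify the prerequisites on a Unix-like system might look like the following (on Windows, replace mpirun with mpiexec):

```sh
# 2. MPI runtime and compiler wrapper respond
mpirun --version
mpicc --version

# 1 and 3. Python runs and mpi4py is importable, linked against a working MPI
python -c "from mpi4py import MPI; print(MPI.Get_library_version())"

# 4. TeraStitcher and TeraConverter are reachable from the PATH
terastitcher --help
teraconverter --help
```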

Download

  • [link] ParaStitcher Python driver
  • [link] ParaConverter Python driver

Usage

See Section 3 and Appendix D of the TeraTools guide for usage of the scripts and further details.

See How to Use Python, MPI and GPUs for an example of use of ParaStitcher.

See the following demo for how to use ParaConverter.

Demo of ParaConverter

  • download and unpack the example dataset at this link
  • follow the demo of TeraConverter to see how long it takes to convert the example image with one instance of TeraConverter
  • follow the demo below, choosing the number of processes appropriate for your CPU (an illustrative command line is sketched after the video)

    *(video demo: click to play)*
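An invocation along the following lines reproduces the demo. The dataset paths and format strings are illustrative assumptions (run teraconverter --help for the exact format names accepted by your installation):

```sh
# Hedged sketch: 2 parallel instances on a dual-core CPU; paths and format
# strings are illustrative. In some driver versions one MPI rank acts as a
# dispatcher, in which case -np should be set to the number of workers + 1.
mpirun -np 2 python paraconverter.py \
    -s=/path/to/example_dataset -d=/path/to/converted_output \
    --sfmt="TIFF (unstitched, 3D)" --dfmt="TIFF (tiled, 3D)"
```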

In this demo, we obtained a speedup of 1.4× and an efficiency of about 70% with 2 instances of TeraConverter running on a MacBook Pro with a dual-core CPU, reading input from and writing output to the local SSD storage.

The efficiency is calculated as the ratio between speedup and degree of parallelism; in our case it is 1.4/2 = 0.7, or 70%. For more details about the highest achievable speedup and efficiency, see the next section.
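In formulas, with T(1) the running time of a single instance and T(n) the running time with n parallel instances:

```math
S(n) = \frac{T(1)}{T(n)}, \qquad E(n) = \frac{S(n)}{n} \quad\Rightarrow\quad E(2) = \frac{1.4}{2} = 0.7
```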

Limitations of ParaConverter

Conversion of large 3D images is an I/O-bound operation. Although some computation may be needed (e.g. for compression/decompression), the whole dataset to be converted has to be first read in the original format and then written in the target format.

Therefore, in order to achieve high speedups with ParaConverter, one has to consider the following limitations:

  • the number of physical CPU cores available: the -np option should not be set to a number of processes higher than the number of available physical CPU cores. Thus, with a quad-core CPU the highest (ideal) achievable speedup is 4×, and so on.
  • the underlying I/O subsystem: it is highly desirable to have an I/O subsystem supporting parallel I/O, i.e. a hardware I/O architecture that can execute multiple physical accesses to secondary storage, and the related data transfers, in parallel. If I/O is not truly parallelized, then by Amdahl's law the achievable speedup is necessarily limited (see the worked example after this list).
  • the file format: it is highly desirable to have a file format (for both input and output) that allows accessing only a portion of the dataset in a single I/O operation, so that when multiple I/O operations are issued in parallel the same data are not accessed and transferred twice. Among the file formats supported by TeraConverter, the tiled formats (both TIFF and Vaa3D), all Vaa3D formats, and the HDF5 formats meet this requirement.
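To make the Amdahl's law bound concrete (the fraction below is a hypothetical illustration, not a measurement): if only a fraction p of the total conversion time is actually parallelized, the speedup achievable with n processes is bounded by

```math
S(n) = \frac{1}{(1 - p) + p/n} \;<\; \frac{1}{1 - p}
```

With p = 0.5, for example, the speedup can never exceed 2×, regardless of how many processes are launched.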

An important observation is that the achieved speedups should be evaluated taking into account the performance of the I/O subsystem. In other words, it may be easy to achieve high speedups when the I/O subsystem transparently supports parallel I/O operations, whereas achieving even limited speedups on systems that do not support parallel I/O, or support it only to a limited extent, can be a remarkable result, obtained through careful management of all the other resources involved in the conversion.
