Troubleshooting - bregord/emcee-on-calcul-quebec GitHub Wiki

Below are some general problems one may run into, and ways to go about solving them. The error messages can be found by inspecting the JOBNAME.eNNNNN file written by Torque after your job has finished executing.

##Problems Running Jobs

Often, the first indication that something has gone wrong will be the automatically generated email you receive when your job completes (assuming you used the -M flag in your submission script). Your job will often use all of its allocated time, whether or not it actually completed successfully. This is part of why tests are so highly recommended.

###Out of memory error

Assuming everything went correctly in your setup and testing, this will likely be your most common issue. Essentially, running your script on a cluster means handing it off to a number of different nodes. Each node runs multiple instances of your script, each in its own process. Once every process has terminated, the results are collected and your final result is calculated. Each node has a certain memory allocation, and each process on that node gets a fraction of that memory. For example, 24 GB shared among 8 processes leaves 3 GB per process. For scripts involving a large number of parameters, or a large number of walkers and dimensions, this memory is used up rather fast, leading to the Out of Memory error.
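The arithmetic above can be sketched in a few lines of Python. This is only a rough estimate: it assumes a node's memory is split evenly between processes and that the chain is held in memory as 8-byte floats, and it ignores whatever your data and model need. The function names and the run sizes are hypothetical, chosen for illustration.

```python
def memory_per_process_gb(node_memory_gb, processes_per_node):
    """Memory available to each process if a node's memory is shared evenly."""
    return node_memory_gb / processes_per_node


def chain_size_gb(nwalkers, ndim, nsteps, bytes_per_value=8):
    """Approximate size of a stored chain, in GB (10**9 bytes).

    Assumes one 8-byte float per walker, per dimension, per step.
    """
    return nwalkers * ndim * nsteps * bytes_per_value / 1e9


per_process = memory_per_process_gb(24, 8)  # the example from the text
chain = chain_size_gb(nwalkers=500, ndim=20, nsteps=100_000)  # hypothetical run

print(f"{per_process:.1f} GB per process, chain needs about {chain:.1f} GB")
if chain > per_process:
    print("The chain alone would not fit in one process's share of memory.")
```

If the estimate is already close to the per-process share, the job is a likely candidate for an out of memory error before it is even submitted.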

There are several ways to solve this problem:

  • Reduce the number of processes per node.

  • Allocate more memory, using the -pmem argument on guillimin or "mem=x" as part of the -l argument on other clusters.

  • If a combination of the two options above does not work, some systems have nodes with more memory. Request the use of those. The procedure for doing so is outlined on the Calcul Québec wiki.
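As a sketch of the first two options, the helper below builds the resource lines for a submission script. It assumes the usual Torque `#PBS -l` resource syntax (`nodes=N:ppn=M`, `pmem=`, `mem=`); the function name is hypothetical, and the exact form your cluster expects should be checked against its own documentation.

```python
def pbs_resource_lines(nodes, ppn, pmem_gb=None, mem_gb=None):
    """Return '#PBS -l' lines requesting nodes, processes per node and memory.

    pmem_gb is memory per process, mem_gb is total memory for the job.
    Both are optional; which one applies depends on the cluster.
    """
    lines = [f"#PBS -l nodes={nodes}:ppn={ppn}"]
    if pmem_gb is not None:
        lines.append(f"#PBS -l pmem={pmem_gb}gb")
    if mem_gb is not None:
        lines.append(f"#PBS -l mem={mem_gb}gb")
    return lines


# Halving the processes per node from 8 to 4 on a 24 GB node
# leaves 6 GB for each process instead of 3 GB.
for line in pbs_resource_lines(nodes=2, ppn=4, pmem_gb=6):
    print(line)
```

Reducing the processes per node trades speed for memory: fewer processes run at once, but each has more room.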

###Exceeded Execution Time

Increase the amount of time in the -l walltime argument.
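One way to pick a new walltime is to scale up the time taken by a short test run. The sketch below assumes run time grows linearly with the number of steps and pads the result with a safety factor; the function names and the factor of 1.5 are hypothetical choices, not a recommendation from the cluster documentation.

```python
import math


def format_walltime(seconds):
    """Format a duration in seconds as HH:MM:SS for '-l walltime='."""
    seconds = int(math.ceil(seconds))
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def estimate_walltime(test_seconds, test_steps, full_steps, safety=1.5):
    """Scale a timed test run up to the full run, padded by a safety factor."""
    return format_walltime(test_seconds * full_steps * safety / test_steps)


# Hypothetical numbers: a 100-step test took 120 s, the full run has 50,000 steps.
print("#PBS -l walltime=" + estimate_walltime(120, 100, 50_000))
```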

##MPI Errors

###Could not find mpi.h

This problem results from mpi4py being unable to find the MPI library it requires. Often, the cause is an incompatible version of openmpi. Try loading a different version from the modules. In my experience, downgrading is often more successful than upgrading. In addition, make sure that the version of openmpi you select is compatible with your choice of compiler; otherwise you will need to unload your current compiler and load one that is compatible with that MPI version.
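Before installing mpi4py, it can help to check which MPI compiler wrapper the loaded modules actually provide. The sketch below looks for `mpicc` on the PATH and asks it which underlying compiler it invokes. The `--showme:command` option is specific to Open MPI's wrapper; with another MPI implementation the call is assumed to fail, in which case the compiler is reported as unknown. The function name is hypothetical.

```python
import shutil
import subprocess


def describe_mpi_wrapper(wrapper="mpicc"):
    """Report where the MPI compiler wrapper lives and which compiler it calls."""
    path = shutil.which(wrapper)
    info = {"wrapper": path, "underlying_compiler": None}
    if path is None:
        return info  # no wrapper on PATH: no MPI module is loaded
    try:
        # Open MPI specific: prints the real compiler command, e.g. "gcc".
        result = subprocess.run(
            [path, "--showme:command"],
            capture_output=True, text=True, timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return info
    if result.returncode == 0:
        info["underlying_compiler"] = result.stdout.strip() or None
    return info


info = describe_mpi_wrapper()
if info["wrapper"] is None:
    print("mpicc not found on PATH; load an openmpi module first.")
else:
    print(f"mpicc: {info['wrapper']}")
    print(f"underlying compiler: {info['underlying_compiler'] or 'unknown'}")
```

If the underlying compiler reported here does not match the compiler module you have loaded, that mismatch is a good first suspect.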

##Module Errors

Any problems running your job that are not reported by the python interpreter usually result from an improperly configured environment. Usually this means incompatible modules. In order to run emcee, there are several essential requirements:

  • A version of python compatible with your emcee program.

In addition to your python module, you will need to be able to install the numpy, scipy, mpi4py, and emcee packages in a virtual environment as described above. Numpy and scipy can often be installed as modules, but if not they will need to be installed with pip after ensuring the proper MKL (Math Kernel Library) is installed. This should also be available as a module.

  • A compiler.

  • An implementation of openmpi.

Finding the right compiler and implementation will require trial and error, as well as the investigation of version releases on the OpenMPI website and the date of your particular compiler's release.
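While experimenting with module combinations, a quick check of what the current environment actually contains can save a failed job. The sketch below reports the Python version and the installed versions of the four packages named above, using the standard library's `importlib.metadata` (Python 3.8 or later). It assumes the packages carry ordinary installation metadata; a package provided by a cluster module without metadata would be reported as missing even if it imports. The function name is hypothetical.

```python
import sys
from importlib import metadata

REQUIRED = ("numpy", "scipy", "mpi4py", "emcee")


def installed_versions(packages=REQUIRED):
    """Map each package name to its installed version, or None if not found."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


print("python", sys.version.split()[0])
for name, version in installed_versions().items():
    print(name, version if version is not None else "NOT INSTALLED")
```

Running this inside the virtual environment, with the same modules loaded as in the submission script, shows the environment the job will actually see.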