Specific instructions for ChaNGa builds on various parallel architectures are documented below. Preferred Charm++ build options are noted. If the machine corresponds to a NSF XSEDE or other national facility, the particular machine will be noted.

Table of Contents x86_64 Linux Workstation Cluster SMP (Symmetric Multi-Processing) Multicore Linux Apple Mac Cray XE6/XK6 (Blue Waters at NCSA, fish at ARSC) gni-crayxe gni-crayxe-cuda Infiniband Linux cluster (Pleiades at NAS, Expanse at SDSC, Bridges-2 at PSC) PSC Bridges-2 SDSC Expanse TACC Frontera NASA Pleiades Compute Canada Niagara mpi-linux-x86_64 verbs-linux-x86_64 Parallel start without MPI GPU cluster: verbs-linux-x86_64 cuda TACC Vista Omni-Path/Intel cluster (Stampede 3 at TACC) UW clusters hyak hyak GPU mox (hyak 2) klone (hyak 3) Retired Machines Knights Landing cluster (Stampede 2 at TACC) Skylake cluster (Stampede 2 at TACC) IBM Bluegene/L (frost at NCAR) SDSC comet IBM Power5 (BluePrint at NCSA) lapi Issues SGI Altix (Cobalt at NCSA) mpi-linux-ia64-mpt-icc IBM SP4 (Copper at NCSA) mpi-sp Cray XT3 (Bigben at PSC) mpi-crayxt3-gcc4 mpi-crayxt3 Cray XT5 (kraken at NICS) mpi-crayxt GPGPU Cluster (forge at NCSA)

x86_64 Linux Workstation Cluster

In this configuration, Charm uses the UDP protocol for communicating over the network.

Most recent machines are 64 bit. In this case, build charm with

./build ChaNGa netlrts-linux-x86_64 --with-production

Then configure and make ChaNGa. For 32 bit machines, omit the 'x86_64' in the charm build command.

SMP (Symmetric Multi-Processing)

Almost all modern clusters are built from "nodes", each with many compute cores. Within a node, the cores can directly access each other's memory, while communication between nodes is done via a message passing protocol (e.g., MPI). On these architectures Charm++ and ChaNGa can be compiled in such a way to take advantage of the shared memory within a node to reduce the amount of communication. This is done at the time charm is built by adding smp to the build command line, e.g.,

 ./build ChaNGa netlrts-linux-x86_64 smp --with-production

Then "configure" and "make" ChaNGa the same as a non-SMP build.

Running an SMP build presents a lot of options that will impact performance. On a given physical node, one can run one or more ChaNGa processes, where each process has one thread for communication and many worker threads. Furthermore, each thread can be tied to a particular core, which can impact performance based on the core and memory layout of the node. For example, a node may have two CPU chips in two separate sockets on the motherboard, in which case, better performance may be gotten by having two ChaNGa processes, each running its communication and worker threads on one of the CPU chips. To be even more specific, consider a two socket Intel Ivy Bridge node with 12 cores on each chip, the command line to run on 4 such nodes would be:

charmrun +p 88 ChaNGa ++ppn 11 +setcpuaffinity +commap 0,12 +pemap 1-11,13-24 sample.param

In this case, 8 processes would be created (88/11), with 2 processes on each node. The first processes on each node will have its communication thread on core "0", and 11 worker threads on cores 1 through 11 (all on the first chip), while the second processes will have its communication thread on core "12", and 11 worker threads on cores 13 through 23 (all on the second chip).

When using mpiexec or mpirun, the command line layout is slightly different. The above example would be:

mpiexec -np 8 ChaNGa ++ppn 11 +setcpuaffinity +commap 0,12 +pemap 1-11,13-24 sample.param

Again, 8 processes (or MPI ranks, or "virtual nodes") are created, two on each of 4 hardware nodes, with 11 worker threads per process.

Some architectures can have multiple virtual cores per physical core, referred to as hyperthreading. ChaNGa generally does not benefit from hyperthreading.

See the Charm++ SMP documentation for other ways to specify layout of processes and threads.

Multicore Linux

For a single multicore machine, ChaNGa can be built to utilize all the cores. In this case build charm with

./build ChaNGa multicore-linux-x86_64

./build ChaNGa multicore-linux-i386

depending on whether you are running 64 bit or 32 bit linux. Then configure and make ChaNGa.

Apple Mac

The Mac OS is subtly different than linux. First be sure you have the development tools installed with xcode-select --install. Two additional packages need to be installed with homebrew via the command brew install autoconf automake. The charm system can then be built with:

./build ChaNGa netlrts-darwin-x86_64 smp --with-production -j4

For older versions of charm (before March, 2019), you may have to add -stdlib=libc++ at the end of the above command. Then you can cd into the ChaNGa source directory and then "configure" and "make". For older versions of ChaNGa (before November, 2018), "sinks.cpp" will have an undeclared identifier "MAXPATHLEN". Either upgrade ChaNGa, or change "MAXPATHLEN" to "PATH_MAX" in sinks.cpp.

Cray XE6/XK6 (Blue Waters at NCSA, fish at ARSC)

The Cray XC series is very similar since it uses the same GNI interface to the network. For XC series, replace gni-crayxe with gni-crayxc below.

gni-crayxe

Charm runs natively on the Gemini interconnect used by the XE6/XK7 series. With 32 cores/node, the "SMP" version of charm offers advantages. Running out of memory in the GNI layer can be a problem. This is fixed with the hugepage option below.

Switch to the GNU programming environment: module swap PrgEnv-cray PrgEnv-gnu
Load the cray resiliency communication agent (RCA) library with module load rca
Load the hugepage module with module load craype-hugepages2M
Build charm with ./build ChaNGa gni-crayxe hugepages smp -j4 --with-production
Configure and make ChaNGa

When running, set the following environment variables in the job script, (assuming BASH)

export HUGETLB_DEFAULT_PAGE_SIZE=2M
export HUGETLB_MORECORE=no            # this line may give problems on small core counts

Note that on Cray architectures, one usually uses aprun, not charmrun to start parallel programs. A typical aprun command would look like:

aprun -n 8 -N 1 -d 32 ./ChaNGa -p 4096 +ppn 31 +setcpuaffinity +pemap 1-31 +commap 0 dwf1.2048.param

where -n 8 starts 8 processes, -N 1 puts 1 process on each physical node, -d 32 reserves 32 threads per process, -p 4096 divides the simulation into 4096 domains, +ppn 31 request 31 worker threads per process and +setcpuaffinity +pemap 1-31 +commap 0 explicitly maps the threads to CPU cores with the worker threads going on cores 1 to 31, and the communication thread going on core 0.

gni-crayxe-cuda

GPU support is in development.

In addition to the above:

Load the CUDA development environment with module load cudatoolkit
Use the CUDA_DIR environment variable to point at this environment: export CUDA_DIR=$CRAY_CUDATOOLKIT_DIR
Build charm with ./build ChaNGa gni-crayxe cuda hugepages -j4 --with-production

The charm build can fail with

CrayNid.c: In function 'getXTNodeID': CrayNid.c:32:2: error: #error "Cannot get network topology information on a Cray build. 
Swap current module xt-mpt with xt-mpt/5.0.0 or higher and xt-asyncpe with xt-asyncpe/4.0 or higher and then rebuild

This can be fixed by setting the following environment variable before running the build command:

export PE_PKGCONFIG_LIBS=cray-pmi:cray-ugni:$PE_PKGCONFIG_LIBS

Configure ChaNGa with ./configure --with-cuda=$CUDA_DIR, then make.

If you run with more than one process per node, set the "CRAY_CUDA_MPS" environment variable to "1" to enable the CUDA multi-process service which allows more than one process to talk to the GPU.

As of v3.3, the GPU build of ChaNGa can run in SMP mode (one process, multiple threads). To build for this mode, replace the charm build command above with ./build ChaNGa gni-crayxe cuda hugepages smp -j4 --with-production before compiling ChaNGa.

Infiniband Linux cluster (Pleiades at NAS, Expanse at SDSC, Bridges-2 at PSC)

On an infiniband cluster there are two options for building ChaNGa. The most straightforward option is using MPI (the mpi-linux-x86_64 build below), but occasionally the verbs-linux-x86_64 build may work better.

PSC Bridges-2

Bridges 2 has 128 CPU cores per node. ChaNGa running on this many cores generates a lot of messages and cause problems with the MPI implementations. As of March 2021, the charm verbs build seems to be the only machine layer that works and scales well, and that with a more recent version of charm++. The procedure at the moment is:

Checkout version v7.0.0 of charm
load the "mvapich2/2.3.5-gcc8.3.1" and the "python/2.7" modules. The "mpi" module is just to get at an "mpiexec" for the sbatch submission.

build charm with

 ./buildold ChaNGa verbs-linux-x86_64 smp -j8 --with-production

build ChaNGa with the usual configure and make.
The run line in your sbatch script should look like (e.g. for running on 4 nodes):

./charmrun.smp +p 504 ++mpiexec ./ChaNGa.smp ++ppn 63 +setcpuaffinity +commap 0,64 +pemap 1-63,65-127 +IBVBlockAllocRatio 1024 +IBVBlockThreshold 11 XXX.param

This runs 2 SMP processes on each node, one per socket. The IBVBlock flags allocate bigger chunks of pinned memory for the Infiniband card. IBVBlockAllocRatio specifies how many messages in a single allocation. IBVBlockThreshold is related to the message size threshold below which messages are allocated in blocks, where the byte threshold is 128*2^IBVBlockThreshold bytes. The current (7/24/24) default for these numbers are +IBVBlockAllocRatio 128 and +IBVBlockThreshold 9.

If you are having trouble running with mpiexec, you can generate a nodelist and use ssh for the spinup. An example SLURM run script would look like this:

cd $SLURM_SUBMIT_DIR

# Generate node list
echo "group main ++shell /usr/bin/ssh ++cpus $SLURM_CPUS_ON_NODE" > nodelist
for i in `scontrol show hostnames $SLURM_NODELIST`
do
        echo host $i >> nodelist
done

./charmrun.smp +p 504 ./ChaNGa.smp +nodelist ./nodelist ++ppn 63 +setcpuaffinity +commap 0,64 +pemap 1-63,65-127 +IBVBlockAllocRatio 1024 +IBVBlockThreshold 11 XXX.param

SDSC Expanse

This machine is very similar to Bridges-2: 128 cores per node. See the Bridges-2 instructions.

Update 7/12/24: The MVAPICH2 implementation on Expanse has some corruption issue that I have not resolved. I currently recommend using the verbs build as described in the Bridges-2 instructions, and ignore the following.

While the OpenMPI implementation on this machine tends to fail, MVAPICH2 is also installed, and that implementation works well. The modules that need to be loaded in the following order: cpu/0.17.3b gcc/10.2.0/npcyll4 mvapich2/2.3.7 slurm. To run with mpi (rather than verbs) a typical run command with 2 SMP processes per node would be the following (follow Bridges-2 directions above if running with verbs):

#SBATCH --nodes=8
#SBATCH --ntasks-per-node=2
...
srun --mpi=pmi2 -n 16 ./ChaNGa.smp ++ppn 63 +setcpuaffinity +commap 0,64 +pemap 1-63,65-127 XXX.param

See the SDSC Expanse documentation at https://www.sdsc.edu/support/user_guides/expanse.html#running for more information.

TACC Frontera

This is another Infiniband machine with lots of cores per node, in this case 56. See the instructions for Bridges-2, but now the run command would be

./charmrun.smp +p 216 ++mpiexec ./ChaNGa.smp ++ppn 27 +setcpuaffinity +commap 0,28 +pemap 1-27,29-55 XXX.param

to run 2 SMP (one on each socket) per node on 4 nodes.

NASA Pleiades

With the Fall 2021 "upgrade" of the operating system on Pleiades, the default MPI implementation (mpi-hpe/mpt) no longer works. However, the Intel MPI implementation is installed, and it seems to work. Update 5/2/22: The Intel MPI implementation is also based on UCX, which has known problems with large numbers of messages. While the following works for most jobs, more network intense jobs (e.g. toward the end of a zoom simulation) will fail with UCX errors. In that case, use the verbs build as described under PSC Bridges-2 above.

To use Intel MPI, load the mpi-intel module, then follow the directions for mpi-linux-x86_64 below. For jobs on larger node counts, the smp can be used (see above). For SMP with MPI, care must be taken with the PBS options. For example, to run ChaNGa on 24 "Ivy" nodes where each node has two Intel Ivybridge sockets with 10 cores each, one uses the PBS line

#PBS -lselect=24:ncpus=20:mpiprocs=2:model=ivy

The mpiprocs option is saying to run only two MPI processes per node. The corresponding command to start ChaNGa is:

mpiexec $PWD/ChaNGa ++ppn 9 +setcpuaffinity +commap 0,10 +pemap 1-9,11-19 XXX.param

which puts one MPI process with 9 worker threads and 1 communication thread on each socket.

Compute Canada Niagara

The intel compilers and MPI distribution seems to work best, gcc and openmpi can run into issues with hanging.

module load intel intelmpi autotools

Charm can then be built with ./build ChaNGa mpi-linux-x86_64 --with-production mpicxx

A basic submission script looks like this:

#!/bin/bash
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=40
#SBATCH --time=4:00:00
#SBATCH --job-name=mychangajob
#SBATCH --output=%x.%j_%A.out
 
cd $SLURM_SUBMIT_DIR

module load intel intelmpi

charmrun ++mpiexec ++remote-shell mpirun +p 320 ChaNGa +balancer MultistepLB_notopo BLAH.param

which can then be run with sbatch

mpi-linux-x86_64

Charm should be built with ./build ChaNGa mpi-linux-x86_64 --with-production

Then in the changa directory type

./configure; make

to create the charmrun and ChaNGa executables. Note again that in this case the charmrun executable is just a wrapper around MPI startup commands. Instead of using charmrun, ChaNGa can be started like any other MPI program, e.g., with mpiexec or mpirun.

verbs-linux-x86_64

Charm has a native infiniband driver that is more efficient than using MPI. To use it first build charm with

./build ChaNGa verbs-linux-x86_64 --with-production

Then in the changa directory type

./configure; make

to create the charmrun and ChaNGa executables.

Even if mpi is not being used, the MPI infrastructure is useful for starting ChaNGa. Charmrun has a ++mpiexec option that takes advantage of this infrastructure. For example,

charmrun +p 144 ++mpiexec ChaNGa -wall 600 + balancer MultistepLB_notopo simulation.param

Charmrun assumes that "mpiexec" is used, but stampede uses "ibrun". Therefore a small shell script is needed to overcome this difference. Call it "mympiexec" and it will contain:

#!/bin/csh
shift; shift; exec ibrun $*

Then call charmrun with (e.g.): charmrun +p 36 ++mpiexec ++remote-shell mympiexec ChaNGa -wall 60 +balancer OrbLB Disk_Collapse_1e6.param

Parallel start without MPI

If the MPI runtime is not available, or you don't wish to use it, charmrun needs a nodelist file to inform it which nodes are available to run on. An example is:

group main ++shell /usr/bin/ssh
host maia0
host maia1

Call this file nodelist and have it in the directory from which you run ChaNGa. The node names can be found from the queueing system. For example, in the PBS system, one can use the short script:

#!/bin/bash
echo 'group main ++shell /usr/bin/ssh' > nodelist
for i in `cat $PBS_NODEFILE` ; do
   echo host $i >> nodelist
done

to create nodelist. ChaNGa can then be started with charmrun +p 2 ChaNGa +nodelist ./nodelist simulation.param

For SLURM systems the above nodelist generation script can be written as:

#!/bin/bash
echo 'group main ++shell /usr/bin/ssh' > nodelist
for i in `scontrol show hostnames` ; do
  echo host $i >> nodelist
done

GPU cluster: verbs-linux-x86_64 cuda

GPU support is still experimental. Also, on many machines the CUDA device can only be accessed by one process on a node. Hence charm needs to be built with the SMP option so that all cores can use the GPU, and only one charm process is running per node (mpiprocs=1 in the PBS -lselect option, and +p N ++ppn M options such that N/M equals the number of GPU nodes used.)

For any of the machines below that have GPUs more advanced than Kepler, special compile flags need to be passed to the NVidia compiler that depend on the machine architecture. We don't have an automatic way of detecting the GPU architecture (particularly when you are compiling on a different host), so an appropriate cuda-level needs to be added to the ChaNGa configure command. For Pascal GPUs (P100), add --with-cuda-level=60 to the configure line. For Volta GPUs (V100) add --with-cuda-level=70. The GPU code can be compiled to perform part of the tree-walk on the GPU with --enable-gpu-local-tree-walk. Note that the gravity algorithm is slightly different (more like the traditional Barnes-Hut) with this option.

To build for CUDA on NAS Pleiades:

As of June 2017, the build steps need to be done on one of the GPU nodes. Use "qsub -I -q gpu_k40" to get an interactive session.
Load the CUDA development environment with module load cuda
If you haven't already, load a modern C compiler with module load gcc. (Intel should also work, but gcc also needs to be loaded for its libraries: module load gcc; module load comp-intel)
Set the CUDATOOLKIT_HOME environment variable to point at the development environment. Use which nvcc to find the directory, then, e.g., set the environment variable with setenv CUDATOOLKIT_HOME /nasa/cuda/8.0.
In the charm source directory, build charm with ./build ChaNGa verbs-linux-x86_64 cuda smp -j4 --with-production. Note that built with this configuration, charm++ can only be used to compile ChaNGa on nodes with the full CUDA development environment.
In the changa source directory, configure ChaNGa with ./configure --with-cuda=$CUDATOOLKIT_HOME.
Compile ChaNGa with "make".

Running ChaNGa with this build is the same as a CPU only verbs build see instructions there for an example.

To build for CUDA on Maverick or Stampede:

Load the CUDA development environment with module load cuda
Use the CUDA_DIR environment variable to point at this environment: export CUDA_DIR=$TACC_CUDA_DIR
Build charm with ./build ChaNGa verbs-linux-x86_64 cuda smp -j4 --with-production
Configure ChaNGa with ./configure --with-cuda=$CUDA_DIR
Compile ChaNGa with "make"

To build for CUDA on SDSC comet:

Load the CUDA development environment with module load cuda
Build charm with ./build ChaNGa verbs-linux-x86_64 cuda smp -j4 --with-production
Configure ChaNGa with ./configure --with-cuda
Compile ChaNGa with "make"

To build for CUDA on SDSC Expanse:

You must compile in an interactive environment on a GPU node. Use

srun --partition=gpu-debug --pty --account=<<project>> --nt</asks-per-node=10>\
    --nodes=1 --mem=96G --gpus=1 -t 00:30:00 --wait=0 --export=ALL /bin/bash

to get an interactive session.

Get the modules reset: module purge; module restore; module load cuda. Do NOT load the gcc module.
Build charm with ./build ChaNGa verbs-linux-x86_64 cuda smp -j4 --with-production
Configure ChaNGa with ./configure --with-cuda --with-cuda-level=70. The 70 level is for the V100 GPU.
Compile ChaNGa with make.

TACC Vista

This is an infiniband machine with NVidia "grace-hopper" nodes. "grace" is a 72 core ARM CPU chip, and "hopper" is a CUDA level 9.0 GPU. To compile the verbs layer in charm:

Checkout the "main" branch that has the ARM8 verbs port (added October, 2024).
Build charm with ./build ChaNGa verbs-linux-arm8 cuda smp -j8 --with-production
Configure ChaNGa with ./configure --with-cuda --with-cuda-level=90 --enable-gpu-local-tree-walk CPPFLAGS="-march=native"
Compile ChaNGa with make

ChaNGa can be run with the charmrun.smp script. In the job file use the following shell commands:

for n in `echo $SLURM_NODELIST | scontrol show hostnames`; do
    echo "host $n cpus 144" >> nodelist.$SLURM_JOBID
done
./charmrun.smp +p $wproc ++nodelist nodelist.$SLURM_JOBID ./ChaNGa.smp <SMP flags> xxx.param

where wproc is the number of worker processors and the "SMP flags" are the flags needed for thread number and placement described above.

Vista jobs can be run on the UCX layer (but see caveats about UCX above.) In this case, build charm with /build ChaNGa ucx-linux-arm8 cuda smp -j8 --with-production. To run with UCX, use the charmrun.smp utility, similar to verbs, but there is no need to construct a nodelist or use the ++nodelist option.

Omni-Path/Intel cluster (Stampede 3 at TACC)

The interconnect on the Stampede 3 cluster uses the Intel OmniPath architecture which does not work well with the verbs API. Use an MPI build instead:

./build ChaNGa mpi-linux-x86_64 smp mpicxx --with-production --march=skylake

configure ChaNGa with the --enable-avx flag.

After configuring ChaNGa, edit the Makefile to add -march=skylake flag to the opt_flag line. The -march=skylake flag allows ChaNGa to take advantage of the AVX2 vector instructions to calculate gravity. The Skylakes (SKX) are the lowest common denominator on the Stampede 3 system. If you are going to be running exclusively on the Icelake (ICX) or Sapphire Rapids (SPR) nodes, the -march flag can be changed accordingly.

All the compute nodes have two sockets on each node, so having at least 2 SMP processes per node helps with performance. Be aware that the mapping betweens cores and sockets is different than most other machines: all the even numbered cores are on one socket and all the odd numbered cores are on the other. For good performance all the threads of an SMP process should be on a single core, so a typical run command (on a Skylake (spx) partition) would be:

ibrun ./ChaNGa ++ppn 23 +setcpuaffinity +commap 0,1 +pemap 2-46:2,3-47:2 xxx.param

In this case there will be 2 mpi tasks per physical Skylake node, e.g. the sbatch command will be something like sbatch -n 10 -N 5 xxx.job, where "10" is the total number of tasks, and 5 is the number of nodes, each of which is running 2 tasks. Within a given node, the first task will use core 0 to communicate, and cores 2, 4, 6, ..., 46 as workers, while the second task will use core 1 to communicate, and cores 3, 5, 7, ..., 47 as workers. If your simulation has a lot of communication, you might get better performance with more commication threads, which means more tasks. To run 4 mpi tasks per physical Skylake node, the sbatch command will be something like: sbatch -n 20 -N 5 xxx.job, so there are now 20 total tasks on 5 nodes, and each node will run 4 tasks. To get an efficient thread layout, the run command would be:

ibrun ./ChaNGa ++ppn 11 +setcpuaffinity +commap 0,24,1,25 +pemap 2-22:2,26-46:2,3-23:2,27-47:2 xxx.param

UW clusters

hyak

The original hyak nodes (ITK) have out-of-date compilers which are unable to compile recent versions of Charm++ and ChaNGa. If you must use the old hyak nodes, use charm version 6.8.0 or eariler, and ChaNGa version 3.3 or earlier. However, it is recommend that you move to the new MOX nodes (see below.)

hyak GPU

updated 04/14/17

There are some GPU enabled nodes on hyak. Currently the vsm group has 1 GPU node.

Steps:

Download charm++ from github:

 git clone https://github.com/UIUC-PPL/charm

As of writing, this works with the current development version of charm++ (the default charm branch). Request a GPU node by submitting an interactive job to the GPU queue, eg:

 qsub -IV -q gpu -l walltime=2:00:00

Load CUDA, find the cuda toolkit directory, and choose the directory corresponding to the cuda version loaded.

 module load cuda_7
 ls -d /sw/cuda*
 export CUDATOOLKIT_HOME=/sw/cudatoolkit-7.0.28

This will point charm to the right directory. cd into the charm++ directory and build it:

 ./build ChaNGa mpi-linux-x86_64 cuda -j12

(-j12 assumes you are on a 12 core hyak node). cd into the ChaNGa directory, configure ChaNGa to use cuda and build it:

 ./configure --with-cuda=$CUDATOOLKIT_HOME
 make -j 12

Quick test results for the testcollapse simulation show a factor of 6x speedup:

 Walltime with cuda: 0m48.023s
 Walltime without cuda: 4m55.079s

mox (hyak 2)

Updated 06/26/17

When building on mox, make sure to request an interactive session on a compute node, e.g.:

 srun -t 0 -N 1 -p vsm --pty /bin/bash

Then load the gnu compiler with intel mpi

 module load gcc_4.8.5-impi_2017

Build charm with mpi linux, without SMP (tested for speed protoplanetary disks). Mox nodes have 28 cores, so you can use a lot.

 ./build ChaNGa mpi-linux-x86_64 -j20

ChaNGa can be built with defaults.

When submitting jobs with sbatch, you should not need to specify the number of tasks, just the number of nodes.

 ChaNGa should be run with mpirun.  For a job on 7 nodes for 1 day, your submission script can look like:

#!/bin/bash -l
#SBATCH -N 7
#SBATCH -J jobname
#SBATCH -t 24:00:00
#SBATCH -p vsm
#SBATCH --mem=500G
#SBATCH --mail-type=ALL [email protected]
cd path/to/run
mpirun ChaNGa paramfile.param &> stdoutfile

klone (hyak 3)

Updated 01/10/23

Building ChaNGa on Klone is very similar to the build on Mox. To get an interactive node on Klone (in this case, using the stf partition), run the command

salloc -A stf -p compute-int -N 1 --time=00:30:00

The main difference is that the intel mpi compiler must be loaded with a slightly differently

module load stf/mpich/4.0a2

Retired Machines

Knights Landing cluster (Stampede 2 at TACC)

The interconnect on the Stampede 2 KNL cluster uses the Intel OmniPath architecture which does not work well with the verbs API. Use an MPI build instead:

./build ChaNGa mpi-linux-x86_64 smp mpicxx --with-production -xCORE-AVX2 -axMIC-AVX512

The final two flags compiles code for both the Haswell login node (needed for the compile tools) and the KNL. When configuring ChaNGa, enable AVX2 (--enable-avx) to take advantage of the KNL floating point units. Also add -xCORE-AVX2 -axMIC-AVX512 to the C/C++ compiler flags in the Makefile.

Since this is an MPI build, ChaNGa can be executed from within a batch script. The Omnipath network requires more CPU for communication, so multiple processes/node is helpful. Four processes/node can be specified in the sbatch command with a "-N nnn" argument where "nnn" is the total tasks divided by four. E.g. for a job running on 8 nodes, use sbatch -n 32 -N 8 xxx.qsub. Then in the script, ChaNGa is executed with:

ibrun ./ChaNGa ++ppn 16 xxx.param

Thread layout on the KNL chip may be important for performance. To divide the chip into four equal quadrants (this is close to, but not quite how the hardware is laid out) add the following options:

+setcpuaffinity +commap 0,17,34,51 +pemap 1-16,18-33,35-50,52-67

Hyperthreading may also help. To use two hyperthreads per core (note that this will have 128 threads total on a node), the command would be:

ibrun ./ChaNGa ++ppn 32 +setcpuaffinity +commap 0,17,34,51 +pemap 1-16+68,18-33+68,35-50+68,52-67+68 xxx.param

Yes, this is painful. It is hoped that future versions of Charm++ will make hyperthreading syntax easier.

Update from Jim Phillips, NAMD developer: It's a bad idea to split tiles across SMP nodes, so the pemaps should start on even PEs. Furthermore, the comm threads can be anywhere on the chip since they are going to the network anyway. His preferred map is therefore:

ibrun ./ChaNGa ++ppn 32 +setcpuaffinity +commap 64-67 +pemap 0-63+68 xxx.param

Furthermore, recalling that the Intel Openfabric network needs a lot of CPU, the following gives more communication processors:

ibrun ./ChaNGa ++ppn 20 +setcpuaffinity +commap 60-65 +pemap 0-59+68 xxx.param

In this case there will be 6 mpi tasks per physical KNL node, e.g. the sbatch command will be something like sbatch -n 30 -N 5 xxx.job, where "30" is the total number of tasks, and 5 is the number of nodes, each of which is running 6 tasks.

Skylake cluster (Stampede 2 at TACC)

Stampede2 also includes a Skylake partition. Much of the description for the KNL partition holds here since the interconnect is the same, but details of the processors are quite different. Build the MPI target of charm with:

./build ChaNGa mpi-linux-x86_64 smp mpicxx --with-production

configure ChaNGa with the --enable-avx flag.

After configuring ChaNGa, edit the Makefile to add -xCORE-AVX2 flag to the opt_flag line. The -xCORE-AVX2 flag allows ChaNGa to take advantage of the AVX2 vector instructions to calculate gravity.

The Skylake nodes have two sockets on each node, so having 2 SMP processes per node helps with performance. Be aware that the mapping betweens cores and sockets is different than most other machines: all the even numbered cores are on one socket and all the odd numbered cores are on the other. For good performance all the threads of an SMP process should be on a single core, so a typical run command would be:

ibrun ./ChaNGa ++ppn 23 +setcpuaffinity +commap 0,1 +pemap 2-46:2,3-47:2 xxx.param

ibrun ./ChaNGa ++ppn 11 +setcpuaffinity +commap 0,24,1,25 +pemap 2-22:2,26-46:2,3-23:2,27-47:2 xxx.param

IBM Bluegene/L (frost at NCAR)

Use ./build ChaNGa mpi-bluegenel -O3 to compile charm++ with the GCC compiler and bluegene specific communication library, or ./build ChaNGa mpi-bluegenel xlc to compile charm++ with the IBM C compiler and a bluegene specific communication library. The IBM C compiler (v. 9) introduces bugs at high optimizations, so beware.

This architecture does not come with an XDR library which ChaNGa uses for machine independent output. For this machine a compiled version of the XDR library is provided on our distribution site. Download the file xdr.tgz from the distribution site http://faculty.washington.edu/trq/hpcc/distribution/changa/ and unpack it in the ChaNGa directory. The configure script for ChaNGa will then detect it and link to it appropriately.

Previous problems with linking in the XDR library have been fixed.

SDSC comet

The mvapich2_ib MPI implementation seems to be the most performant. To use this, load the intel module and the mvapich2_ib model, then follow the directions for mpi-linux-x86_64 below.

IBM Power5 (BluePrint at NCSA)

lapi

For systems with a Federation switch (ARSC iceberg), directly using IBM's communication layer may give better performance.
May need to use gmake instead of make (depending on which make are installed)
A few extra options are needed: -qstrict and -qrtti=dyna
- Make using make OPTS="-O3 -qstrict -qrtti=dyna"
Builds fine otherwise.
Note that charmrun is just a wrapper around poe. It is more robust just to use poe to start a parallel job.

Issues

If there is a complain at runtime that libjpeg cannot be loaded, modify conv-autoconfig.h in the tmp directory of charm++. Enter the libs/ck-libs/liveViz directory and make clean; make.

SGI Altix (Cobalt at NCSA)

mpi-linux-ia64-mpt-icc

Builds and runs out of the box.

IBM SP4 (Copper at NCSA)

mpi-sp

May need to configure --host aix if the C/C++ compiler produces an executable that needs an MPI environment to run.
Need to use gmake instead of make
Builds fine otherwise.
Note that charmrun is just a wrapper around poe. It is more robust just to use poe to start a parallel job.

Cray XT3 (Bigben at PSC)

The Cray OS (catamount) does not have the xdr library available. Download it from our distribution site, and compile it with "gcc" before building ChaNGa.

mpi-crayxt3-gcc4

The following commands need to be executed before Charm and ChaNGa can be built.

module load gcc/4.0.2

module remove acml
configure needs to be run as ./configure -host linux since the cray front end is actually a cross-compilation environment.
charmrun doesn't work on bigben. Use the standard pbsyod to run ChaNGa.

mpi-crayxt3

This uses the default pathScale compiler which may give better performance. However building charm is a little tricky.

Use ./build charm++ mpi-crayxt3 to build charm.
Change to the tmp directory and edit conv-mach.sh and change the CMK_SEQ_CC and CMK_SEQ_LD definitions to gcc.
Type make charm++ to rebuild with these changes.
Build ChaNGa with the standard configure and make commands.
As above, use pbsyod to run.

If you do not make the changes above, you will get a mysterious alloca() error.

Cray XT5 (kraken at NICS)

mpi-crayxt

For some reason, at this date (July 2009) the PGI compiler produces code that is an order of magnitude slower (!!!) than code from the GCC compiler. The procedure is therefore as follows.

Switch to the GNU programming environment: module swap PrgEnv-pgi PrgEnv-gnu
Build charm with ./build ChaNGa mpi-crayxt -O3.
configure and make ChaNGa

GPGPU Cluster (forge at NCSA)

These directions will also help with a single workstation with a CUDA capable GPU.

When building charm++, it needs to know where the CUDA compiler lives. This can be set with an environment variable; here is an example for forge:

 export CUDA_DIR=/usr/local/cuda_4.0.17/cuda
 ./build ChaNGa net-linux-x86_64 cuda -O2

Once ChaNGa is configured, the Makefile needs to be edited to point CUDA_DIR and NVIDIA_CUDA_SDK at the above directories, but also to uncomment the CUDA = ... line. Also CUDA does not handle hexadecapole multipole moments, so the HEXADECAPOLE = line needs to be commented out.