Running ChaNGa

Initial Conditions and Parameters

ChaNGa accepts Tipsy files as initial conditions. Program execution is controlled by either a parameter file or command line switches, in the style of PKDGRAV. See the testcosmo or teststep subdirectories for example parameter files. ChaNGa --help will list all available options; their meanings are described in ChaNGa Options. ChaNGa can be run in parallel or in serial. Generally (depending on the architecture), running in parallel requires starting ChaNGa with the charmrun program. For example,

charmrun +p4 ./ChaNGa cube300.param

will start ChaNGa on four processors using the cube300.param parameter file.
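
The parameter file itself is a plain list of name = value assignments. Here is a minimal sketch; achInFile and iCheckInterval appear elsewhere on this page, while the remaining names and values are illustrative assumptions, so run ChaNGa --help for the authoritative list:

# Minimal parameter file sketch. achInFile and iCheckInterval are used
# elsewhere on this page; nSteps and dDelta are assumed names here.
achInFile      = cube300.std
nSteps         = 128
dDelta         = 0.01
iCheckInterval = 10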

Here is a more complicated example:

charmrun +p 4 ++local ./ChaNGa -wall 60 +balancer MultistepLB_notopo cube300.param

++local means run all processes locally, ignoring the network. -wall 60 means run for 60 minutes before checkpointing and stopping. +balancer MultistepLB_notopo specifies a load balancer.

SMP Architectures

SMP refers to Symmetric Multi-Processing, in which many cores on each compute node share the same memory space. The Charm++ runtime can take advantage of this shared access and use fewer messages, but the start command needs to be modified to tell ChaNGa about the processor configuration.

If charm is built with the smp option to take advantage of SMP, then compiling ChaNGa produces the executables charmrun.smp and ChaNGa.smp, indicating that SMP execution is compiled in. An example command line to run on 2 nodes with 48 SMP cores each looks like:

charmrun.smp +p 94 ChaNGa.smp ++ppn 47 +setcpuaffinity +commap 0 +pemap 1-47 test.param

Although, in this example, there are a total of 96 cores available, each node needs one core for communication, so only 94 cores (the +p 94 argument) are available as "workers", 47 per node (the ++ppn 47 argument). Frequently, specifying the layout of the communication and worker threads on the cores helps performance. Here the +setcpuaffinity +commap 0 +pemap 1-47 arguments specify a layout with a communication thread on core 0 and worker threads on cores 1 to 47.

Sometimes more than one communication thread is needed per node. In the following example, each of two nodes has two sockets, with each socket containing 64 cores. With this many cores, more than one communication thread is likely to be needed so the command would be:

charmrun.smp +p 252 ChaNGa.smp ++ppn 63 +setcpuaffinity +commap 0,64 +pemap 1-63,65-127 test.param

In this case, a total of 4 threads (2 per node) are used for communication, so only 252 of the 256 cores are available for computing. The ++ppn 63 indicates there will be 63 workers for each communication thread, laid out so that communication threads sit on cores 0 and 64 while the remaining cores are used for workers. This ordering gives each socket one communication thread and 63 associated workers. Note that the numbering of cores across sockets can vary by machine, so check the machine's documentation for the best layout.
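
The bookkeeping above generalizes. As an illustrative shell sketch (not part of ChaNGa itself), the charmrun arguments can be derived from the node layout like this:

# Sketch: derive charmrun SMP arguments from the node layout.
NODES=2
CORES_PER_NODE=48
COMM_PER_NODE=1                                 # communication threads per node
PPN=$(( CORES_PER_NODE / COMM_PER_NODE - 1 ))   # workers per communication thread
P=$(( NODES * COMM_PER_NODE * PPN ))            # total worker threads
echo "charmrun.smp +p $P ChaNGa.smp ++ppn $PPN +setcpuaffinity test.param"
# With the values above this prints +p 94 and ++ppn 47, matching the first
# example (the +commap/+pemap core lists still need to be chosen by hand).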

netlrts-linux Architectures

The "net" version of charm starts multiple processes by invoking ssh; therefore an ssh server needs to be installed on the target machine. For example, on Redhat/Fedora machines the openssh-server package needs to be installed. yum install openssh-server will accomplish this. If you are using the "net" version to run on a single machine with multiple cores, the use of ssh can be avoided by using the ++local charmrun option. Also by default, ssh requires you to enter your password. This can be avoided by setting up your ssh keys correctly. See the SSH with keys HOWTO for information on how to do this.

netlrts-linux-x86_64-cuda

The GPU version is experimental.

The GPU version of ChaNGa offloads computation to the GPU in chunks called work requests (WRs). The interaction of one bucket of particles with a node or with another bucket of particles constitutes one unit of computation, and each WR holds a user-specified number of these force computations.

There are several kinds of WR in ChaNGa. WRs that represent the computation between local buckets and local data (either nodes or other buckets) are referred to as 'local'. Similarly, WRs that specify computation of local buckets with remote prefetched data are termed 'remote'. Finally, WRs that specify interaction between local buckets and remote data that has not been prefetched are termed 'remote-resume'.

ChaNGa provides the following parameters to set the size of each type of WR:

Local WRs:

  • -localnodes: bucket - local node computations to offload per WR
  • -localparts: bucket - local bucket computations to offload per WR

Remote WRs:

  • -remotenodes: bucket - remote node computations to offload per WR
  • -remoteparts: bucket - remote bucket computations to offload per WR

Remote-Resume WRs:

  • -remoteresumenodes: bucket - remote-resume node computations to offload per WR
  • -remoteresumeparts: bucket - remote-resume bucket computations to offload per WR

Values for these parameters affect the efficiency of kernel execution and the total execution time. For instance, if a WR size is set too high, there is less overlap between work done on the CPU and work done on the GPU. On the other hand, values that are too small increase the transfer and kernel invocation overheads associated with each WR.

Appropriate values can be obtained by the following mechanism:

  • Recompile the ChaNGa CUDA version with -DCUDA_STATS in addition to the other CUDA-specific flags.
  • This gives the per-iteration count of each type of interaction (localnodes, localparts, remotenodes, remoteparts, remoteresumenodes, remoteresumeparts).
  • These values can be used to split the total number of interactions into as many pieces (WRs) as deemed appropriate. Some effort might be required to determine appropriate values in this fashion.

The default value for each of these parameters is 0.1 million (100,000) computations per WR.
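
For illustration, a run overriding the defaults might look like the following; the specific values are arbitrary and should instead come from the -DCUDA_STATS procedure above:

./ChaNGa -localnodes 200000 -localparts 200000 \
         -remotenodes 100000 -remoteparts 100000 \
         -remoteresumenodes 50000 -remoteresumeparts 50000 test.param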

MPI Architectures

On MPI architectures, you have the option of building the MPI version of charm; charmrun is then just a shell script wrapper around whatever command is used to start MPI jobs (e.g. poe on IBM, mpirun on MPICH). A typical launch command for an MPI job would be

mpiexec ./ChaNGa -wall 600 +balancer MultistepLB_notopo simulation.param

where 600 refers to the minutes of wall clock time requested from the queuing system and MultistepLB_notopo is the specified load balancer.
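
In practice this launch is usually wrapped in a scheduler script. The following Slurm sketch is illustrative only; the directives and resource names vary by site and scheduler:

#!/bin/bash
#SBATCH --nodes=4                 # illustrative resource request
#SBATCH --ntasks-per-node=16
#SBATCH --time=10:15:00           # a little longer than the -wall limit below,
                                  # so ChaNGa can checkpoint and exit cleanly

mpiexec ./ChaNGa -wall 600 +balancer MultistepLB_notopo simulation.param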

Another option on many InfiniBand clusters is to use the native InfiniBand support. See https://github.com/N-BodyShop/changa/wiki/Machine-Specific-Build-Instructions#Infiniband_Linux_cluster_lonestar_stampede_at_TACC_gordon_at_SDSC_Plieades_at_NAS for details.

Cray Architectures

Many Cray machines (XE, XK, and XC series) use aprun to start parallel jobs. Like mpirun, aprun takes the place of charmrun. See the aprun documentation for how to specify the number of nodes and the number of cores per node. An example is:

aprun -n 4 -N 1 -d 16 ChaNGa +ppn 15 cube300.param

to start ChaNGa on 4 nodes with one SMP process per node, each process having 16 threads (15 workers, 1 communication).
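
As another illustrative variant (the node and core counts are assumed), a job on 8 nodes with 32 cores each, again one process per node, would be:

aprun -n 8 -N 1 -d 32 ChaNGa +ppn 31 test.param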

See Appendix C of the Charm++ manual for more information on parallel execution. Also see Research:ChaNGaPerformanceAnalysis to evaluate how these options affect parallel performance.

ChaNGa Output

Outputs are also in Tipsy format, in files whose names end with the timestep number. For example, to visualize the final output of the testcosmo simulation, fire up tipsy and type

openbinary cube300.000128
loadstandard 1.0
zall

This should display the clustering of galaxies on a 300 Mpc scale.

Restarts

It is frequently the case that a simulation will take much more wall clock time than a batch queuing system will allow. In this case, ChaNGa can write checkpoints at regular step intervals (iCheckInterval), and the simulation can be restarted in a subsequent batch submission from one of these checkpoints. A simulation can be restarted from a checkpoint using the syntax:

charmrun +p4 ./ChaNGa +restart cube300.chk0

where cube300.chk0 is an example restart directory. As ChaNGa runs, it produces restart directories with suffixes alternating between .chk0 and .chk1. All parameters will be restored from the checkpoint directory. Only a small subset of the run parameters can be changed on restart, and only by specifying the changes as command line arguments. These include the base timestep (-dt), the number of timesteps (-n), the wall clock time limit (-wall), the particles per bucket (-b), the output interval (-oi), and the checkpoint interval (-oc).
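
For example, to continue the run above with a fresh two-hour wall clock limit and a total of 256 steps (the values are illustrative):

charmrun +p4 ./ChaNGa +restart cube300.chk0 -wall 120 -n 256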

Restarting from output files

If a restart requires a substantial change to the run (e.g. changing the version of the code), it can be accomplished by restarting from an output file. In this case, edit the parameter file so that achInFile is the output file from which you wish to restart and iStartStep is set to the step number of that file.
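
For example, to restart the testcosmo run from the step-128 output shown in the ChaNGa Output section above, the relevant parameter file lines would read:

achInFile  = cube300.000128
iStartStep = 128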

Visualization

ChaNGa now (as of 7/2009) has on-demand visualization capabilities via the liveViz module of Charm++. To use it, set bLiveViz = 1 in the parameter file, and start ChaNGa with

charmrun +p4 ++server ++server-port NNNNN ./ChaNGa run.param

where NNNNN is an unused TCP port number. Images of the running simulation can be obtained by using the liveViz java client from the Charm++ distribution in java/bin/liveViz. The syntax is liveViz hostname NNNNN, where hostname is the machine on which charmrun is running and NNNNN is the port number given above. A window will pop up with an image that is continually refreshed from the running program. The image view is controlled by the .director file. See Research:ChaNGaOptions#Movie_Making_options.
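
Putting the two steps together with an arbitrarily chosen port number (run the first command on the simulation machine and the second from java/bin in the Charm++ distribution, replacing hostname with the simulation machine's name):

charmrun +p4 ++server ++server-port 12345 ./ChaNGa run.param
liveViz hostname 12345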

Improving Performance

See Research:ChaNGaPerformanceAnalysis for tools to measure and improve the performance of ChaNGa.

Getting Help

An email list has been set up at [email protected]. Please subscribe to the list before posting to it.

Bugs and feature requests can be submitted to the NChilada product on our Redmine server.

Also check out our list of Research:ChaNGa Issues for common errors when running ChaNGa.

Documentation

Internal code documentation using doxygen is partially done.

While there is no comprehensive body of documentation detailing the ChaNGa code, the recent refactoring efforts are outlined and discussed here. The refactoring process also unearthed answers to some subtleties of the existing code, so one would do well to look through these articles.

Acknowledgements

The development of ChaNGa was supported by a National Science Foundation ITR grant PHY-0205413 to the University of Washington, and NSF ITR grant NSF-0205611 to the University of Illinois. Contributors to the program include Graeme Lufkin, Tom Quinn, Rok Roskar, Filippo Gioachin, Sayantan Chakravorty, Amit Sharma, Pritish Jetley, Lukasz Wesolowski, Edgar Solomonik, Celso Mendes, Joachim Stadel, and James Wadsley.
