Running simulations on the TACC Stampede2

This page explains how to set up and run the iEBE-MUSIC framework on the Stampede2 cluster at the Texas Advanced Computing Center (TACC).

First-time setup (only needs to be done once)

Log in to Stampede2 using ssh [UserName]@stampede2.tacc.utexas.edu (the login node can differ between users).
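
For example, with a hypothetical user name jdoe (replace it with your own TACC account), the login command would look like

ssh jdoe@stampede2.tacc.utexas.edu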

Once you are logged in to Stampede2, you can request an interactive node with

idev
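
If you need a particular queue or session length, idev accepts extra options; the exact flags can change, so check idev --help on Stampede2 first. A possible invocation (the skx-dev queue name and the 60-minute limit are assumptions for illustration):

idev -p skx-dev -m 60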

Then you can load the required module,

module load tacc-singularity
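
You can confirm that it is loaded with the standard module command:

module list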
  • Download the Singularity image of the iEBE-MUSIC framework,
mkdir -p $WORK/singularity_repos/
cd $WORK/singularity_repos/
singularity pull docker://chunshen1987/iebe-music:dev
  • Download the iEBE-MUSIC framework code under your work directory (a quick check of both downloads is sketched after this list),
cd $WORK
git clone https://github.com/chunshen1987/iEBE-MUSIC -b dev
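
As an optional sanity check that both downloads landed where the later steps expect them:

ls $WORK/singularity_repos/   # should list iebe-music_dev.sif
ls $WORK/iEBE-MUSIC           # should show generate_singularity_jobs.py, config/, etc.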

After you finish these steps, you can exit the compute node by typing exit and return to the login node for the job submission described below.

Running Simulations

To generate and run a batch of simulations, you need to start from your work directory,

cd $WORK/iEBE-MUSIC/
./generate_singularity_jobs.py -w $SCRATCH/[runDirectory] -c stampede2 --node_type SKX -n 24 -n_hydro 1 -n_th 4 -par [parameterFile.py] -singularity $WORK/singularity_repos/iebe-music_dev.sif -b [bayesParamFile]

Here [runDirectory] needs to be replaced with a name describing the collision system. The file [parameterFile.py] needs to be replaced by the actual parameter file. You can find example parameter files in the config/ folder. All the model parameters are listed in the config/parameters_dict_master.py file. The training parameters for the Bayesian emulator can be specified using the -b option with a parameter file [bayesParamFile]. Please note that you must provide the absolute path for [bayesParamFile].
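
As an illustration, a fully spelled-out command could look like the following. The run directory name AuAu200_test and the Bayesian parameter file path are hypothetical placeholders, and parameters_dict_user.py is simply the script's default parameter file name (see the help message below); substitute your own files.

./generate_singularity_jobs.py -w $SCRATCH/AuAu200_test -c stampede2 --node_type SKX -n 24 -n_hydro 1 -n_th 4 -par parameters_dict_user.py -singularity $WORK/singularity_repos/iebe-music_dev.sif -b $WORK/myBayesParams/bayes_params.txt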

The option -n specifies the number of jobs to run. On Stampede2, we recommend setting the number of hydro events to simulate per job, -n_hydro, to 1 so that the shortest walltime can be requested for each batch of jobs. With -n_hydro 1, the option -n effectively sets the total number of events to simulate in the batch.

On Stampede2, the option --node_type accepts SKX, KNL, and ICX. When using KNL nodes, we recommend setting the number of OpenMP threads with -n_th 16; for SKX and ICX nodes, use -n_th 4.
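
For instance, a KNL batch would only change the node type and the thread count relative to the SKX command above (the numbers are illustrative, not a tuned configuration):

./generate_singularity_jobs.py -w $SCRATCH/[runDirectory] -c stampede2 --node_type KNL -n 24 -n_hydro 1 -n_th 16 -par [parameterFile.py] -singularity $WORK/singularity_repos/iebe-music_dev.sif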

If you saved the Singularity image under a different name, replace iebe-music_dev.sif with your customized name.
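
For example, Singularity lets you choose the output file name when pulling the image; the name iebe-music_custom.sif below is only an illustration,

cd $WORK/singularity_repos/
singularity pull iebe-music_custom.sif docker://chunshen1987/iebe-music:dev

and you would then pass -singularity $WORK/singularity_repos/iebe-music_custom.sif to the script.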

The full help message can be viewed by ./generate_singularity_jobs.py -h,

usage: generate_singularity_jobs.py [-h] [-w] [-c] [--node_type] [-n]
                                    [-n_hydro] [-n_th] [-par] [-singularity]
                                    [-exe] [-b] [-seed]

⚛ Welcome to iEBE-MUSIC package

  -h, --help            show this help message and exit
  -w , --working_folder_name 
                        working folder path (default: playground)
  -c , --cluster_name   name of the cluster (default: local)
  --node_type           name of the queue (work on stampede2) (default: SKX)
  -n , --n_jobs         number of jobs (default: 1)
  -n_hydro , --n_hydro_per_job 
                        number of hydro events per job to run (default: 1)
  -n_th , --n_threads   number of threads used for each job (default: 1)
  -par , --par_dict     user-defined parameter dictionary file (default:
                        parameters_dict_user.py)
  -singularity , --singularity 
                        path of the singularity image (default: iebe-
                        music_latest.sif)
  -exe , --executeScript 
                        job running script (default:
                        Cluster_supports/WSUgrid/run_singularity.sh)
  -b , --bayes_file     parameters from bayesian analysis (default: )
  -seed , --random_seed Random Seed (-1: according to system time) (default: -1)

After running this script, the working directory $SCRATCH/[runDirectory] will be created. You can then submit the job by

cd $SCRATCH/[runDirectory]
sbatch submit_MPI_jobs.script

Make sure you are on a login node when submitting jobs. If you are on a compute node (for example, inside an idev session), the system will return an error when you try to submit a job. In that case, exit the compute node, go back to the login node, and submit the job from there.

While the jobs are running, you can type squeue -u $USER to check the progress.
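
The standard Slurm tools also let you inspect or cancel a batch, for example (the job ID below is made up):

squeue -u $USER      # list your pending and running jobs
scancel 1234567      # cancel a job by its job ID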

After the simulations finish, the final results will be automatically copied from $SCRATCH/[runDirectory] to $WORK/RESULTS/[runDirectory].
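
To bring the results to your own machine, a plain scp (or rsync) from your local computer works; a hypothetical sketch, with jdoe as a placeholder user name and /path/to/WORK standing for the value that echo $WORK reports on Stampede2:

scp -r jdoe@stampede2.tacc.utexas.edu:/path/to/WORK/RESULTS/[runDirectory] ./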