
1-iii. ColonialOne


ColonialOne Hardware Configuration

The Colonial One High-Performance Cluster currently consists of 213 compute nodes accessible through four login nodes. Login nodes provide remote access through SSH, and file transfer services through SCP/SFTP and Globus. The cluster uses Dell C8000 chassis, with both C8220 (CPU nodes) and C8220x (GPU nodes) models.

The login nodes, which should be used only for submitting jobs, file transfers, software compilation, and simulation preparation, are accessible as login.colonialone.gwu.edu. The two current login nodes can also be accessed directly as login3.colonialone.gwu.edu or login4.colonialone.gwu.edu. You can connect with SSH, or transfer files with SCP/SFTP.
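As a quick sketch (replace netid with your own GWU NetID; the file name is just a placeholder), logging in and copying data to the cluster looks like this:

    # Log in to a login node (replace "netid" with your GWU NetID)
    ssh netid@login.colonialone.gwu.edu

    # Or target a specific login node directly
    ssh netid@login3.colonialone.gwu.edu

    # Copy a local file to your home directory on the cluster
    # ("reads.fastq" is a placeholder file name)
    scp reads.fastq netid@login.colonialone.gwu.edu:~/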

Available filesystems

There are two main filesystems on Colonial One. The first is served over NFS, holds /home and /groups, and has 250TB of usable space. The second is a high-speed Lustre scratch filesystem, mounted as /lustre. A third filesystem, /import, can be purchased for archival storage. By default, you have access to three locations (with a fourth archival option available for purchase):

  • /home/$netid/ - your home directory, with a default quota of 25GB. NOTE: no jobs should be run against the /home partition; it is not designed for performance. Use /lustre instead.
  • /groups/$group/ - shared group space, accessible by anyone in your group, with a default quota of 250GB. (If you are unsure of which group you are in, the groups command will tell you.) NOTE: no jobs should be run against the /groups partition; it is not designed for performance. Use /lustre instead.
  • /lustre/groups/$group/ - shared group scratch space.
  • /import/$group/ - archival storage that can be purchased. NOTE: no jobs should be run against the /import partition; it is for archival purposes only and is not designed for performance. Use /lustre instead.
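For example, here is a brief sketch of checking your group and setting up a working directory in scratch space before running anything (mygroup and myproject are placeholder names):

    # Show which group(s) you belong to
    groups

    # Check how much of your 25GB home quota is in use
    du -sh ~

    # Create and move into a working directory in your group's /lustre scratch space
    mkdir -p /lustre/groups/mygroup/myproject
    cd /lustre/groups/mygroup/myproject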

Submitting jobs on the cluster

The Slurm workload scheduler is used to manage the compute nodes on the cluster. Jobs must be submitted through the scheduler to have access to compute resources on the system. There are several partitions (aka "queues") currently configured on the system:

  • short - has access to 128GB Ivy Bridge nodes with a shorter (currently 2-day) timelimit, designed for quicker turnaround of shorter-running jobs. Some nodes here overlap with defq.
  • defq - the default partition of CPU-only compute nodes with 64, 128, or 256GB of memory.
  • debug - see DebugPartition for details.
  • 128gb, 256gb - explicitly request compute nodes with 128GB or 256GB of memory, respectively, for larger-memory jobs.
  • 2tb - a special-purpose machine with 2TB of RAM and 48 3GHz CPU cores. Access to this partition is restricted; please email [email protected] if you have applications appropriate for this unique system.
  • gpu - has access to the GPU nodes, each of which has two NVIDIA K20 GPUs.
  • gpu-noecc - has access to the same GPU nodes, but disables error correction on the GPU memory before the job runs.
  • ivygpu-noecc - has the same NVIDIA K20 GPUs, but with newer Ivy Bridge Xeon processors.
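To check what is available before submitting, the standard Slurm status commands can be used (output will vary with cluster state):

    # List all partitions, their timelimits, and node availability
    sinfo

    # Show just the gpu partition
    sinfo -p gpu

    # Show your own pending and running jobs
    squeue -u $USER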

Note that you must set a timelimit for your jobs when submitting (with the -t flag; for example, -t 1:00:00 sets a limit of one hour), otherwise they will be rejected immediately. This allows the Slurm scheduler to keep the system busy by backfilling while it allocates resources for larger jobs. The maximum timelimit for any job is 14 days, but you are encouraged to keep jobs limited to a day; longer-running processes should checkpoint and restart to avoid losing significant amounts of outstanding work if there is a problem with the hardware or cluster configuration.
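As a minimal sketch, a batch script might look like the following; the job name, partition, resource requests, and paths are illustrative assumptions and should be adjusted for your own work:

    #!/bin/bash
    #SBATCH -J myjob              # job name (placeholder)
    #SBATCH -p short              # partition (queue) to submit to
    #SBATCH -t 1:00:00            # required timelimit: one hour
    #SBATCH -N 1                  # run on a single node
    #SBATCH -o slurm-%j.out       # write output to slurm-<jobid>.out

    # Run from group scratch space, not /home or /groups (placeholder path)
    cd /lustre/groups/mygroup/myproject

    # Placeholder command for the actual analysis
    ./run_analysis.sh

Save this as, say, myjob.sh, submit it with sbatch myjob.sh, and monitor it with squeue -u $USER.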

Previous Section: Unix Bootcamp
This Section: ColonialOne
Next Section: Tutorial Setup