Welcome to the wiki for Hyades!
Hyades is a Supercomputer dedicated to Computational Astrophysics research at the University of California, Santa Cruz (UCSC). It is supported by a million-dollar grant from the National Science Foundation (award number AST-1229745) and additional matching funds from UCSC.
Architecturally, Hyades is a cluster composed of the following components:
Component | QTY | Description |
---|---|---|
Master Node | 1 | Dell PowerEdge R820, 4x 8-core Intel Xeon E5-4620 (2.2 GHz), 128GB memory, 8x 1TB HDDs |
Analysis Node | 1 | Dell PowerEdge R820, 4x 8-core Intel Xeon E5-4640 (2.4 GHz), 512GB memory, 2x 600GB SSDs |
Type I Compute Nodes | 180 | Dell PowerEdge R620, 2x 8-core Intel Xeon E5-2650 (2.0 GHz), 64GB memory, 1TB HDD |
Type IIa Compute Nodes | 8 | Dell PowerEdge C8220x, 2x 8-core Intel Xeon E5-2650 (2.0 GHz), 64GB memory, 2x 500GB HDDs, 1x Nvidia K20 |
Type IIb Compute Nodes | 1 | Dell PowerEdge R720, 2x 6-core Intel Xeon E5-2630L (2.0 GHz), 64GB memory, 500GB HDD, 2x Xeon Phi 5110P |
Lustre Storage | 1 | 146TB of usable storage served from a Terascala/Dell storage cluster |
ZFS Server | 1 | SuperMicro Server, 2x 4-core Intel Xeon E5-2609V2 (2.5 GHz), 64GB memory, 2x 120GB SSDs, 36x 4TB HDDs |
Cloud Storage | 1 | 1PB of raw storage served from a Huawei UDS system |
InfiniBand | 17 | 17x Mellanox IS5024 QDR (40Gb/s) InfiniBand switches, configured in a 1:1 non-blocking Fat Tree topology |
Gigabit Ethernet | 7 | 7x Dell 6248 GbE switches, stacked in a Ring topology |
10-gigabit Ethernet | 1 | 1x Dell 8132F 10GbE switch |
The Master/Login Node is the entry point to the Hyades cluster. It is a Dell PowerEdge R820 server that contains four (4x) 8-core Intel Sandy Bridge Xeon E5-4620 processors at 2.2 GHz, 128 GB memory and eight (8x) 1TB hard drives in a RAID-6 array. Primary tasks to be performed on the Master Node are:
- Editing codes and scripts
- Compiling codes
- Short test runs and debugging runs
- Submitting and monitoring jobs
The hostname of the Master Node is hyades.ucsc.edu (IP: 128.114.126.225). To access the Master Node, use an SSH client that supports the SSH-2 protocol. Then execute the following command (replace username with your actual username):
ssh -l username hyades.ucsc.edu
or:
ssh username@hyades.ucsc.edu
The Visualization & Analysis Node is Eudora (hostname: eudora.ucsc.edu; IP: 128.114.126.226). Eudora is another public host of the Hyades cluster. It is a Dell PowerEdge R820 server that contains four (4x) 8-core Intel Sandy Bridge Xeon E5-4640 processors at 2.4 GHz, 512 GB memory and two (2x) 600GB SSDs in a RAID-0 array. Eudora is designed to run jobs that require a lot of memory and/or fast I/O. It is ideal for Visualization & Data Analysis tasks.
To access Eudora, use an SSH client that supports the SSH-2 protocol. Then execute the following command:
ssh -l username eudora.ucsc.edu
or:
ssh username@eudora.ucsc.edu
There are 3 types of Compute Nodes in the Hyades cluster:
- Type I Compute Nodes (CNs I) are conventional compute nodes (180 in total). Each CN I is a Dell PowerEdge R620 server containing two (2x) 8-core Intel Sandy Bridge Xeon E5-2650 processors at 2.0 GHz, 64 GB memory and one 1TB hard drive. Among the 180 CNs I, 2/3 (120 nodes) have Hyper-Threading turned off (thus the operating system addresses 16 cores in each node), while the remaining 1/3 (60 nodes) have Hyper-Threading turned on (thus the operating system addresses 32 virtual or logical cores in each node). The former belong to the normal queue, and the latter to the hyper queue of the Torque batch system.
- Type IIa Compute Nodes (CNs IIa) are GPU nodes (8 in total). Each CN IIa is a Dell C8220x server containing two (2x) 8-core Intel Sandy Bridge Xeon E5-2650 processors at 2.0 GHz, 64 GB memory, two (2x) 500GB hard drives, and one Nvidia K20 GPU. All GPU nodes have Hyper-threading turned off (thus the operating system addresses 16 cores in each node); and they belong to the gpu queue of the Torque batch system.
- Many Integrated Core (MIC) Architecture is Intel's response to GPU or accelerated computing. We were very fortunate to receive a donation of two (2x) Xeon Phi 5110P processors from Intel in 2013. We've since integrated those 2 Xeon Phi processors into a Dell PowerEdge R720 server, which contains two (2x) 6-core Intel Sandy Bridge Xeon E5-2630L processors at 2.0 GHz, 64 GB memory and one 500GB hard drive. That machine is our one and only Type IIb Compute Node (CN IIb), and is christened Aesyle. To experiment with MIC computing, please consult our MIC QuickStart Guide.
The Storage subsystem of the Hyades cluster is a rich medley of many interesting technologies:
- The /home partition is served from a ZFS pool on a FreeBSD server. The server is a SuperMicro box containing 2x 4-core Intel Ivy Bridge Xeon E5-2609V2 processors at 2.5 GHz, 64GB memory, 2x 120GB SSDs, 36x 4TB HDDs. Among those 36x HDDs, 12 are in a RAIDZ2 ZFS volume which is NFS-mounted at /home on each node; the remaining 24 are in another RAIDZ2 ZFS volume which is NFS-mounted at /trove on the Master Node & Eudora.
- On Hyades, the workhorse file system is Lustre, a high-performance parallel distributed file system widely used on the world's top supercomputers. The Lustre storage of Hyades is served from a Terascala/Dell storage cluster. It provides 146TB of usable capacity and is mounted at /pfs on each node.
- The last piece of our storage jigsaw is a Huawei Cloud Storage system. In 2013, we were privileged to collaborate with Huawei on deploying a UDS cloud storage system at UCSC. The Huawei Cloud Storage system provides a petabyte-level data storage, archiving, and sharing platform for Hyades. It is an object storage system accessed via the Amazon S3 protocol. For further details, please refer to the main article Huawei Cloud Storage.
Here is a bird's eye view of the internetworking of the Hyades cluster:
- The InfiniBand fabric is the expressway of Hyades. Its backbone is made up of 17 Mellanox IS5024 QDR InfiniBand switches, which are interconnected to form a 1:1 non-blocking Fat Tree topology. The InfiniBand fabric delivers high bandwidth (40 Gb/s) as well as low latency (~1 microsecond). Every Compute Node, the Master Node, Eudora, and the Lustre storage cluster are all plugged into the InfiniBand fabric. By default, the message passing of your MPI programs is conducted over InfiniBand, and the Lustre file system is likewise served to all nodes over InfiniBand.
- Every Compute Node, the Master Node, Eudora, the Lustre storage cluster, and the ZFS file servers are also interconnected through a Gigabit Ethernet (GbE) fabric. The backbone of this fabric is made up of 7 Dell 6248 GbE switches, stacked in a Ring topology. The GbE fabric is mostly used for management and Network File System (NFS) traffic. Although it is possible to run MPI programs over Gigabit Ethernet, it is unwise to do so, as the bandwidth is too low (1 Gb/s, of course) and the latency too high (a few milliseconds).
- All the public hosts are also connected to a Dell 8132F 10-gigabit Ethernet (10GbE) switch, which, via UCSC's 10G routers, exposes Hyades to the chaotic and wild Internet. Moreover, it is worth noting that the Network File Systems (/home & /trove) are served to the Master Node and Eudora via 10GbE.
Each user has a home directory at /home/$USER, where $USER is the username. The home directory is NFS-mounted on all the nodes in the Hyades cluster. It has a usable capacity of 36TB and is intended for storing your source codes and configuration files, and some reasonable amount of data as well. But because its I/O performance is relatively sluggish, do not run your jobs from your home directory!
Instead you should run jobs from the Lustre scratch storage, which is mounted at /pfs on all the nodes. For your convenience, a symbolic link pfs (pointing to /pfs/$USER) is also created in your home directory.
For more details on storage, see the subsection Storage.
We use the Environment Modules tool to manage users' software environment, via modulefiles. The Intel Compilers module and the Intel MPI module are loaded by default.
To see what modules are currently loaded, run:
module list
These modules are loaded by default:
- intel_mpi/4.1.3
- intel_compilers/14.0.1
To see all available modules, run:
module avail
To learn the usage of the module tool, run:
module --help
Intel Compilers are the default and recommended compilers on Hyades; PGI Compilers and GNU Compiler Collection (GCC) are available as alternatives. The following table summarizes how to compile C/C++ and Fortran 77/90 serial programs using the Intel Compilers.
Compiler | Program Type | Suffix | Example |
---|---|---|---|
icc | C | .c | icc [compiler_options] prog.c |
icpc | C++ | .C, .cc, .cpp, .cxx | icpc [compiler_options] prog.cpp |
ifort | Fortran 77 | .f, .for, .ftn | ifort [compiler_options] prog.f |
ifort | Fortran 90 | .f90, .fpp | ifort [compiler_options] prog.f90 |
Here are a few examples:
To compile hello.c, a serial "Hello world" program written in C, run
icc -o hello.x hello.c
To compile hello.cpp, a serial "Hello world" program written in C++, run
icpc -o hello.x hello.cpp
To compile hello.f, a serial "Hello world" program written in Fortran 77, run
ifort -o hello.x hello.f
To compile hello.f90, a serial "Hello world" program written in Fortran 90, run
ifort -o hello.x hello.f90
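The source of these "Hello world" programs is not reproduced in this wiki. For reference, a minimal hello.c might look like the following (a hypothetical sketch, not the actual source used on Hyades; the C++ and Fortran versions are analogous):

```c
/* hello.c -- a minimal serial "Hello world" program.
 * Illustrative sketch only, not the actual source used on Hyades. */
#include <stdio.h>

int main(void)
{
    printf("Hello, world!\n");
    return 0;
}
```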
Intel MPI is the default MPI implementation on Hyades. The Intel MPI Library is a multi-fabric message passing library that implements the Message Passing Interface v2.2 (MPI-2.2) specification. The following table summarizes how to compile MPI programs in C/C++ and Fortran 77/90, using Intel MPI.
MPI Compiler Command | Default Compiler | Supported Language(s) |
---|---|---|
mpicc | gcc | C |
mpicxx | g++ | C/C++ |
mpifc | gfortran | Fortran 77/90 |
mpigcc | gcc | C |
mpigxx | g++ | C/C++ |
mpif77 | g77 | Fortran 77 |
mpif90 | gfortran | Fortran 90 |
mpiicc | icc | C |
mpiicpc | icpc | C++ |
mpiifort | ifort | Fortran 77/90 |
The MPI compiler commands in the table above are wrappers around the GNU and Intel compilers; they automatically link the Intel MPI startup and message-passing libraries into the executables. Here are a few examples:
To compile mpi_hello.c, an MPI "Hello world" program written in C, run
mpiicc -o mpi_hello.x mpi_hello.c
To compile mpi_hello.cpp, an MPI "Hello world" program written in C++, run
mpiicpc -o mpi_hello.x mpi_hello.cpp
To compile mpi_hello.f, an MPI "Hello world" program written in Fortran 77, run
mpiifort -o mpi_hello.x mpi_hello.f
To compile mpi_hello.f90, an MPI "Hello world" program written in Fortran 90, run
mpiifort -o mpi_hello.x mpi_hello.f90
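For reference, a minimal mpi_hello.c could be structured as follows (a hypothetical sketch, not the actual source used on Hyades); each MPI rank reports its rank, the total number of ranks, and the node it runs on:

```c
/* mpi_hello.c -- a minimal MPI "Hello world" sketch (illustrative only). */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* my rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of ranks */
    MPI_Get_processor_name(name, &len);     /* node this rank runs on */

    printf("Hello from rank %d of %d on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}
```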
To compile OpenMP programs using Intel compilers, option -openmp must be set. Here are a few examples:
To compile omp_hello.c, an OpenMP "Hello world" program written in C, run
icc -openmp -o omp_hello.x omp_hello.c
To compile omp_hello.cpp, an OpenMP "Hello world" program written in C++, run
icpc -openmp -o omp_hello.x omp_hello.cpp
To compile omp_hello.f, an OpenMP "Hello world" program written in Fortran 77, run
ifort -openmp -o omp_hello.x omp_hello.f
To compile omp_hello.f90, an OpenMP "Hello world" program written in Fortran 90, run
ifort -openmp -o omp_hello.x omp_hello.f90
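For reference, a minimal omp_hello.c might look like this (a hypothetical sketch, not the actual source used on Hyades); each OpenMP thread prints its thread ID and the team size:

```c
/* omp_hello.c -- a minimal OpenMP "Hello world" sketch (illustrative only). */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();        /* this thread's ID */
        int nthreads = omp_get_num_threads();  /* size of the thread team */
        printf("Hello from thread %d of %d\n", tid, nthreads);
    }
    return 0;
}
```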
Note: for OpenMP programming, it is very important to know the cache line size of the CPU. Here is how to get the value, in bytes, on a Linux machine:

    $ cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size
    64
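The cache line size matters mostly because threads that write to distinct variables residing on the same cache line will repeatedly invalidate each other's caches (false sharing). A common remedy is to pad per-thread data out to the cache-line size; here is a hedged sketch assuming the 64-byte value obtained above (the file and struct names are made up for illustration):

```c
/* padded.c -- illustrative sketch: pad per-thread counters to the cache line
 * size (64 bytes assumed) so OpenMP threads do not falsely share a line. */
#include <stdio.h>
#include <omp.h>

#define CACHE_LINE 64

struct padded {
    long value;
    char pad[CACHE_LINE - sizeof(long)];  /* fill the rest of the cache line */
};

int main(void)
{
    struct padded counters[32] = {{0}};   /* assumes at most 32 threads per node */
    long total = 0;
    int i;

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        long n;
        for (n = 0; n < 10000000; n++)
            counters[tid].value++;        /* each thread stays on its own line */
    }

    for (i = 0; i < 32; i++)
        total += counters[i].value;
    printf("total = %ld\n", total);
    return 0;
}
```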
All the nodes in Hyades are Non-Uniform Memory Access (NUMA) systems. In each compute node, each of the two (2x) Intel Sandy Bridge Xeon processors has its own integrated memory controller and PCI Express controller, and the 2 processors (16 cores in total) share a single QDR InfiniBand link. There are thus 2 NUMA nodes per compute node, one per processor. To extract maximal performance from such an architecture, it is often profitable to employ a hybrid programming model, in which we launch only one MPI process on each processor (NUMA node) and then start one OpenMP thread on each core of that processor. This model often compares favorably with the pure MPI model, in which we launch one MPI process on each processor core.
Note: two options, -mt_mpi & -openmp, must be set when compiling MPI/OpenMP hybrid programs:
- -mt_mpi: linking the thread safe version of the Intel MPI Library
- -openmp: enabling the parallelizer to generate multi-threaded code based on OpenMP directives
To compile hybrid_hello.c, a hybrid "Hello world" program written in C, run
mpiicc -mt_mpi -openmp -o hybrid_hello.x hybrid_hello.c
To compile hybrid_hello.cpp, a hybrid "Hello world" program written in C++, run
mpiicpc -mt_mpi -openmp -o hybrid_hello.x hybrid_hello.cpp
To compile hybrid_hello.f, a hybrid "Hello world" program written in Fortran 77, run
mpiifort -mt_mpi -openmp -o hybrid_hello.x hybrid_hello.f
To compile hybrid_hello.f90, a hybrid "Hello world" program written in Fortran 90, run
mpiifort -mt_mpi -openmp -o hybrid_hello.x hybrid_hello.f90
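For reference, a minimal hybrid_hello.c could combine the two models like this (a hypothetical sketch, not the actual source used on Hyades); MPI is initialized with thread support, and each OpenMP thread then reports its MPI rank and thread ID:

```c
/* hybrid_hello.c -- a minimal MPI/OpenMP hybrid "Hello world" sketch
 * (illustrative only, not the actual source used on Hyades). */
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    int rank, size, len, provided;
    char name[MPI_MAX_PROCESSOR_NAME];

    /* Request thread support; the thread-safe MPI library is linked via -mt_mpi. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        printf("Hello from thread %d/%d of MPI rank %d/%d on %s\n",
               tid, nthreads, rank, size, name);
    }

    MPI_Finalize();
    return 0;
}
```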
Compiler options must be used to achieve optimal performance of any application. Generally, the highest impact can be achieved by selecting an appropriate optimization level, by targeting the architecture of the computer (CPU, cache, memory system), and by allowing for interprocedural analysis (inlining, etc.). There is no set of options that gives the highest speed-up for all applications. Consequently, different combinations have to be explored.
The most basic level of optimization the compiler can perform is controlled by the -On options, explained below.
Level | Description |
---|---|
n = 0 | Fast compilation, full debugging support; automatically enabled when -g is specified |
n = 1, 2 | Low to moderate optimization, partial debugging support |
n = 3+ | Aggressive optimization; compile time/space intensive and/or of marginal effectiveness; may change code semantics and results (sometimes it even breaks code!) |
The following table lists some of the more important compiler options that affect application performance, based on the target architecture, application behavior, loading, and debugging.
Option | Description |
---|---|
-c | For compilation of source file only. |
-O3 | Aggressive optimization (-O2 is default). |
-xAVX | Optimizes for Intel processors that support AVX (Advanced Vector Extensions) instructions. |
-g | Debugging information, generates symbol table. |
-mp | Maintain floating point precision (disables some optimizations). |
-mp1 | Improve floating-point precision (speed impact is less than -mp). |
-ip | Enable single-file interprocedural (IP) optimizations (within files). |
-ipo | Enable multi-file IP optimizations (between files). |
-prefetch | Enables data prefetching (requires -O3). |
-openmp | Enable the parallelizer to generate multi-threaded code based on the OpenMP directives. |
For more compiler/linker options, check the ifort and icc man pages, or consult the following online documentation:
- Intel C++ Compiler XE 14.0 User and Reference Guides
- Intel Fortran Compiler XE 14.0 User and Reference Guides
Do not run your codes from your home directory, which is slow and limited in capacity.
On Hyades we use Torque as the resource manager and Maui as the job scheduler. Torque is an open-source derivative of Portable Batch System (PBS). Commonly used Torque tools include:
- qsub, for submitting PBS jobs
- qstat, for monitoring the status of jobs
- qdel, for terminating jobs prior to completion
Users submit jobs to a queue and wait in line until nodes become available to run the job. There are 3 queues: normal, hyper, and gpu. The default queue is normal; your job will be submitted to the normal queue if no queue name is specified. The following table summarizes the queue characteristics (n below is the number of nodes requested for the job):
Queue | Total # of nodes | Resources per node | Max Walltime | qsub options |
---|---|---|---|---|
normal | 120 | 16 cores | 2 days | -l nodes=n:ppn=16 -q normal |
hyper | 60 | 32 cores | 4 days | -l nodes=n:ppn=32 -q hyper |
gpu | 8 | 16 cores and 1 GPU | 10 days | -l nodes=n:ppn=16 -q gpu |
To run your code on Hyades, usually you create a PBS job script, and then use the qsub command to submit the job to a queue. A PBS script is a shell script that contains a few extra comments at the beginning specifying directives to Torque/PBS. You are free to use your favorite shell; we use Bash in the following examples.
To run the serial executable hello.x compiled in subsection Compiling Serial Programs, first make sure the executable resides in the Lustre scratch storage (in /pfs/$USER or one of its subdirectories); then create a PBS script named serial.pbs in the same directory, with the following content:

    #!/bin/bash
    #PBS -N serial
    #PBS -l ncpus=1
    #PBS -l walltime=0:10:00

    cd $PBS_O_WORKDIR
    ./hello.x
Annotations of serial.pbs:
- #PBS -N serial: the job name is serial
- #PBS -l ncpus=1: we request 1 core for the job; alternatively, we can use #PBS -l nodes=1:ppn=1
- #PBS -l walltime=0:10:00: we request 10 minutes of run time
- if we want to submit the job to a queue other than the default normal, add a line like #PBS -q hyper
- cd $PBS_O_WORKDIR: required; PBS starts the script in your home directory on the executing compute node, so we change back to the directory from which the job was submitted
To submit the job, run:
qsub serial.pbs
Torque/PBS will print out the job ID, e.g.:
12345.hyades.ucsc.edu
The standard output of the executable will be saved in a file whose name has the following form: job_name.ojob_ID. When our job is completed, for example, we'll get serial.o12345:

    $ cat serial.o12345
    Hello, world!
Oftentimes we need to run a lot of instances of the same executable simultaneously, but with different parameters. For example, here is a sample serial program (jobarray_hello.x) that takes an integer argument:

    $ ./jobarray_hello.x 23
    Hello master, I am slave no. 23 running on hyades.ucsc.edu!
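The source of jobarray_hello.x is not shown in this wiki; a hypothetical sketch that reproduces the behavior above could be:

```c
/* jobarray_hello.c -- illustrative sketch of the sample program above;
 * not the actual source used on Hyades. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    char host[256];

    if (argc < 2) {
        fprintf(stderr, "Usage: %s <number>\n", argv[0]);
        return 1;
    }
    gethostname(host, sizeof(host));   /* node the instance runs on */
    printf("Hello master, I am slave no. %d running on %s!\n",
           atoi(argv[1]), host);
    return 0;
}
```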
For educational purposes, let's assume that we need to run the following instances:

    ./jobarray_hello.x 101
    ./jobarray_hello.x 102
    ...
    ./jobarray_hello.x 164
Instead of submitting 64 serial jobs, we can submit one job array. Create a PBS script named jobarray.pbs, with the following content:

    #!/bin/bash
    #PBS -N jobarray
    #PBS -l ncpus=1
    #PBS -t 101-164
    #PBS -l walltime=0:10:00

    cd $PBS_O_WORKDIR
    ./jobarray_hello.x $PBS_ARRAYID
Annotations of jobarray.pbs:
- #PBS -N jobarray: the job name is jobarray
- #PBS -l ncpus=1: although the job array will use 64 cores in total, each member of the job array will use only 1 core
- #PBS -t 101-164: task ids of the job array
- if we want to submit the job to a queue other than the default normal, add a line like #PBS -q hyper
- $PBS_ARRAYID: each member of the job array is assigned a unique identifier with the option -t above
To submit the job array, run:
qsub jobarray.pbs
Each member's standard output will be saved in a file whose name has the following form: job_name.ojob_ID-task_id. When our job array is completed, for example, we'll get jobarray.o12345-101, ..., jobarray.o12345-164.

    $ cat jobarray.o12345-103
    Hello master, I am slave no. 103 running on astro-3-5.local!
Assume that we've successfully compiled the sample MPI program mpi_hello.c in subsection Compiling MPI Programs, and we want to run the executable mpi_hello.x on 64 cores. First make sure the executable resides in the Lustre scratch storage (in /pfs/$USER or one of its subdirectories); then create a PBS script named impi.pbs in the same directory, with the following content:

    #!/bin/bash
    #PBS -N impi
    #PBS -l nodes=4:ppn=16
    #PBS -l walltime=0:10:00

    cd $PBS_O_WORKDIR
    mpirun -genv I_MPI_FABRICS shm:ofa -n 64 ./mpi_hello.x
Annotations of impi.pbs:
- #PBS -N impi: the job name is impi
- #PBS -l nodes=4:ppn=16: the job will run on 4 nodes (64 cores) in the default normal queue
- if we want to submit the job to the hyper queue instead, replace #PBS -l nodes=4:ppn=16 with the following 2 lines:
- #PBS -q hyper
- #PBS -l nodes=2:ppn=32
- cd $PBS_O_WORKDIR: required; PBS starts the script in your home directory on the executing compute node, so we change back to the directory from which the job was submitted
- -genv I_MPI_FABRICS shm:ofa: we use shared memory for intra-node communication and OFED verbs for inter-node communication
To submit the job, run:
qsub impi.pbs
To run the OpenMP executable omp_hello.x (compiled in subsection Compiling OpenMP Programs) on 16 cores of a compute node, create a PBS script named omp.pbs, with the following content:

    #!/bin/bash
    #PBS -N omp
    #PBS -l nodes=1:ppn=16
    #PBS -l walltime=0:10:00

    export OMP_NUM_THREADS=16
    cd $PBS_O_WORKDIR
    ./omp_hello.x
Annotations of omp.pbs:
- #PBS -N omp: the job name is omp
- #PBS -l nodes=1:ppn=16: we request 16 cores on a compute node
- #PBS -l walltime=0:10:00: we request 10 minutes of run time
- if we want to submit the job to a queue other than the default normal, add a line like #PBS -q hyper
- export OMP_NUM_THREADS=16: set the maximum number of OpenMP threads to 16
- cd $PBS_O_WORKDIR: required; PBS starts the script in your home directory on the executing compute node, so we change back to the directory from which the job was submitted
To submit the job, run:
qsub omp.pbs
To run the MPI/OpenMP hybrid hybrid_hello.x (compiled in subsection Compiling Hybrid Programs) on 64 cores (8 MPI processes and 8 OpenMP threads per MPI process), create a PBS script named hybrid.pbs, with the following content:

    #!/bin/bash
    #PBS -N hybrid
    #PBS -l nodes=4:ppn=16
    #PBS -l walltime=0:10:00

    cd $PBS_O_WORKDIR
    cat $PBS_NODEFILE | sort | uniq > hosts.$PBS_JOBID
    export OMP_NUM_THREADS=8
    export I_MPI_PIN_DOMAIN=omp
    export KMP_AFFINITY=compact
    mpirun -machine hosts.$PBS_JOBID -genv I_MPI_FABRICS shm:ofa -n 8 -ppn 2 ./hybrid_hello.x
Annotations of hybrid.pbs:
- #PBS -N hybrid: the job name is hybrid
- #PBS -l nodes=4:ppn=16: the job will run on 4 nodes (64 cores) in the default normal queue
- cd $PBS_O_WORKDIR: required; PBS starts the script in your home directory on the executing compute node, so we change back to the directory from which the job was submitted
- export OMP_NUM_THREADS=8: set the maximum number of OpenMP threads to 8
- export I_MPI_PIN_DOMAIN=omp: control process pinning; each MPI process is pinned to a domain of OMP_NUM_THREADS logical cores
- export KMP_AFFINITY=compact: bind OpenMP threads to physical processing units
- -genv I_MPI_FABRICS shm:ofa: we use shared memory for intra-node communication and OFED verbs for inter-node communication
- -ppn 2: 2 MPI processes per compute node (one on each processor)
- -n 8: 8 MPI processes in total
- if we want to submit the job to the hyper queue instead, use
- #PBS -q hyper
- #PBS -l nodes=2:ppn=32
- mpirun -genv I_MPI_FABRICS shm:ofa -n 8 -ppn 4 ./hybrid_hello.x
To submit the job, run:
qsub hybrid.pbs