GPU QuickStart Guide
Each of the 8 GPU nodes in Hyades contains an Nvidia Tesla K20 GPU Accelerator. Based on the Kepler architecture, the K20 accelerator boasts the following key features:
- GPU
  - Chip: Kepler GK110[1]
  - Streaming Multiprocessors (SMX): 13
  - CUDA cores: 2496
  - Double-precision units: 832
  - Core clock: 706 MHz
- Performance
  - Peak double-precision floating-point performance: 1.17 TFLOPS = 0.706 (GHz) x 832 (DP units) x 2 (FMA)
  - Peak single-precision floating-point performance: 3.52 TFLOPS = 0.706 (GHz) x 2496 (CUDA cores) x 2 (FMA)
- Memory
  - Memory size (GDDR5): 5 GB
  - Memory clock: 2.6 GHz
  - Memory bandwidth: 208 GB/s = 2.6 (GHz) x 2 (DDR) x 320 (bus width in bits) / 8 (bits per byte)
- CUDA compute capability: 3.5
CUDA[3] C is the C interface to the CUDA parallel computing platform[4][5]. It consists of a minimal set of extensions to the C programming language that allow users to program the GPU directly in a high-level language. It also includes a runtime library of C functions that execute on the host to allocate and deallocate device memory, transfer data between host memory and device memory, manage systems with multiple devices, etc. The runtime is built on top of a lower-level C API, the CUDA driver API, which is also accessible by the application.
CUDA C++ only supports a subset of C++ for the device code, as described in the CUDA C programming guide.
Here is a proof-of-concept CUDA C program (daxpy.cu) that computes DAXPY (error handling removed for code clarity):
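A minimal illustrative sketch follows; it is not necessarily identical to the daxpy.cu in /pfs/dong/gpu, and the vector length, initial values, and block size are arbitrary choices:

```c
#include <stdio.h>
#include <stdlib.h>

/* Each thread computes one element of y = a*x + y */
__global__ void daxpy(int n, double a, const double *x, double *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    int n = 1 << 20;
    double a = 2.0;
    size_t bytes = n * sizeof(double);

    /* allocate and initialize host arrays */
    double *x = (double *)malloc(bytes);
    double *y = (double *)malloc(bytes);
    for (int i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }

    /* allocate device memory and copy the inputs over */
    double *d_x, *d_y;
    cudaMalloc((void **)&d_x, bytes);
    cudaMalloc((void **)&d_y, bytes);
    cudaMemcpy(d_x, x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, bytes, cudaMemcpyHostToDevice);

    /* launch enough 256-thread blocks to cover all n elements */
    int block = 256;
    daxpy<<<(n + block - 1) / block, block>>>(n, a, d_x, d_y);

    /* copy the result back and spot-check it */
    cudaMemcpy(y, d_y, bytes, cudaMemcpyDeviceToHost);
    printf("y[0] = %f (expect 4.0)\n", y[0]);

    cudaFree(d_x); cudaFree(d_y);
    free(x); free(y);
    return 0;
}
```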
The code is based on a sample program by TAMU[6]. The original version, however, has several bugs!
To compile the sample code, first load the cuda module (for CUDA Toolkit 6.5):
module load cuda
You can append the above line to your ~/.bashrc to make it permanent. Note that although there is no Nvidia GPU in the master node Hyades, you can compile your CUDA codes there.
Compile the code with NVIDIA's CUDA Compiler (NVCC)[7]:
$ nvcc -Xcompiler "-O3" -gencode arch=compute_35,code=sm_35 -o daxpy.cu.x daxpy.cu
NOTE
- By default, nvcc invokes the GNU compiler gcc for host code compilation.
- If you prefer the Intel C/C++ compiler, use the option -ccbin icpc.
- Use the -Xcompiler option to pass options directly to the host compiler/preprocessor.
- By default, nvcc compiles codes for Fermi GPUs (the default is -arch=compute_20 -code=sm_20,compute_20).
- To compile codes for Nvidia Tesla K20 (a Kepler GPU), use option -gencode arch=compute_35,code=sm_35.
- For further details, run nvcc -h, or consult the documentation NVIDIA CUDA Compiler Driver NVCC.
Run the executable on one of the GPU nodes:
$ ssh gpu-1 /pfs/dong/gpu/daxpy.cu.x
For production runs, please submit your jobs to the PBS queue gpu.
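For reference, a minimal submission script might look like the sketch below; the queue name gpu comes from this page, while the job name, resource request, and executable path are placeholders you should adapt:

```bash
#!/bin/bash
#PBS -N daxpy            # job name (placeholder)
#PBS -q gpu              # the GPU queue
#PBS -l nodes=1:ppn=1    # resource request (an assumption; adjust as needed)

cd $PBS_O_WORKDIR        # run from the directory the job was submitted from
./daxpy.cu.x
```

Submit the script with qsub, e.g. $ qsub daxpy.pbs.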
Open Computing Language (OpenCL) is a framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, DSPs (digital signal processors), FPGAs (field-programmable gate arrays) and other processors. OpenCL includes a language (based on C99) for programming these devices, and APIs to control the platform and execute programs on the compute devices. OpenCL provides parallel computing using task-based and data-based parallelism.
The latest OpenCL specification is 2.0[8][9], released on November 18, 2013. However, as of December 2014, the Nvidia driver supports only OpenCL 1.1[10][11].
Here is a proof-of-concept OpenCL program (saxpy.cl.c) that computes SAXPY (error handling removed for code clarity):
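A minimal illustrative sketch follows; it assumes the first OpenCL platform exposes a GPU device, and it is not necessarily identical to the saxpy.cl.c in /pfs/dong/gpu:

```c
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

/* the kernel is passed to the OpenCL runtime as a source string */
static const char *kernel_src =
"__kernel void saxpy(const unsigned int n, const float a,        \n"
"                    __global const float *x, __global float *y) \n"
"{                                                               \n"
"    unsigned int i = get_global_id(0);                          \n"
"    if (i < n) y[i] = a * x[i] + y[i];                          \n"
"}                                                               \n";

int main(void)
{
    const unsigned int n = 1 << 20;
    const float a = 2.0f;
    size_t bytes = n * sizeof(float);

    float *x = (float *)malloc(bytes);
    float *y = (float *)malloc(bytes);
    for (unsigned int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    /* pick the first platform and its first GPU device */
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue queue = clCreateCommandQueue(context, device, 0, NULL);

    /* build the kernel from source at run time */
    cl_program program = clCreateProgramWithSource(context, 1, &kernel_src, NULL, NULL);
    clBuildProgram(program, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(program, "saxpy", NULL);

    /* device buffers, initialized from the host arrays */
    cl_mem d_x = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, bytes, x, NULL);
    cl_mem d_y = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, bytes, y, NULL);

    clSetKernelArg(kernel, 0, sizeof(unsigned int), &n);
    clSetKernelArg(kernel, 1, sizeof(float), &a);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &d_x);
    clSetKernelArg(kernel, 3, sizeof(cl_mem), &d_y);

    /* launch one work-item per element, then read the result back */
    size_t global = n;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(queue, d_y, CL_TRUE, 0, bytes, y, 0, NULL, NULL);

    printf("y[0] = %f (expect 4.0)\n", y[0]);

    clReleaseMemObject(d_x); clReleaseMemObject(d_y);
    clReleaseKernel(kernel); clReleaseProgram(program);
    clReleaseCommandQueue(queue); clReleaseContext(context);
    free(x); free(y);
    return 0;
}
```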
- Compile the code:
$ gcc -o saxpy.cl.x -Wall -I/pfs/sw/cuda/6.5/include -L/pfs/sw/cuda/6.5/lib64 -lOpenCL saxpy.cl.c
- Run the executable on one of the GPU nodes:
$ ssh gpu-1 /pfs/dong/gpu/saxpy.cl.x
NOTE
- The Nvidia OpenCL implementation does not offer the optional cl_khr_fp64 extension (for double-precision floating-point support)[12].
- The development of OpenCL support for the CUDA architecture appears to be stagnant; there has been no new release since CUDA Toolkit 4.2[13][14][15][16].
- On Nvidia cards, you are generally better off coding in CUDA C.
See OpenGL on Nvidia K20.
In mid 2009, the Portland Group (PGI) and NVIDIA cooperated to develop CUDA Fortran. CUDA Fortran includes a Fortran 2003 compiler and tool chain for programming NVIDIA GPUs using Fortran[17]. Just as CUDA C is C with extensions, CUDA Fortran is essentially modern Fortran with a few extensions that allow the user to leverage the power of GPUs in their computations[18].
Here is a proof-of-concept CUDA Fortran program (daxpy.cuf) that computes DAXPY:
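A minimal illustrative sketch follows (not necessarily identical to the daxpy.cuf in /pfs/dong/gpu); note the value attributes on scalar kernel arguments and the chevron syntax for the kernel launch:

```fortran
module daxpy_mod
  use cudafor
contains
  ! each thread computes one element of y = a*x + y
  attributes(global) subroutine daxpy_kernel(n, a, x, y)
    integer, value :: n
    real(8), value :: a
    real(8), device :: x(n), y(n)
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) y(i) = a * x(i) + y(i)
  end subroutine daxpy_kernel
end module daxpy_mod

program daxpy
  use cudafor
  use daxpy_mod
  implicit none
  integer, parameter :: n = 1048576
  real(8), parameter :: a = 2.0d0
  real(8) :: x(n), y(n)
  real(8), device :: x_d(n), y_d(n)

  x = 1.0d0
  y = 2.0d0
  x_d = x                 ! host-to-device copies via simple assignment
  y_d = y

  ! launch enough 256-thread blocks to cover all n elements
  call daxpy_kernel<<<(n + 255) / 256, 256>>>(n, a, x_d, y_d)

  y = y_d                 ! device-to-host copy
  print *, 'y(1) = ', y(1), ' (expect 4.0)'
end program daxpy
```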
The code is a complete rewrite of a sample program by TAMU[19].
To compile the code, first load the pgi module (for PGI Compilers 14.10):
module load pgi
Compile the code with PGI Fortran compiler pgfortran:
$ pgfortran -Mcuda=cuda6.5,cc35 daxpy.cuf -o daxpy.cuf.x
NOTE
- Here I use the .cuf suffix for CUDA Fortran programs; the conventional Fortran 90 suffix .f90 works too.
- To compile the code for Nvidia Tesla K20 (a Kepler GPU) using CUDA Toolkit 6.5, we use the option -Mcuda=cuda6.5,cc35.
- To learn more about PGI Fortran compiler, consult PGI Compiler User's Guide[20] or the man page (man pgfortran).
Run the executable on one of the GPU nodes:
$ ssh gpu-1 /pfs/dong/gpu/daxpy.cuf.x
For production runs, please submit your jobs to the PBS queue gpu.
OpenACC is an accelerator programming standard that enables Fortran and C/C++ programmers to easily take advantage of the power of heterogeneous CPU/accelerator systems. OpenACC allows programmers to use simple compiler directives to identify which areas of code to accelerate, without requiring modification to the underlying code itself. By identifying parallel code segments, OpenACC directives allow the compiler to do the detailed work of mapping the computation onto the accelerator[21][22][23].
Here is a proof-of-concept C program (daxpy.c) that computes DAXPY, using both OpenACC and OpenMP directives.
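A minimal illustrative sketch follows (not necessarily identical to the daxpy.c in /pfs/dong/gpu); it runs the same loop twice, once under an OpenMP directive and once under an OpenACC directive, so each of the four build variants below is meaningful:

```c
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int n = 1 << 20;
    double a = 2.0;
    double *x  = malloc(n * sizeof(double));
    double *y1 = malloc(n * sizeof(double));
    double *y2 = malloc(n * sizeof(double));

    for (int i = 0; i < n; i++) { x[i] = (double)i; y1[i] = 1.0; y2[i] = 1.0; }

    /* host version: threaded when built with -mp, serial otherwise */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y1[i] = a * x[i] + y1[i];

    /* accelerator version: offloaded when built with -acc, serial otherwise */
    #pragma acc kernels copyin(x[0:n]) copy(y2[0:n])
    for (int i = 0; i < n; i++)
        y2[i] = a * x[i] + y2[i];

    /* verify that the two versions agree */
    double maxerr = 0.0;
    for (int i = 0; i < n; i++) {
        double d = y1[i] - y2[i];
        if (d < 0.0) d = -d;
        if (d > maxerr) maxerr = d;
    }
    printf("max difference = %g\n", maxerr);

    free(x); free(y1); free(y2);
    return 0;
}
```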
And here is the Fortran version (daxpy.f90).
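Again, a minimal illustrative sketch rather than the exact daxpy.f90:

```fortran
program daxpy
  implicit none
  integer, parameter :: n = 1048576
  real(8), parameter :: a = 2.0d0
  real(8) :: x(n), y1(n), y2(n)
  integer :: i

  x = 1.0d0; y1 = 2.0d0; y2 = 2.0d0

  ! host version: threaded when built with -mp, serial otherwise
  !$omp parallel do
  do i = 1, n
     y1(i) = a * x(i) + y1(i)
  end do

  ! accelerator version: offloaded when built with -acc, serial otherwise
  !$acc kernels copyin(x) copy(y2)
  do i = 1, n
     y2(i) = a * x(i) + y2(i)
  end do
  !$acc end kernels

  ! verify that the two versions agree
  print *, 'max difference = ', maxval(abs(y1 - y2))
end program daxpy
```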
The codes are complete rewrites of sample programs at TAMU[24]. Depending on the compilation options, they can be built as serial, OpenMP, OpenACC, or OpenMP+OpenACC programs.
- On Hyades, OpenACC support is provided by the PGI Compilers. Let's first load the pgi module (for PGI Compilers 14.10):
module load pgi
- Compile the C code as a serial program (all directives ignored):
$ pgcc -o daxpy.x daxpy.c
- Compile the C code as an OpenMP program (OpenACC directives ignored):
$ pgcc -mp -o daxpy.omp.x daxpy.c
- Compile the C code as an OpenACC program (OpenMP directives ignored):
$ pgcc -acc -ta=nvidia,kepler -Mcuda=6.5 -o daxpy.acc.x daxpy.c
- Compile the C code as an OpenMP+OpenACC program:
$ pgcc -mp -acc -ta=nvidia,kepler -Mcuda=6.5 -o daxpy.omp.acc.x daxpy.c
NOTE
- pgcc is the PGI C compiler. Use the PGI Fortran compiler pgfortran to compile Fortran codes.
- Use the -mp option to build OpenMP programs with PGI compilers. For GCC, the option is -fopenmp; for Intel Compilers, it is -openmp.
- Use the option -acc -ta=nvidia,kepler -Mcuda=6.5 to build OpenACC executables that target the Nvidia Kepler architecture and use CUDA Toolkit 6.5.
Run the executable on one of the GPU nodes:
$ ssh gpu-1 OMP_NUM_THREADS=16 /pfs/dong/gpu/daxpy.omp.acc.x
For production runs, please submit your jobs to the PBS queue gpu.
NOTE
- If the environment variable OMP_NUM_THREADS is not set, an OpenMP executable built with PGI compilers will spawn only one thread!
- In contrast, an OpenMP executable built with either GCC or Intel Compilers will by default spawn as many threads as there are available cores.
NumbaPro is a Python compiler from Continuum Analytics that can compile Python code for execution on CUDA-capable GPUs or multicore CPUs. With NumbaPro, Python developers can define NumPy ufuncs and generalized ufuncs (gufuncs) in Python, which are compiled to machine code dynamically and loaded on the fly. Additionally, NumbaPro offers developers the ability to target multicore and GPU architectures with Python code for both ufuncs and general-purpose code[25].
Here is a proof-of-concept NumbaPro program (daxpy_numbapro.py) that computes DAXPY:
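A minimal illustrative sketch follows, using NumbaPro's @vectorize decorator to compile a NumPy ufunc for the GPU; the target keyword ('gpu') follows NumbaPro-era examples and may differ in other releases, and the sketch is not necessarily identical to daxpy_numbapro.py:

```python
import numpy as np
from numbapro import vectorize

# Compile a NumPy ufunc for the GPU: each element is computed by a CUDA thread.
@vectorize(['float64(float64, float64, float64)'], target='gpu')
def daxpy(a, x, y):
    return a * x + y

n = 1 << 20
a = 2.0
x = np.arange(n, dtype=np.float64)
y = np.ones(n, dtype=np.float64)

# Host arrays are transferred to and from the GPU automatically.
result = daxpy(a, x, y)
print(np.allclose(result, a * x + y))   # True if the ufunc worked
```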
NumbaPro is proprietary software. It is part of the Anaconda Accelerate product, an add-on to Continuum's free enterprise Python distribution, Anaconda, and requires a license to run. However, Continuum has kindly given us an Anaconda Academic License free of charge.
- The very first time you use NumbaPro, set up the license:
$ mkdir ~/.continuum
$ cp /pfs/sw/python/Anaconda-2.1.0/license_academic_20141216224815.txt ~/.continuum
- Run the sample Python code on one of the GPU nodes:
$ ssh gpu-1
[gpu-1]$ module load python/Anaconda-2.1.0
[gpu-1]$ cd /pfs/dong/gpu
[gpu-1]$ python daxpy_numbapro.py
NOTE
- To use NumbaPro, you must load the python/Anaconda-2.1.0 module (for Continuum's Anaconda Python distribution).
- For production runs, please submit your jobs to the PBS queue gpu.
Another option for accelerating Python code on an Nvidia GPU is PyCUDA. PyCUDA provides easy, Pythonic access to the CUDA parallel computation API[26]. Unlike NumbaPro, PyCUDA is MIT-licensed free and open-source software.
Here is a proof-of-concept PyCUDA program (daxpy_pycuda.py) that computes DAXPY:
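A minimal illustrative sketch follows (not necessarily identical to daxpy_pycuda.py); PyCUDA compiles the embedded CUDA C kernel at run time with SourceModule:

```python
import numpy as np
import pycuda.autoinit          # initializes the CUDA driver and a context
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

# CUDA C kernel, compiled on the fly by PyCUDA
mod = SourceModule("""
__global__ void daxpy(int n, double a, double *x, double *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}
""")
daxpy = mod.get_function("daxpy")

n = 1 << 20
a = 2.0
x = np.ones(n, dtype=np.float64)
y = 2.0 * np.ones(n, dtype=np.float64)

# allocate device memory and copy the inputs over
d_x = cuda.mem_alloc(x.nbytes)
d_y = cuda.mem_alloc(y.nbytes)
cuda.memcpy_htod(d_x, x)
cuda.memcpy_htod(d_y, y)

# launch enough 256-thread blocks to cover all n elements
block = 256
grid = (n + block - 1) // block
daxpy(np.int32(n), np.float64(a), d_x, d_y,
      block=(block, 1, 1), grid=(grid, 1))

# copy the result back and check it
result = np.empty_like(y)
cuda.memcpy_dtoh(result, d_y)
print(np.allclose(result, a * x + y))   # True if the kernel worked
```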
Run the sample Python code on one of the GPU nodes:
$ ssh gpu-1
[gpu-1]$ module load python
[gpu-1]$ cd /pfs/dong/gpu
[gpu-1]$ python daxpy_pycuda.py
NOTE
- To use PyCUDA, you must load the python/2.7.8 module (for the Python 2.7.8 distribution).
- For production runs, please submit your jobs to the PBS queue gpu.
PyOpenCL provides easy, Pythonic access to the OpenCL parallel computation API[27].
Here is a proof-of-concept PyOpenCL program (saxpy_opencl.py) that computes SAXPY:
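A minimal illustrative sketch follows (not necessarily identical to saxpy_opencl.py):

```python
import numpy as np
import pyopencl as cl

# create a context and queue (pick a device interactively or via PYOPENCL_CTX)
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

# OpenCL kernel, built from source at run time
kernel_src = """
__kernel void saxpy(const unsigned int n, const float a,
                    __global const float *x, __global float *y)
{
    unsigned int i = get_global_id(0);
    if (i < n) y[i] = a * x[i] + y[i];
}
"""
prg = cl.Program(ctx, kernel_src).build()

n = 1 << 20
a = np.float32(2.0)
x = np.arange(n, dtype=np.float32)
y = np.ones(n, dtype=np.float32)

# device buffers, initialized from the host arrays
mf = cl.mem_flags
d_x = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=x)
d_y = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=y)

# launch one work-item per element, then read the result back
prg.saxpy(queue, (n,), None, np.uint32(n), a, d_x, d_y)
result = np.empty_like(y)
cl.enqueue_copy(queue, result, d_y)

print(np.allclose(result, a * x + y))   # True if the kernel worked
```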
Run the sample Python code on one of the GPU nodes:
$ ssh gpu-1
[gpu-1]$ module load python
[gpu-1]$ cd /pfs/dong/gpu
[gpu-1]$ python saxpy_opencl.py
NOTE
- To use PyOpenCL, you must load the python/2.7.8 module (for the Python 2.7.8 distribution).
- On Nvidia cards, you are generally better off coding in PyCUDA than in PyOpenCL.
- For production runs, please submit your jobs to the PBS queue gpu.
- ^ NVIDIA Kepler GK110 Architecture Whitepaper
- ^ Six Ways to SAXPY
- ^ CUDA Toolkit Documentation - v6.5
- ^ CUDA C Programming Guide
- ^ CUDA C Best Practices Guide
- ^ Compiling and Running CUDA Programs
- ^ NVIDIA CUDA Compiler Driver NVCC
- ^ OpenCL 2.0 Specification
- ^ OpenCL 2.0 Reference Card
- ^ OpenCL 1.1 Specification
- ^ OpenCL API 1.1 Quick Reference Card
- ^ cl_khr_fp64 - support for double floating-point precision
- ^ OpenCL Programming Guide for the CUDA Architecture
- ^ OpenCL Programming Overview for the CUDA Architecture
- ^ OpenCL Best Practices Guide
- ^ OpenCL Jumpstart Guide
- ^ PGI CUDA Fortran Compiler
- ^ CUDA Fortran Programming Guide and Reference (Version 2014)
- ^ Compiling CUDA Fortran with PGI Compilers
- ^ PGI Compiler User's Guide (Version 2014)
- ^ PGI Accelerator Compilers With OpenACC Directives
- ^ OpenACC 2.0a Specification
- ^ OpenACC 2.0 Quick Reference Guide
- ^ Compiling OpenACC with PGI Compilers
- ^ NumbaPro
- ^ PyCUDA documentation
- ^ PyOpenCL documentation