GPU QuickStart Guide
Each of the 8 GPU nodes in Hyades contains an Nvidia Tesla K20 GPU Accelerator. Based on the Kepler architecture, the K20 accelerator boasts the following key features:
- GPU
  - Chip: Kepler GK110[1]
  - Streaming Multiprocessors (SMX): 13
  - CUDA cores: 2496
  - Double-precision units: 832
  - Core clock: 706 MHz
- Performance
  - Peak double-precision floating-point performance: 1.17 TFLOPS = 0.706 (GHz) x 832 (DP units) x 2 (FMA)
  - Peak single-precision floating-point performance: 3.52 TFLOPS = 0.706 (GHz) x 2496 (CUDA cores) x 2 (FMA)
- Memory
  - Memory size (GDDR5): 5 GB
  - Memory clock: 2.6 GHz
  - Memory bandwidth: 208 GB/s = 2.6 (GHz) x 2 (DDR) x 320 (bus width in bits) / 8 (bits per byte)
- CUDA compute capability: 3.5
CUDA[3] C is the C interface to the CUDA parallel computing platform[4][5]. It consists of a minimal set of extensions to the C programming language that allow users to program the GPU directly in a high-level language. It also includes a runtime library of C functions that execute on the host to allocate and deallocate device memory, transfer data between host memory and device memory, manage systems with multiple devices, etc. The runtime is built on top of a lower-level C API, the CUDA driver API, which is also accessible by the application.
CUDA C++ only supports a subset of C++ for the device code, as described in the CUDA C programming guide.
Here is a proof-of-concept CUDA C program (daxpy.cu) that computes DAXPY (error handling removed for code clarity):
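A minimal illustrative sketch follows; it is not necessarily identical to the daxpy.cu in /pfs/dong/gpu, and the vector length, initial values, and block size are arbitrary choices:

```c
#include <stdio.h>
#include <stdlib.h>

/* Each thread computes one element of y = a*x + y */
__global__ void daxpy(int n, double a, const double *x, double *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    int n = 1 << 20;
    double a = 2.0;
    size_t bytes = n * sizeof(double);

    /* allocate and initialize host arrays */
    double *x = (double *)malloc(bytes);
    double *y = (double *)malloc(bytes);
    for (int i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }

    /* allocate device memory and copy the inputs over */
    double *d_x, *d_y;
    cudaMalloc((void **)&d_x, bytes);
    cudaMalloc((void **)&d_y, bytes);
    cudaMemcpy(d_x, x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, bytes, cudaMemcpyHostToDevice);

    /* launch enough 256-thread blocks to cover all n elements */
    int block = 256;
    daxpy<<<(n + block - 1) / block, block>>>(n, a, d_x, d_y);

    /* copy the result back and spot-check it */
    cudaMemcpy(y, d_y, bytes, cudaMemcpyDeviceToHost);
    printf("y[0] = %f (expect 4.0)\n", y[0]);

    cudaFree(d_x); cudaFree(d_y);
    free(x); free(y);
    return 0;
}
```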
The code is based on a sample program by TAMU[6]. The original version, however, has several bugs!
To compile the sample code, first load the cuda module (for CUDA Toolkit 6.5):
module load cuda
You can append the above line to your ~/.bashrc to make it permanent. Note that although there is no Nvidia GPU in the master node Hyades, you can compile your CUDA codes there.
Compile the code with NVIDIA's CUDA Compiler (NVCC)[7]:
$ nvcc -Xcompiler "-O3" -gencode arch=compute_35,code=sm_35 -o daxpy.cu.x daxpy.cu
NOTE
- By default, nvcc invokes the GNU compiler gcc for host code compilation.
- If you prefer the Intel C/C++ compiler, use the option -ccbin icpc.
- Use the -Xcompiler option to pass options directly to the host compiler/preprocessor.
- By default, nvcc compiles codes for Fermi GPUs (the default is -arch=compute_20 -code=sm_20,compute_20).
- To compile codes for Nvidia Tesla K20 (a Kepler GPU), use option -gencode arch=compute_35,code=sm_35.
- For further details, run nvcc -h, or consult the documentation NVIDIA CUDA Compiler Driver NVCC.
Run the executable on one of the GPU nodes:
$ ssh gpu-1 /pfs/dong/gpu/daxpy.cu.x
For production runs, please submit your jobs to the PBS queue gpu.
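For reference, a minimal submission script might look like the sketch below; the queue name gpu comes from this page, while the job name, resource request, and executable path are placeholders you should adapt:

```bash
#!/bin/bash
#PBS -N daxpy            # job name (placeholder)
#PBS -q gpu              # the GPU queue
#PBS -l nodes=1:ppn=1    # resource request (an assumption; adjust as needed)

cd $PBS_O_WORKDIR        # run from the directory the job was submitted from
./daxpy.cu.x
```

Submit the script with qsub, e.g. $ qsub daxpy.pbs.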
Open Computing Language (OpenCL) is a framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, DSPs (digital signal processors), FPGAs (field-programmable gate arrays) and other processors. OpenCL includes a language (based on C99) for programming these devices, and APIs to control the platform and execute programs on the compute devices. OpenCL provides parallel computing using task-based and data-based parallelism.
The latest OpenCL specification is 2.0[8][9], released on November 18, 2013. However, as of December 2014, the Nvidia driver supports only OpenCL 1.1[10][11].
Here is a proof-of-concept OpenCL program (saxpy.cl.c) that computes SAXPY (error handling removed for code clarity):
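A minimal illustrative sketch follows; it assumes the first OpenCL platform exposes a GPU device, and it is not necessarily identical to the saxpy.cl.c in /pfs/dong/gpu:

```c
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

/* the kernel is passed to the OpenCL runtime as a source string */
static const char *kernel_src =
"__kernel void saxpy(const unsigned int n, const float a,        \n"
"                    __global const float *x, __global float *y) \n"
"{                                                               \n"
"    unsigned int i = get_global_id(0);                          \n"
"    if (i < n) y[i] = a * x[i] + y[i];                          \n"
"}                                                               \n";

int main(void)
{
    const unsigned int n = 1 << 20;
    const float a = 2.0f;
    size_t bytes = n * sizeof(float);

    float *x = (float *)malloc(bytes);
    float *y = (float *)malloc(bytes);
    for (unsigned int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    /* pick the first platform and its first GPU device */
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue queue = clCreateCommandQueue(context, device, 0, NULL);

    /* build the kernel from source at run time */
    cl_program program = clCreateProgramWithSource(context, 1, &kernel_src, NULL, NULL);
    clBuildProgram(program, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(program, "saxpy", NULL);

    /* device buffers, initialized from the host arrays */
    cl_mem d_x = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, bytes, x, NULL);
    cl_mem d_y = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, bytes, y, NULL);

    clSetKernelArg(kernel, 0, sizeof(unsigned int), &n);
    clSetKernelArg(kernel, 1, sizeof(float), &a);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &d_x);
    clSetKernelArg(kernel, 3, sizeof(cl_mem), &d_y);

    /* launch one work-item per element, then read the result back */
    size_t global = n;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(queue, d_y, CL_TRUE, 0, bytes, y, 0, NULL, NULL);

    printf("y[0] = %f (expect 4.0)\n", y[0]);

    clReleaseMemObject(d_x); clReleaseMemObject(d_y);
    clReleaseKernel(kernel); clReleaseProgram(program);
    clReleaseCommandQueue(queue); clReleaseContext(context);
    free(x); free(y);
    return 0;
}
```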
- Compile the code:
$ gcc -o saxpy.cl.x -Wall -I/pfs/sw/cuda/6.5/include -L/pfs/sw/cuda/6.5/lib64 -lOpenCL saxpy.cl.c
- Run the executable on one of the GPU nodes:
$ ssh gpu-1 /pfs/dong/gpu/saxpy.cl.x
NOTE
- The Nvidia OpenCL implementation does not offer the optional cl_khr_fp64 extension (for double-precision floating-point support)[12].
- The development of OpenCL support for the CUDA architecture appears to be stagnant; there has been no new release since CUDA Toolkit 4.2[13][14][15][16].
- On Nvidia cards, you are generally better off coding in CUDA C.
See OpenGL on Nvidia K20.
In mid 2009, the Portland Group (PGI) and NVIDIA cooperated to develop CUDA Fortran. CUDA Fortran includes a Fortran 2003 compiler and tool chain for programming NVIDIA GPUs using Fortran[17]. Just as CUDA C is C with extensions, CUDA Fortran is essentially modern Fortran with a few extensions that allow the user to leverage the power of GPUs in their computations[18].
Here is a proof-of-concept CUDA Fortran program (daxpy.cuf) that computes DAXPY:
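A minimal illustrative sketch follows (not necessarily identical to the daxpy.cuf in /pfs/dong/gpu); note the value attributes on scalar kernel arguments and the chevron syntax for the kernel launch:

```fortran
module daxpy_mod
  use cudafor
contains
  ! each thread computes one element of y = a*x + y
  attributes(global) subroutine daxpy_kernel(n, a, x, y)
    integer, value :: n
    real(8), value :: a
    real(8), device :: x(n), y(n)
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) y(i) = a * x(i) + y(i)
  end subroutine daxpy_kernel
end module daxpy_mod

program daxpy
  use cudafor
  use daxpy_mod
  implicit none
  integer, parameter :: n = 1048576
  real(8), parameter :: a = 2.0d0
  real(8) :: x(n), y(n)
  real(8), device :: x_d(n), y_d(n)

  x = 1.0d0
  y = 2.0d0
  x_d = x                 ! host-to-device copies via simple assignment
  y_d = y

  ! launch enough 256-thread blocks to cover all n elements
  call daxpy_kernel<<<(n + 255) / 256, 256>>>(n, a, x_d, y_d)

  y = y_d                 ! device-to-host copy
  print *, 'y(1) = ', y(1), ' (expect 4.0)'
end program daxpy
```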
The code is a complete rewrite of a sample program by TAMU[19].
To compile the code, first load the pgi module (for PGI Compilers 14.10):
module load pgi
Compile the code with PGI Fortran compiler pgfortran:
$ pgfortran -Mcuda=cuda6.5,cc35 daxpy.cuf -o daxpy.cuf.x
NOTE
- Here I use the .cuf suffix for CUDA Fortran programs; the conventional Fortran 90 suffix .f90 works too.
- To compile the code for Nvidia Tesla K20 (a Kepler GPU) using CUDA Toolkit 6.5, we use the option -Mcuda=cuda6.5,cc35.
- To learn more about PGI Fortran compiler, consult PGI Compiler User's Guide[20] or the man page (man pgfortran).
Run the executable on one of the GPU nodes:
$ ssh gpu-1 /pfs/dong/gpu/daxpy.cuf.x
For production runs, please submit your jobs to the PBS queue gpu.
OpenACC is an accelerator programming standard that enables Fortran and C/C++ programmers to easily take advantage of the power of heterogeneous CPU/accelerator systems. OpenACC allows programmers to use simple compiler directives to identify which areas of code to accelerate, without requiring modification to the underlying code itself. By identifying parallel code segments, OpenACC directives allow the compiler to do the detailed work of mapping the computation onto the accelerator[21][22][23].
Here is a proof-of-concept C program (daxpy.c) that computes DAXPY, using both OpenACC and OpenMP directives.
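A minimal illustrative sketch follows (not necessarily identical to the daxpy.c in /pfs/dong/gpu); it runs the same loop twice, once under an OpenMP directive and once under an OpenACC directive, so each of the four build variants below is meaningful:

```c
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int n = 1 << 20;
    double a = 2.0;
    double *x  = malloc(n * sizeof(double));
    double *y1 = malloc(n * sizeof(double));
    double *y2 = malloc(n * sizeof(double));

    for (int i = 0; i < n; i++) { x[i] = (double)i; y1[i] = 1.0; y2[i] = 1.0; }

    /* host version: threaded when built with -mp, serial otherwise */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y1[i] = a * x[i] + y1[i];

    /* accelerator version: offloaded when built with -acc, serial otherwise */
    #pragma acc kernels copyin(x[0:n]) copy(y2[0:n])
    for (int i = 0; i < n; i++)
        y2[i] = a * x[i] + y2[i];

    /* verify that the two versions agree */
    double maxerr = 0.0;
    for (int i = 0; i < n; i++) {
        double d = y1[i] - y2[i];
        if (d < 0.0) d = -d;
        if (d > maxerr) maxerr = d;
    }
    printf("max difference = %g\n", maxerr);

    free(x); free(y1); free(y2);
    return 0;
}
```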
And here is the Fortran version (daxpy.f90).
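Again, a minimal illustrative sketch rather than the exact daxpy.f90:

```fortran
program daxpy
  implicit none
  integer, parameter :: n = 1048576
  real(8), parameter :: a = 2.0d0
  real(8) :: x(n), y1(n), y2(n)
  integer :: i

  x = 1.0d0; y1 = 2.0d0; y2 = 2.0d0

  ! host version: threaded when built with -mp, serial otherwise
  !$omp parallel do
  do i = 1, n
     y1(i) = a * x(i) + y1(i)
  end do

  ! accelerator version: offloaded when built with -acc, serial otherwise
  !$acc kernels copyin(x) copy(y2)
  do i = 1, n
     y2(i) = a * x(i) + y2(i)
  end do
  !$acc end kernels

  ! verify that the two versions agree
  print *, 'max difference = ', maxval(abs(y1 - y2))
end program daxpy
```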
The codes are complete rewrites of sample programs at TAMU[24]. Depending on the compilation options, they can be built as serial, OpenMP, OpenACC, or OpenMP+OpenACC programs.
- On Hyades, OpenACC support is provided by the PGI Compilers. Let's first load the pgi module (for PGI Compilers 14.10):
module load pgi
- Compile the C code as a serial program (all directives ignored):
$ pgcc -o daxpy.x daxpy.c
- Compile the C code as an OpenMP program (OpenACC directives ignored):
$ pgcc -mp -o daxpy.omp.x daxpy.c
- Compile the C code as an OpenACC program (OpenMP directives ignored):
$ pgcc -acc -ta=nvidia,kepler -Mcuda=6.5 -o daxpy.acc.x daxpy.c
- Compile the C code as an OpenMP+OpenACC program:
$ pgcc -mp -acc -ta=nvidia,kepler -Mcuda=6.5 -o daxpy.omp.acc.x daxpy.c
NOTE
- pgcc is the PGI C compiler. Use the PGI Fortran compiler pgfortran to compile Fortran codes.
- Use the -mp option to build OpenMP programs with PGI compilers. For GCC, the option is -fopenmp; for Intel Compilers, it is -openmp.
- Use the option -acc -ta=nvidia,kepler -Mcuda=6.5 to build OpenACC executables that target the Nvidia Kepler architecture and use CUDA Toolkit 6.5.
Run the executable on one of the GPU nodes:
$ ssh gpu-1 OMP_NUM_THREADS=16 /pfs/dong/gpu/daxpy.omp.acc.x
For production runs, please submit your jobs to the PBS queue gpu.
NOTE
- If the environment variable OMP_NUM_THREADS is not set, an OpenMP executable built with PGI compilers will spawn only one thread!
- In contrast, an OpenMP executable built with either GCC or Intel Compilers will by default spawn as many threads as there are available cores.
NumbaPro is a Python compiler from Continuum Analytics that can compile Python code for execution on CUDA-capable GPUs or multicore CPUs. With NumbaPro, Python developers can define NumPy ufuncs and generalized ufuncs (gufuncs) in Python, which are compiled to machine code dynamically and loaded on the fly. Additionally, NumbaPro offers developers the ability to target multicore and GPU architectures with Python code for both ufuncs and general-purpose code[25].
Here is a proof-of-concept NumbaPro program (daxpy_numbapro.py) that computes DAXPY:
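A minimal illustrative sketch follows, using NumbaPro's @vectorize decorator to compile a NumPy ufunc for the GPU; the target keyword ('gpu') follows NumbaPro-era examples and may differ in other releases, and the sketch is not necessarily identical to daxpy_numbapro.py:

```python
import numpy as np
from numbapro import vectorize

# Compile a NumPy ufunc for the GPU: each element is computed by a CUDA thread.
@vectorize(['float64(float64, float64, float64)'], target='gpu')
def daxpy(a, x, y):
    return a * x + y

n = 1 << 20
a = 2.0
x = np.arange(n, dtype=np.float64)
y = np.ones(n, dtype=np.float64)

# Host arrays are transferred to and from the GPU automatically.
result = daxpy(a, x, y)
print(np.allclose(result, a * x + y))   # True if the ufunc worked
```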
NumbaPro is proprietary software. It is part of the Anaconda Accelerate product, an add-on to Continuum's free enterprise Python distribution, Anaconda, and requires a license to run. However, Continuum has kindly given us an Anaconda Academic License free of charge.
- The very first time you use NumbaPro, set up the license:
$ mkdir ~/.continuum
$ cp /pfs/sw/python/Anaconda-2.1.0/license_academic_20141216224815.txt ~/.continuum
- Run the sample Python code on one of the GPU nodes:
$ ssh gpu-1
[gpu-1]$ module load python/Anaconda-2.1.0
[gpu-1]$ cd /pfs/dong/gpu
[gpu-1]$ python daxpy_numbapro.py
NOTE
- To use NumbaPro, you must load the python/Anaconda-2.1.0 module (for Continuum's Anaconda Python distribution).
- For production runs, please submit your jobs to the PBS queue gpu.
Another option for accelerating Python code on an Nvidia GPU is PyCUDA. PyCUDA provides easy, Pythonic access to the CUDA parallel computation API[26]. Unlike NumbaPro, PyCUDA is MIT-licensed free and open-source software.
Here is a proof-of-concept PyCUDA program (daxpy_pycuda.py) that computes DAXPY:
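A minimal illustrative sketch follows (not necessarily identical to daxpy_pycuda.py); PyCUDA compiles the embedded CUDA C kernel at run time with SourceModule:

```python
import numpy as np
import pycuda.autoinit          # initializes the CUDA driver and a context
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

# CUDA C kernel, compiled on the fly by PyCUDA
mod = SourceModule("""
__global__ void daxpy(int n, double a, double *x, double *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}
""")
daxpy = mod.get_function("daxpy")

n = 1 << 20
a = 2.0
x = np.ones(n, dtype=np.float64)
y = 2.0 * np.ones(n, dtype=np.float64)

# allocate device memory and copy the inputs over
d_x = cuda.mem_alloc(x.nbytes)
d_y = cuda.mem_alloc(y.nbytes)
cuda.memcpy_htod(d_x, x)
cuda.memcpy_htod(d_y, y)

# launch enough 256-thread blocks to cover all n elements
block = 256
grid = (n + block - 1) // block
daxpy(np.int32(n), np.float64(a), d_x, d_y,
      block=(block, 1, 1), grid=(grid, 1))

# copy the result back and check it
result = np.empty_like(y)
cuda.memcpy_dtoh(result, d_y)
print(np.allclose(result, a * x + y))   # True if the kernel worked
```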
Run the sample Python code on one of the GPU nodes:
$ ssh gpu-1
[gpu-1]$ module load python
[gpu-1]$ cd /pfs/dong/gpu
[gpu-1]$ python daxpy_pycuda.py
NOTE
- To use PyCUDA, you must load the python/2.7.8 module (for the Python 2.7.8 distribution).
- For production runs, please submit your jobs to the PBS queue gpu.
PyOpenCL provides easy, Pythonic access to the OpenCL parallel computation API[27].
Here is a proof-of-concept PyOpenCL program (saxpy_opencl.py) that computes SAXPY:
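A minimal illustrative sketch follows (not necessarily identical to saxpy_opencl.py):

```python
import numpy as np
import pyopencl as cl

# create a context and queue (pick a device interactively or via PYOPENCL_CTX)
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

# OpenCL kernel, built from source at run time
kernel_src = """
__kernel void saxpy(const unsigned int n, const float a,
                    __global const float *x, __global float *y)
{
    unsigned int i = get_global_id(0);
    if (i < n) y[i] = a * x[i] + y[i];
}
"""
prg = cl.Program(ctx, kernel_src).build()

n = 1 << 20
a = np.float32(2.0)
x = np.arange(n, dtype=np.float32)
y = np.ones(n, dtype=np.float32)

# device buffers, initialized from the host arrays
mf = cl.mem_flags
d_x = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=x)
d_y = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=y)

# launch one work-item per element, then read the result back
prg.saxpy(queue, (n,), None, np.uint32(n), a, d_x, d_y)
result = np.empty_like(y)
cl.enqueue_copy(queue, result, d_y)

print(np.allclose(result, a * x + y))   # True if the kernel worked
```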
Run the sample Python code on one of the GPU nodes:
$ ssh gpu-1
[gpu-1]$ module load python
[gpu-1]$ cd /pfs/dong/gpu
[gpu-1]$ python saxpy_opencl.py
NOTE
- To use PyOpenCL, you must load the python/2.7.8 module (for the Python 2.7.8 distribution).
- On Nvidia cards, you are generally better off coding in PyCUDA than in PyOpenCL.
- For production runs, please submit your jobs to the PBS queue gpu.
- ^ NVIDIA Kepler GK110 Architecture Whitepaper
- ^ Six Ways to SAXPY
- ^ CUDA Toolkit Documentation - v6.5
- ^ CUDA C Programming Guide
- ^ CUDA C Best Practices Guide
- ^ Compiling and Running CUDA Programs
- ^ NVIDIA CUDA Compiler Driver NVCC
- ^ OpenCL 2.0 Specification
- ^ OpenCL 2.0 Reference Card
- ^ OpenCL 1.1 Specification
- ^ OpenCL API 1.1 Quick Reference Card
- ^ cl_khr_fp64 - support for double floating-point precision
- ^ OpenCL Programming Guide for the CUDA Architecture
- ^ OpenCL Programming Overview for the CUDA Architecture
- ^ OpenCL Best Practices Guide
- ^ OpenCL Jumpstart Guide
- ^ PGI CUDA Fortran Compiler
- ^ CUDA Fortran Programming Guide and Reference (Version 2014)
- ^ Compiling CUDA Fortran with PGI Compilers
- ^ PGI Compiler User's Guide (Version 2014)
- ^ PGI Accelerator Compilers With OpenACC Directives
- ^ OpenACC 2.0a Specification
- ^ OpenACC 2.0 Quick Reference Guide
- ^ Compiling OpenACC with PGI Compilers
- ^ NumbaPro
- ^ PyCUDA documentation
- ^ PyOpenCL documentation