BLAS - shawfdong/hyades GitHub Wiki
BLAS (Basic Linear Algebra Subprograms) are a set of low-level subroutines that perform common linear algebra operations such as copying, vector scaling, vector dot products, linear combinations, and matrix multiplication. BLAS are used as a building block in higher-level math programming languages and libraries, including LAPACK, NumPy and R.
BLAS functionality is divided into three levels: 1, 2 and 3[1][2].
- Level 1
- vector-vector operations that are linear (O(n)) in data and linear (O(n)) in work.
- Level 2
- matrix-vector operations that are quadratic (O(n²)) in data and quadratic (O(n²)) in work.
- Level 3
- matrix-matrix operations that are quadratic (O(n²)) in data and cubic (O(n³)) in work.
BLAS routine names carry a one-letter prefix that indicates the precision of the data they operate on:

| Prefix | Precision |
| --- | --- |
| S | Real single precision |
| D | Real double precision |
| C | Complex single precision |
| Z | Complex double precision |
There are many implementations of BLAS available. We've installed a few on Hyades.
The official reference implementation on Netlib provides a platform-independent implementation of BLAS, but without any attempt at optimizing performance. It is written in Fortran 77.
Netlib BLAS can be downloaded separately, or as part of LAPACK.
Download LAPACK 3.5.0:
$ cd /scratch
$ wget http://www.netlib.org/lapack/lapack-3.5.0.tgz
$ tar xvfz lapack-3.5.0.tgz
$ cd lapack-3.5.0
Create the file make.inc (based on the provided make.inc.example):
SHELL = /bin/sh
FORTRAN = gfortran
OPTS = -O3 -frecursive -march=native -fPIC
DRVOPTS = $(OPTS)
NOOPT = -O0 -frecursive -fPIC
LOADER = gfortran
LOADOPTS =
TIMER = INT_ETIME
CC = gcc
CFLAGS = -O3 -march=native -fPIC
ARCH = ar
ARCHFLAGS= cr
RANLIB = ranlib
XBLASLIB =
BLASLIB = ../../libblas.a
LAPACKLIB = liblapack.a
TMGLIB = libtmg.a
LAPACKELIB = liblapacke.a
Compile BLAS:
$ make blaslib
Compile LAPACK:
$ make
Netlib BLAS and LAPACK are installed at /pfs/sw/serial/gcc/lapack-3.5.0.
To link with the Netlib BLAS library, using gfortran:
$ gfortran -o blaspgm.x blaspgm.f -L/pfs/sw/serial/gcc/lapack-3.5.0/lib -lblas
To link with the Netlib BLAS library, using the Intel Fortran Compiler:
$ ifort -o blaspgm.x blaspgm.f -L/pfs/sw/serial/gcc/lapack-3.5.0/lib -lblas -lgfortran
There are C and C++ interfaces to BLAS. It is also possible and popular to call the Fortran BLAS from C and C++.
Fortran subroutines are the equivalent of C functions returning void. When compiling, most Fortran compilers append an underscore (_) to the subroutine name[3]. For example[4]:
$ nm /pfs/sw/serial/gcc/lapack-3.5.0/lib/libblas.a | grep sgemm
sgemm.o:
0000000000000000 T sgemm_
To call, e.g., the Fortran subroutine sgemm (matrix matrix multiply) from C, first declare its prototype in the C code:
extern void sgemm_( char *, char *, int *, int *, int *, float *, float *, int *, float *, int *, float *, float *, int * );
To compile a C program and link with the Netlib Fortran BLAS library, use the following flags:
-L/pfs/sw/serial/gcc/lapack-3.5.0/lib -lblas -lgfortran
To call, e.g., the Fortran subroutine sgemm (matrix matrix multiply) from C++, first declare its prototype in the C++ code:
extern "C" void sgemm_( char *, char *, int *, int *, int *, float *, float *, int *, float *, int *, float *, float *, int * );
To compile a C++ program and link with the Netlib Fortran BLAS library, use the following flags:
-L/pfs/sw/serial/gcc/lapack-3.5.0/lib -lblas -lgfortran
Netlib also provides a reference implementation of the C interface to BLAS (CBLAS).
Download Netlib CBLAS tar ball:
$ cd /scratch
$ wget http://www.netlib.org/blas/blast-forum/cblas.tgz
$ tar xvfz cblas.tgz
$ cd CBLAS
Modify Makefile.in so that it reads as follows:
SHELL = /bin/sh
BLLIB = /pfs/sw/serial/gcc/lapack-3.5.0/lib/libblas.a
CBLIB = ../lib/libcblas.a
CC = gcc
FC = gfortran
LOADER = $(FC)
CFLAGS = -O3 -DADD_ -march=native -fPIC
FFLAGS = -O3 -march=native -fPIC
ARCH = ar
ARCHFLAGS = cr
RANLIB = ranlib
Compile CBLAS:
$ make
Install CBLAS:
$ cp -r include lib /pfs/sw/serial/gcc/lapack-3.5.0/
Netlib CBLAS is installed at /pfs/sw/serial/gcc/lapack-3.5.0 too.
To facilitate the usage of the Netlib libraries, I've created a module lapack/s_gcc_netlib_3.5.0 to set up their environment. If you load the module, you can use more concise commands to link with the Netlib libraries. For example:
$ module load lapack/s_gcc_netlib_3.5.0
$ gcc -o cblaspgm.x cblaspgm.c -lcblas -lblas -lgfortran
Main article: ATLAS
ATLAS (Automatically Tuned Linear Algebra Software) is an efficient, open-source, and complete implementation of the BLAS APIs for C and Fortran 77. It also implements a few routines from LAPACK. While its performance often trails that of specialized libraries written for one specific hardware platform, e.g., Intel MKL, it is a large improvement over the reference Netlib BLAS.
The ATLAS installation includes libraries for BLAS, CBLAS, LAPACK and ATLAS's clapack[5] (not to be confused with Netlib CLAPACK).
Main article: OpenBLAS
OpenBLAS is an optimized BLAS library based on GotoBLAS2. GotoBLAS, GotoBLAS2 and OpenBLAS are related implementations of the BLAS API with many hand-crafted optimizations for specific processor types. OpenBLAS adds optimized implementations of linear algebra kernels for several processor architectures, including Intel Sandy Bridge, which is the processor of choice for the Hyades cluster. It claims to achieve performance comparable to the Intel MKL.
The OpenBLAS library libopenblas.a contains object code for all routines in BLAS, CBLAS, LAPACK, and LAPACKE.
Main article: Intel MKL
Intel MKL (Math Kernel Library) is a library of optimized math routines for science, engineering, and financial applications. Core math functions include BLAS, LAPACK, ScaLAPACK, sparse solvers, fast Fourier transforms, and vector math. The routines in MKL are hand-optimized specifically for Intel processors.
Main article: GSL
GSL (GNU Scientific Library) is a numerical library for C and C++ programmers. It provides a wide range of mathematical routines such as random number generators, special functions and least-squares fitting. GSL 1.16, compiled with GCC, is installed at /pfs/sw/serial/gcc/gsl-1.16.
GSL includes BLAS support. To use the CBLAS library provided by GSL, include the appropriate GSL header in your C/C++ code:
#include <gsl/gsl_cblas.h>
To compile and link with GSL:
$ gcc -o cblaspgm.x cblaspgm.c -I/pfs/sw/serial/gcc/gsl-1.16/include \ -L/pfs/sw/serial/gcc/gsl-1.16/lib -lgsl -lgslcblas
or
$ module load gsl $ gcc -o cblaspgm.x cblaspgm.c -lgsl -lgslcblas
Boost includes uBLAS, a C++ template class library that provides BLAS level 1, 2, 3 functionality for dense, packed and sparse matrices. The design and implementation unify mathematical notation via operator overloading and efficient code generation via expression templates[6].
There are a few uBLAS examples at http://www.guwi17.de/ublas/examples/. To compile, e.g., the C++ program for Example 6 (Solve a System of Linear Equations using GMRES):
$ g++ -o gmres.x main_gmres.cpp -I/pfs/sw/serial/gcc/boost-1.57.0/include
Note that uBLAS is a header-only library; no library needs to be linked.
The cuBLAS library is an implementation of BLAS (Basic Linear Algebra Subprograms) on top of the NVIDIA CUDA runtime. It is not, however, a drop-in replacement for the standard BLAS. One must use the cuBLAS API or the newer cuBLAS-XT API to access the cuBLAS library. Consult the cuBLAS User Guide for details.
The NVBLAS library is a drop-in replacement for the standard BLAS. It can accelerate most BLAS Level-3 routines by dynamically routing BLAS calls to one or more NVIDIA GPUs present in the system, when the characteristics of the call suggest that it will run faster on a GPU. NVBLAS is built on top of the cuBLAS library using only the CUBLASXT API. NVBLAS also requires a CPU BLAS library to be present on the system. Consult the NVBLAS User Guide for details.
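As a sketch, NVBLAS is typically configured by pointing the NVBLAS_CONFIG_FILE environment variable at an nvblas.conf file. The keys below follow the NVBLAS User Guide; the CPU library path is a placeholder, since NVBLAS needs a shared CPU BLAS library to fall back on:

```
# nvblas.conf (sketch; see the NVBLAS User Guide for the full option list)

# Shared CPU BLAS library to fall back on (placeholder path)
NVBLAS_CPU_BLAS_LIB /path/to/libopenblas.so

# Route calls to all GPUs visible to the process
NVBLAS_GPU_LIST ALL

# Optional log file
NVBLAS_LOGFILE nvblas.log
```

An existing binary is then typically run with libnvblas.so preloaded (e.g. via LD_PRELOAD) so that BLAS calls are intercepted without relinking.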