Perf - tingxingdong/clBLAS-private GitHub Wiki

clBLAS client program

The clBLAS client program is found in the ./src/client subdirectory. It is more than a sample application demonstrating the use of the BLAS library: it can invoke a user-specified type of BLAS operation and report its performance.

The block below shows the help message given by the client program listing all the command line options. These options can be used to input various parameters and control the type of BLAS operation.

F:\code\GitHub\kknox\bin\clBLAS\develop\vs10x64\package\bin64> .\client.exe --help
clBLAS client command line options:
  -h [ --help ]                 produces this help message
  -g [ --gpu ]                  Force instantiation of an OpenCL GPU device
  -c [ --cpu ]                  Force instantiation of an OpenCL CPU device
  -a [ --all ]                  Force instantiation of all OpenCL devices
  --useimages                   Use an image-based kernel
  -m [ --sizem ] arg (=128)     number of rows in A and C
  -n [ --sizen ] arg (=128)     number of columns in B and C
  -k [ --sizek ] arg (=128)     number of columns in A and rows in B
  --lda arg (=0)                first dimension of A in memory. if set to 0,
                                lda will default to M (when transposeA is "no
                                transpose") or K (otherwise)
  --ldb arg (=0)                first dimension of B in memory. if set to 0,
                                ldb will default to K (when transposeB is "no
                                transpose") or N (otherwise)
  --ldc arg (=0)                first dimension of C in memory. if set to 0,
                                ldc will default to M
  --offA arg (=0)               offset of the matrix A in memory object
  --offBX arg (=0)              offset of the matrix B or vector X in memory
                                object
  --offCY arg (=0)              offset of the matrix C or vector Y in memory
                                object
  --alpha arg (=1)              specifies the scalar alpha
  --beta arg (=1)               specifies the scalar beta
  -o [ --order ] arg (=0)       0 = row major, 1 = column major
  --transposeA arg (=0)         0 = no transpose, 1 = transpose, 2 = conjugate
                                transpose
  --transposeB arg (=0)         0 = no transpose, 1 = transpose, 2 = conjugate
                                transpose
  -f [ --function ] arg (=gemm) BLAS function to test. Options: gemm, trsm,
                                trmm, gemv, symv, syrk, syr2k
  -r [ --precision ] arg (=s)   Options: s,d,c,z
  --side arg (=0)               0 = left, 1 = right. only used with [list of
                                function families]
  --uplo arg (=0)               0 = upper, 1 = lower. only used with [list of
                                function families]
  --diag arg (=0)               0 = unit diagonal, 1 = non unit diagonal. only
                                used with [list of function families]
  -p [ --profile ] arg (=20)    Time and report the kernel speed (default:
                                profiling off)
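
The defaulting rule that the help text gives for --lda, --ldb and --ldc can be written out as a short Python sketch. This is only an illustration of the documented rule, not the client's actual source; the function names are hypothetical, and the code targets a current Python 3 interpreter.

```python
def default_leading_dimension(ld, no_transpose_dim, transpose_dim, transposed):
    """Return ld unchanged unless it is 0, in which case apply the documented default."""
    if ld != 0:
        return ld
    return transpose_dim if transposed else no_transpose_dim


def resolve_leading_dimensions(m, n, k, lda=0, ldb=0, ldc=0, trans_a=0, trans_b=0):
    """Fill in lda/ldb/ldc the way the help text describes.

    trans_a and trans_b use the client's encoding:
    0 = no transpose, 1 = transpose, 2 = conjugate transpose.
    """
    lda = default_leading_dimension(lda, m, k, trans_a != 0)  # M, or K if A is transposed
    ldb = default_leading_dimension(ldb, k, n, trans_b != 0)  # K, or N if B is transposed
    ldc = ldc if ldc != 0 else m                              # always defaults to M
    return lda, ldb, ldc


print(resolve_leading_dimensions(128, 64, 32))             # all defaults
print(resolve_leading_dimensions(128, 64, 32, trans_a=1))  # A transposed
```

An explicitly supplied non-zero value is passed through untouched; only the value 0 triggers the default.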

Examples are shown below. The first example invokes a single precision GEMM routine with all options left at their defaults.

F:\code\GitHub\kknox\bin\clBLAS\develop\vs10x64\package\bin64> .\client.exe
        StatisticalTimer:: Pruning 0 samples from clfunc
        StatisticalTimer:: Pruning 0 samples from clGemm
BLAS kernel execution time < ns >: 241963
BLAS kernel execution Gflops < 2.0*M*N*K/time >: 17.3345

The next example runs a 2048x2048 double precision TRSM routine in which A is a lower-triangular (--uplo 1), non-unit diagonal (--diag 1) matrix, and reports the average time of 20 runs with outliers pruned.

F:\code\GitHub\kknox\bin\clBLAS\develop\vs10x64\package\bin64> .\client.exe -m 2048 -n 2048 -f trsm --uplo 1 --diag 1 -r d -p 20
        StatisticalTimer:: Pruning 0 samples from clfunc
        StatisticalTimer:: Pruning 0 samples from clTrsm
BLAS kernel execution time < ns >: 5.03849e+007
BLAS kernel execution Gflops < M*(M+1)*N/time >: 170.569
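
The Gflops figures can be checked by hand from the formulas printed between the angle brackets. Because the time is reported in nanoseconds, floating-point operations per nanosecond is numerically the same as Gflops. The sketch below redoes the arithmetic for the two runs above; the function names are made up for illustration.

```python
def gemm_gflops(m, n, k, time_ns):
    """GEMM rate from the client's formula 2.0*M*N*K/time (time in ns)."""
    return 2.0 * m * n * k / time_ns


def trsm_gflops(m, n, time_ns):
    """TRSM rate from the client's formula M*(M+1)*N/time (time in ns)."""
    return float(m) * (m + 1) * n / time_ns


# Default GEMM run: M = N = K = 128, 241963 ns.
print(gemm_gflops(128, 128, 128, 241963))

# TRSM run: M = N = 2048, 5.03849e+007 ns.
print(trsm_gflops(2048, 2048, 5.03849e7))
```

Both results agree with the reported 17.3345 and 170.569 to within rounding; the TRSM value differs slightly in the last digit because the printed time is itself rounded to six significant figures.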

Python Dependencies

Python version 2.7.x is supported.

To use these scripts, you will need to download and install the 32-BIT VERSION of:

Python 2.7 x86 (32-bit) - http://www.python.org/download/releases/2.7.1

You will also need the 32-BIT VERSIONS of the following packages, because not all of them were available in 64-bit at the time of this writing. The ActiveState Python distribution is recommended for Windows (make sure to get the Python 2.7-compatible packages):

NumPy 1.5.1 (32-bit, 64-bit unofficial, supports Python 2.4 - 2.7 and 3.1 - 3.2.) - http://sourceforge.net/projects/numpy/files/NumPy/
matplotlib 1.0.1 (32-bit & 64-bit, supports Python 2.4 - 2.7) - http://sourceforge.net/projects/matplotlib/files/matplotlib/

For ActiveState Python, all that one should need to type is 'pypm install matplotlib'.

Python performance scripts

While it is convenient to be able to time a particular function with a given set of parameters, it is even better to be able to generate a plot of performance over a range of parameters. clBLAS can generate performance plots with the help of Python scripts. The python scripts are located at ./src/scripts/perf, but when the INSTALL target is built from the build environment the scripts are copied into the ./bin/clBLAS/develop/vs10x64/package directory along with the rest of the built binaries.

There are two primary Python scripts that users interact with directly.

measurePerformance.py

This script measures and gathers performance data and records it in a log file. It calls the client program in a loop, modifying the program parameters in an organized fashion, and scrapes stdout for performance information. Its interface simplifies specifying test ranges and strides. Extensive help is available through the --help parameter:
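
The scraping step can be pictured with a small Python sketch. It is not taken from measurePerformance.py; it only assumes the stdout format that the client prints elsewhere on this page, and the names GFLOPS_LINE and scrape_gflops are hypothetical.

```python
import re

# Matches e.g. "BLAS kernel execution Gflops < 2.0*M*N*K/time >: 3.11932".
# The formula between the angle brackets varies by function, so it is skipped.
GFLOPS_LINE = re.compile(r"BLAS kernel execution Gflops < .* >:\s*([0-9.eE+-]+)")


def scrape_gflops(stdout_text):
    """Return the Gflops value found in the client's stdout, or None if absent."""
    match = GFLOPS_LINE.search(stdout_text)
    return float(match.group(1)) if match else None


# Sample text copied from a client run; in a real loop it would be the
# captured stdout of client.exe.
sample = (
    "BLAS kernel execution time < ns >: 168078\n"
    "BLAS kernel execution Gflops < 2.0*M*N*K/time >: 3.11932\n"
)
print(scrape_gflops(sample))
```

Returning None when the line is missing lets a caller tell a failed run apart from a slow one.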

F:\code\GitHub\kknox\bin\clBLAS\develop\vs10x64\package\bin64> python .\measurePerformance.py --help
usage: measurePerformance.py [-h] [--device DEVICE] [-m SIZEM]
                                       [-n SIZEN] [-k SIZEK] [-s SQUARE]
                                       [--problemsize PROBLEMSIZE] [--lda LDA]
                                       [--ldb LDB] [--ldc LDC] [--offa OFFA]
                                       [--offb OFFB] [--offc OFFC] [-a ALPHA]
                                       [-b BETA] [-f FUNCTION] [-r PRECISION]
                                       [-o ORDER] [--transa TRANSA]
                                       [--transb TRANSB] [--side SIDE]
                                       [--uplo UPLO] [--diag DIAG]
                                       [--library LIBRARY] [--label LABEL]
                                       [--tablefile TABLEOUTPUTFILENAME]
                                       [--createini CREATEINIFILENAME | --ini USEINIFILENAME]

Measure performance of the clAmdBlas library

optional arguments:
  -h, --help            show this help message and exit
  --device DEVICE       device(s) to run on; may be a comma-delimited list.
                        choices are ['gpu', 'cpu']. (default gpu)
  -m SIZEM, --sizem SIZEM
                        size(s) of m to test; may include ranges and comma-
                        delimited lists. stepping may be indicated with a
                        colon. e.g., 1024 or 100-800:100 or 15,2048-3000
  -n SIZEN, --sizen SIZEN
                        size(s) of n to test; may include ranges and comma-
                        delimited lists. stepping may be indicated with a
                        colon. e.g., 1024 or 100-800:100 or 15,2048-3000
  -k SIZEK, --sizek SIZEK
                        size(s) of k to test; may include ranges and comma-
                        delimited lists. stepping may be indicated with a
                        colon. e.g., 1024 or 100-800:100 or 15,2048-3000
  -s SQUARE, --square SQUARE
                        size(s) of m=n=k to test; may include ranges and
                        comma-delimited lists. stepping may be indicated with
                        a colon. this option sets lda = ldb = ldc to the
                        values indicated with --lda for all problems set with
                        --square. e.g., 1024 or 100-800:100 or 15,2048-3000
  --problemsize PROBLEMSIZE
                        additional problems of a set size. may be used in
                        addition to sizem/n/k and lda/b/c. each indicated
                        problem size will be added to the list of problems to
                        complete. should be entered in MxNxK:AxBxC format
                        (where :AxBxC specifies lda/b/c. :AxBxC is optional.
                        if included, lda/b/c are subject to the same range
                        restrictions as indicated in the lda/b/c section of
                        this help. if omitted, :0x0x0 is assumed). may enter
                        multiple in a comma-delimited list. e.g.,
                        2x2x2:4x6x9,3x3x3 or 1024x800x333
  --lda LDA             value of lda; may include ranges and comma-delimited
                        lists. stepping may be indicated with a colon. if
                        transA = 'n', lda must be >= 'm'. otherwise, lda must
                        be >= 'k'. if this is violated, the problem will be
                        skipped. if lda is 0, it will be automatically set to
                        match either 'm' (if transA = 'n') or 'k' (otherwise).
                        may indicate relative size with +X, where X is the
                        offset relative to M or K (depending on transA). e.g.,
                        1024 or 100-800:100 or 15,2048-3000 or +10 (if transA
                        = 'n' and M = 100, lda = 110) (default 0)
  --ldb LDB             value of ldb; may include ranges and comma-delimited
                        lists. stepping may be indicated with a colon. if
                        transB = 'n', ldb must be >= 'k'. otherwise, ldb must
                        be >= 'n'. if this is violated, the problem will be
                        skipped. if ldb is 0, it will be automatically set to
                        match either 'k' (if transB = 'n') or 'n' (otherwise).
                        may indicate relative size with +X, where X is the
                        offset relative to K or N (depending on transB). e.g.,
                        1024 or 100-800:100 or 15,2048-3000 or +100 (if transB
                        = 'n' and K = 2000, ldb = 2100) (default 0)
  --ldc LDC             value of ldc; may include ranges and comma-delimited
                        lists. stepping may be indicated with a colon. ldc
                        must be >= 'm'. if this is violated, the problem will
                        be skipped. if ldc is 0, it will be automatically set
                        to match 'm'. may indicate relative size with +X,
                        where X is the offset relative to M. e.g., 1024 or
                        100-800:100 or 15,2048-3000 or +5 (if M = 15, ldc =
                        20) (default 0)
  --offa OFFA           offset of the matrix A in memory; may include ranges
                        and comma-delimited lists. stepping may be indicated
                        with a colon. e.g., 0-31 or 100-128:2 or 42 (default
                        0)
  --offb OFFB           offset of the matrix B or vector X in memory; may
                        include ranges and comma-delimited lists. stepping may
                        be indicated with a colon. e.g., 0-31 or 100-128:2 or
                        42 (default 0)
  --offc OFFC           offset of the matrix C or vector Y in memory; may
                        include ranges and comma-delimited lists. stepping may
                        be indicated with a colon. e.g., 0-31 or 100-128:2 or
                        42 (default 0)
  -a ALPHA, --alpha ALPHA
                        specifies the scalar alpha
  -b BETA, --beta BETA  specifies the scalar beta
  -f FUNCTION, --function FUNCTION
                        indicates the function(s) to use. may be a comma
                        delimited list. choices are ['gemm', 'trmm', 'trsm',
                        'syrk', 'syr2k', 'gemv', 'symv'] (default gemm)
  -r PRECISION, --precision PRECISION
                        specifies the precision for the function. may be a
                        comma delimited list. choices are ['s', 'd', 'c', 'z']
                        (default s)
  -o ORDER, --order ORDER
                        select row or column major. may be a comma delimited
                        list. choices are ['row', 'column'] (default row)
  --transa TRANSA       select none, transpose, or conjugate transpose for
                        matrix A. may be a comma delimited list. choices are
                        ['none', 'transpose', 'conj'] (default none)
  --transb TRANSB       select none, transpose, or conjugate transpose for
                        matrix B. may be a comma delimited list. choices are
                        ['none', 'transpose', 'conj'] (default none)
  --side SIDE           select side, left or right for TRMM and TRSM. may be a
                        comma delimited list. choices are ['left', 'right']
                        (default left)
  --uplo UPLO           select uplo, upper or lower triangle. may be a comma
                        delimited list. choices are ['upper', 'lower']
                        (default upper)
  --diag DIAG           select diag, whether set diagonal elements to one. may
                        be a comma delimited list. choices are ['unit',
                        'nonunit'] (default unit)
  --library LIBRARY     indicates the library to use. choices are
                        ['clamdblas'] (default clamdblas)
  --label LABEL         a label to be associated with all transforms performed
                        in this run. if LABEL includes any spaces, it must be
                        in "double quotes". note that the label is not saved
                        to an .ini file. e.g., --label cayman may indicate
                        that a test was performed on a cayman card or --label
                        "Windows 32" may indicate that the test was performed
                        on Windows 32
  --tablefile TABLEOUTPUTFILENAME
                        save the results to a plaintext table with the file
                        name indicated. this can be used with
                        plotPerformance.py to generate graphs of the
                        data (default: table prints to screen)
  --createini CREATEINIFILENAME
                        create an .ini file with the given name that saves the
                        other parameters given at the command line, then quit.
                        e.g., 'measurePerformance.py -m 10 -n 100 -k
                        1000-1010 -f sgemm --createini my_favorite_setup.ini'
                        will create an .ini file that will save the
                        configuration for an sgemm of the indicated sizes.
  --ini USEINIFILENAME  use the parameters in the named .ini file instead of
                        the command line parameters.
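
The size syntax described in the help text (single values, comma-delimited lists, ranges, and a stride after a colon) can be sketched in a few lines of Python. This is an illustration rather than the script's own parser. It rests on three assumptions: ranges are inclusive, a range without a stride steps by 1, and a stride written as xN (as in the 64-1024:x2 example on this page) multiplies by N. The +X relative form accepted by --lda, --ldb and --ldc is not handled. The name expand_sizes is hypothetical.

```python
def expand_sizes(spec):
    """Expand a spec such as '100-800:100' or '15,2048-3000' into a list of ints."""
    sizes = []
    for item in spec.split(","):
        bounds, _, stride = item.strip().partition(":")
        if "-" not in bounds:
            sizes.append(int(bounds))  # a single value
            continue
        low, high = (int(part) for part in bounds.split("-"))
        geometric = stride.startswith("x")
        # ASSUMPTION: a range with no stride steps by 1.
        step = int(stride[1:]) if geometric else int(stride or 1)
        if (geometric and (step < 2 or low < 1)) or step < 1:
            raise ValueError("stride would never reach the upper bound: %r" % item)
        value = low
        while value <= high:  # ASSUMPTION: the upper bound is inclusive
            sizes.append(value)
            value = value * step if geometric else value + step
    return sizes


print(expand_sizes("64-1024:x2"))
print(expand_sizes("100-800:100"))
```

The guard against non-advancing strides keeps a malformed specification from looping forever.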

An example of using this script to gather data is shown below: it runs GEMM (the default function) in single precision (the default) on square problems for every power-of-2 size from 64 to 1024.

F:\code\GitHub\kknox\bin\clBLAS\develop\vs10x64\package\bin64> python .\measurePerformance.py -s 64-1024:x2
A subdirectory or file perfLog already exists.
=========================MEASURE PERFORMANCE START===========================
Process id of Measure Performance:4988
Executing measure performance for label: None
Total combinations = 5
preparing command: 1
Executing Command: ['client.exe', '--gpu', '-m', '64', '-n', '64', '-k', '64', '--lda', '0', '--ldb', '0'
, '--ldc', '0', '--offA', '0', '--offBX', '0', '--offCY', '0', '--alpha', '1.0', '--beta', '1.0', '--order', '0', '
--transposeA', '0', '--transposeB', '0', '--side', '0', '--uplo', '0', '--diag', '0', '--function', 'gemm', '--prec
ision', 's', '-p', '10']
stdout:
BLAS kernel execution time < ns >: 168078
BLAS kernel execution Gflops < 2.0*M*N*K/time >: 3.11932

stderr:
        StatisticalTimer:: Pruning 0 samples from clfunc
        StatisticalTimer:: Pruning 0 samples from clGemm

Execution Successfull---------------

preparing command: 2
Executing Command: ['client.exe', '--gpu', '-m', '128', '-n', '128', '-k', '128', '--lda', '0', '--ldb',
'0', '--ldc', '0', '--offA', '0', '--offBX', '0', '--offCY', '0', '--alpha', '1.0', '--beta', '1.0', '--order', '0'
, '--transposeA', '0', '--transposeB', '0', '--side', '0', '--uplo', '0', '--diag', '0', '--function', 'gemm', '--p
recision', 's', '-p', '10']
stdout:
BLAS kernel execution time < ns >: 241691
BLAS kernel execution Gflops < 2.0*M*N*K/time >: 17.354

stderr:
        StatisticalTimer:: Pruning 0 samples from clfunc
        StatisticalTimer:: Pruning 0 samples from clGemm

Execution Successfull---------------

This generates a log file in the current directory that records every parameter combination tested along with its performance number.

F:\code\GitHub\kknox\bin\clBLAS\develop\vs10x64\package\bin64> cat .\results2013-07-19T10.27.42.487000.txt
m,n,k,lda,ldb,ldc,offa,offb,offc,alpha,beta,order,transa,transb,side,uplo,diag,function,device,library,label,GFLOPS
64,64,64,0,0,0,0,0,0,1.0,1.0,row,none,none,left,upper,unit,sgemm,gpu,clblas,None,3.11932
128,128,128,0,0,0,0,0,0,1.0,1.0,row,none,none,left,upper,unit,sgemm,gpu,clblas,None,17.354
256,256,256,0,0,0,0,0,0,1.0,1.0,row,none,none,left,upper,unit,sgemm,gpu,clblas,None,82.6617
512,512,512,0,0,0,0,0,0,1.0,1.0,row,none,none,left,upper,unit,sgemm,gpu,clblas,None,327.464
1024,1024,1024,0,0,0,0,0,0,1.0,1.0,row,none,none,left,upper,unit,sgemm,gpu,clblas,None,867.143
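
Because the log is a plain comma-separated table with a header row, it can be read with Python's standard csv module. The sketch below is not part of the clBLAS scripts; it uses the header and the first two rows shown above as inline sample data, and the name read_results is hypothetical.

```python
import csv
import io


def read_results(handle):
    """Yield (m, gflops) pairs from a results table written by measurePerformance.py."""
    for row in csv.DictReader(handle):
        yield int(row["m"]), float(row["GFLOPS"])


# Header and first two rows copied from the log above. With a real log,
# pass an open file object instead of the StringIO wrapper.
sample = io.StringIO(
    "m,n,k,lda,ldb,ldc,offa,offb,offc,alpha,beta,order,transa,transb,"
    "side,uplo,diag,function,device,library,label,GFLOPS\n"
    "64,64,64,0,0,0,0,0,0,1.0,1.0,row,none,none,left,upper,unit,"
    "sgemm,gpu,clblas,None,3.11932\n"
    "128,128,128,0,0,0,0,0,0,1.0,1.0,row,none,none,left,upper,unit,"
    "sgemm,gpu,clblas,None,17.354\n"
)
points = list(read_results(sample))
print(points)
```

Reading by column name rather than by position keeps the code working if columns are ever reordered.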

This log file is then fed into the plotPerformance.py script, which consumes the records and plots the results in a graph.

plotPerformance.py

While the log file generated by measurePerformance.py is sufficient for gathering performance data, plotting the data makes it easier to compare and contrast different data sets. This is the purpose of plotPerformance.py: it uses the freely available Python matplotlib library either to open an interactive graph in a window or to write an image file straight to disk. Extensive help is available through the --help parameter:

F:\code\GitHub\kknox\bin\clBLAS\develop\vs10x64\package\bin64> python .\plotPerformance.py --help
usage: plotPerformance.py [-h] -d DATAFILE -x {sizem,sizen,sizek}
                                    [-y {gflops}]
                                    [--plot {lda,ldb,ldc,sizek,device,label,order,transa,transb,function,library}]
                                    [--title GRAPHTITLE]
                                    [--x_axis_label XAXISLABEL]
                                    [--x_axis_scale {linear,log2,log10}]
                                    [--y_axis_label YAXISLABEL]
                                    [--outputfile OUTPUTFILENAME]

Plot performance of the clBLAS library. plotPerformance.py reads
in data tables from  measurePerformance.py and plots their values

optional arguments:
  -h, --help            show this help message and exit
  -d DATAFILE, --datafile DATAFILE
                        indicate a file to use as input. must be in the format
                        output by measurePerformance.py. may be used
                        multiple times to indicate multiple input files. e.g.,
                        -d cypressOutput.txt -d caymanOutput.txt
  -x {sizem,sizen,sizek}, --x_axis {sizem,sizen,sizek}
                        indicate which value will be represented on the x
                        axis. problemsize is defined as x*y*z*batchsize
  -y {gflops}, --y_axis {gflops}
                        indicate which value will be represented on the y axis
  --plot {lda,ldb,ldc,sizek,device,label,order,transa,transb,function,library}
                        indicate which of ['lda', 'ldb', 'ldc', 'sizek',
                        'device', 'label', 'order', 'transa', 'transb',
                        'function', 'library'] should be used to differentiate
                        multiple plots. this will be chosen automatically if
                        not specified
  --title GRAPHTITLE    the desired title for the graph generated by this
                        execution. if GRAPHTITLE contains any spaces, it must
                        be entered in "double quotes". if this option is not
                        specified, the title will be autogenerated
  --x_axis_label XAXISLABEL
                        the desired label for the graph's x-axis. if
                        XAXISLABEL contains any spaces, it must be entered in
                        "double quotes". if this option is not specified, the
                        x-axis label will be autogenerated
  --x_axis_scale {linear,log2,log10}
                        the desired scale for the graph's x-axis. if nothing
                        is specified, it will be selected automatically
  --y_axis_label YAXISLABEL
                        the desired label for the graph's y-axis. if
                        YAXISLABEL contains any spaces, it must be entered in
                        "double quotes". if this option is not specified, the
                        y-axis label will be autogenerated
  --outputfile OUTPUTFILENAME
                        name of the file to output graphs. Supported formats:
                        emf, eps, pdf, png, ps, raw, rgba, svg, svgz.

Once the performance of a particular run has been saved to a log file, you can instruct plotPerformance.py to parse the log file and create a line graph from that data. Multiple log files can be read in one invocation, which makes it easy to compare results against each other.
The command below compares the single and double precision performance of the GEMM function.

F:\code\GitHub\kknox\bin\clBLAS\develop\vs10x64\package\bin64> python .\plotPerformance.py -x sizem --plot function -d .\results2013-07-19T13.52.41.272000.txt -d .\results2013-07-19T13.53.26.528000.txt --title "Single vs. Double precision GEMM"
m,n,k,lda,ldb,ldc,offa,offb,offc,alpha,beta,order,transa,transb,side,uplo,diag,function,device,library,label,GFLOPS
m,n,k,lda,ldb,ldc,offa,offb,offc,alpha,beta,order,transa,transb,side,uplo,diag,function,device,library,label,GFLOPS

True
m,n,k,lda,ldb,ldc,offa,offb,offc,alpha,beta,order,transa,transb,side,uplo,diag,function,device,library,label,GFLOPS
m,n,k,lda,ldb,ldc,offa,offb,offc,alpha,beta,order,transa,transb,side,uplo,diag,function,device,library,label,GFLOPS

True

The resulting graph should look very similar to the following figure:

[Figure: Single vs. Double precision]

To AMD NDA Customers

For AMD's NDA customers, there are optimized OpenCL compiler flags that are usable only with the NDA driver and are therefore available only to NDA customers. This combination of flags and driver improves the performance of sgemm and dgemm.

For AMD's NDA customers, there is also a proprietary tool that holds the GPU clock stable. This tool is required to achieve optimal and stable performance.

Selected Performance Graphs

Below are some graphs showing clBLAS 2.8.0 performance on an S9150 with the 14.50.2 driver, as well as cuBLAS 7.5 performance on a K40. GPU clock speeds were set to maximum for both devices.

SGEMM

[Figure: SGEMM_S9150_K40]