Design - tingxingdong/clBLAS-private GitHub Wiki
clBLAS high level design
In addition to returning accurate results from all BLAS routines, a top priority for the design of the clBLAS library is high performance. Early in design and development, the approach chosen to achieve the highest performance was to generate OpenCL kernels dynamically, based on the set of input parameters passed to a routine at runtime and tuned for the specific hardware in the user's machine. This implies that the very first time a routine is called with a given set of parameters, the kernel is generated and compiled on the fly, resulting in a performance hit on that first call. Unfortunately, the traditional BLAS API has no mechanism to separate a preparation step from the function call, unlike many FFT APIs, which provide one through the concept of a plan. However, all subsequent calls to the BLAS API with the same parameters execute very fast.
This design persisted during the development of the GEMM, TRMM, TRSM, SYRK and SYR2K L3 routines and two of the L2 routines, GEMV and SYMV. A tool was developed that heuristically searches the important parameters for one of those routines and saves the generator parameters that produce an optimal kernel for the device under test.
However, as development of the library continued with the rest of the L2 and L1 routines, it was decided that dynamically generated kernels would see diminishing benefits, because the majority of the L2 routines and all of the L1 routines are not compute bound. The cost of dynamic kernels is that they are harder to develop, maintain and debug than the more traditional static .cl files for OpenCL. Therefore, the decision was made to switch development to the more rigid KPRINTF template renderer, which is easier to develop, maintain and debug but still allows the flexibility on data types that the dynamic kernel generator enjoys.
Common organization
From the point of view of internal organization, the library can be decomposed into two parts: frontend and backend. The frontend performs initial parameter checking, solving strategy selection and kernel caching. The backend is a set of parametrized kernel generators. A typical clBLAS library function performs the following steps:
- Validating the parameters and translating them into an internal representation.
- Distributing the problem among the command queues that were passed in.
- Decomposing each part of the problem mathematically in order to leverage faster kernels that are native to other functions.
- Choosing the best kind of kernel to solve the problem with.
- Getting the best decomposition block sizes for the problem from the kernel database file.
- Checking whether a kernel matching the problem flags is present in the kernel cache. If not, the library tries to load the kernel from the kernel database file. If this database is not available, or the kernel is not present in it, the library invokes the respective kernel generator. The obtained kernel is then added to the cache.
- Enqueueing kernels for all the resulting subproblems.
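The kernel lookup order in the steps above (cache first, then the database file, then the generator) can be sketched in C as follows. This is a minimal illustration, not the clBLAS implementation: every name here (`kernel_cache`, `find_kernel`, the string keys standing in for the problem flags) is hypothetical, and the database is modeled as a plain array of keys.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names throughout; this is not the clBLAS internal API. */
enum kernel_source { FROM_CACHE, FROM_DATABASE, FROM_GENERATOR };

#define CACHE_CAPACITY 16
#define KEY_LEN 64

struct kernel_cache {
    char   keys[CACHE_CAPACITY][KEY_LEN]; /* one key per cached kernel */
    size_t count;
};

/* Returns 1 if a kernel with this key is already cached, else 0. */
static int cache_contains(const struct kernel_cache *c, const char *key)
{
    for (size_t i = 0; i < c->count; i++)
        if (strcmp(c->keys[i], key) == 0)
            return 1;
    return 0;
}

/* Adds a key to the cache; silently skips the add when the cache is full. */
static void cache_add(struct kernel_cache *c, const char *key)
{
    assert(strlen(key) < KEY_LEN);
    if (c->count < CACHE_CAPACITY) {
        strcpy(c->keys[c->count], key);
        c->count++;
    }
}

/* Stand-in for the database file: db_keys lists the stored kernels,
 * and NULL models a database that is not available. */
static int database_contains(const char *const *db_keys, size_t db_len,
                             const char *key)
{
    if (db_keys == NULL)
        return 0;
    for (size_t i = 0; i < db_len; i++)
        if (strcmp(db_keys[i], key) == 0)
            return 1;
    return 0;
}

/* Reports where the kernel came from and caches it for later calls. */
enum kernel_source find_kernel(struct kernel_cache *cache,
                               const char *const *db_keys, size_t db_len,
                               const char *key)
{
    if (cache_contains(cache, key))
        return FROM_CACHE;

    enum kernel_source src = database_contains(db_keys, db_len, key)
                                 ? FROM_DATABASE
                                 : FROM_GENERATOR;
    cache_add(cache, key);
    return src;
}
```

Calling `find_kernel` twice with the same key returns `FROM_CACHE` the second time, which mirrors why only the first call to a routine with a given set of parameters pays the generation cost.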
Problem decompositions
For all functions a block approach is used. This means that matrices are divided into independent blocks that are processed sequentially. The sizes of the blocks that a kernel operates on are described with the SubproblemDim structure. It describes the amount of work that each work group or work item should perform and the steps with which they should process blocks. The amount of work is expressed in terms of the output matrix chunk that should be updated.
The library uses at most two decomposition levels. Level 0 describes the work that the entire work group performs, and level 1 describes the work for a single work item. A generator is free to choose whether it uses one or two levels. If a one-level decomposition is selected, the front-end assumes that this is the work-item level.
A decomposition may be 1D or 2D. This matches the OpenCL execution model: a 1D decomposition maps to a 1D work space and a 2D decomposition maps to a 2D work space. A generator specifies the decomposition dimensions it supports with the SF_WSPACE_1D and SF_WSPACE_2D flags. In principle, the design allows both schemes to be supported at the same time by the same generator. However, the front-end currently assumes that a 2D decomposition is always better and prefers a 2D work space if the SF_WSPACE_2D flag is specified.
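As a rough illustration of how a two-level, 2D decomposition translates into an OpenCL-style launch shape, the sketch below computes the number of work groups needed to cover an output matrix and the number of work items per group. The `chunk` and `launch_2d` types and the `plan_2d` function are hypothetical simplifications introduced for this example; they do not reproduce the fields of the real SubproblemDim structure.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical simplification: each level is described only by the size
 * of the output-matrix chunk it updates. */
struct chunk {
    size_t rows;
    size_t cols;
};

struct launch_2d {
    size_t groups_y, groups_x; /* work groups covering the output matrix */
    size_t items_y, items_x;   /* work items inside one work group */
};

/* Integer division rounded up, so partial blocks at the edges are covered. */
static size_t ceil_div(size_t a, size_t b)
{
    assert(b > 0);
    return (a + b - 1) / b;
}

/* level0: chunk updated by a whole work group.
 * level1: chunk updated by a single work item. */
struct launch_2d plan_2d(size_t m, size_t n,
                         struct chunk level0, struct chunk level1)
{
    struct launch_2d l;
    l.groups_y = ceil_div(m, level0.rows);
    l.groups_x = ceil_div(n, level0.cols);
    l.items_y  = ceil_div(level0.rows, level1.rows);
    l.items_x  = ceil_div(level0.cols, level1.cols);
    return l;
}
```

For a 100 x 64 output matrix with a 32 x 32 level 0 chunk and a 4 x 4 level 1 chunk, this gives 4 x 2 work groups of 8 x 8 work items each. The last row of work groups covers only a partial block, which a real kernel would have to handle.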
Submitting problems to the generic layer
A generic layer follows the library API level. It is designed to take care of common operations such as distributing work among command queues, efficient math decomposition, etc. It works with an abstract solution sequence that holds information about the subproblems that together make up the initial problem, along with the kernels, command queues and, if needed, scratch resources. The top layer first fills a CLBlasKargs structure, assigning the values of the function arguments to the respective fields, and then builds a solution sequence by calling the makeSolutionSeq() function. The sequence is dispatched with the executeSolutionSeq() function, and freeSolutionSeq() frees all resources consumed by the sequence.
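The build, dispatch and free lifecycle of a solution sequence can be sketched with a toy linked list. The source does not show the signatures of makeSolutionSeq(), executeSolutionSeq() or freeSolutionSeq(), so the functions below (`make_seq`, `execute_seq`, `free_seq`) and the `step` type are hypothetical stand-ins. The only work distribution modeled is splitting output rows among command queues.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical toy step; a real sequence would also carry kernels,
 * command queue handles and scratch resources. */
struct step {
    size_t queue;      /* index of the command queue this step targets */
    size_t row_start;  /* first output row handled by this step */
    size_t row_count;  /* number of output rows handled by this step */
    struct step *next;
};

/* Frees every step in the sequence; accepts NULL. */
void free_seq(struct step *seq)
{
    while (seq != NULL) {
        struct step *next = seq->next;
        free(seq);
        seq = next;
    }
}

/* Splits `rows` output rows as evenly as possible among `queues` queues.
 * Returns NULL when there is nothing to do or an allocation fails. */
struct step *make_seq(size_t rows, size_t queues)
{
    struct step *head = NULL;
    struct step **tail = &head;
    size_t start = 0;

    for (size_t q = 0; q < queues && start < rows; q++) {
        size_t queues_left = queues - q;
        size_t count = (rows - start + queues_left - 1) / queues_left;
        struct step *s = malloc(sizeof *s);
        if (s == NULL) {
            free_seq(head);
            return NULL;
        }
        s->queue = q;
        s->row_start = start;
        s->row_count = count;
        s->next = NULL;
        *tail = s;
        tail = &s->next;
        start += count;
    }
    return head;
}

/* Stand-in for dispatch: walks the sequence and returns the total number
 * of rows that would have been enqueued. */
size_t execute_seq(const struct step *seq)
{
    size_t dispatched = 0;
    for (const struct step *s = seq; s != NULL; s = s->next) {
        assert(s->row_count > 0);
        dispatched += s->row_count;
    }
    return dispatched;
}
```

With 10 rows and 3 queues, `make_seq` produces steps of 4, 3 and 3 rows, `execute_seq` reports 10 rows dispatched, and `free_seq` releases the list.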
Dynamic kernel generation
Overview
The following describes the organization of the BLAS library's dynamic kernel generator and the approaches it uses to optimize the library for different GPU-based hardware. The technique is known as AEOS, which stands for “Automated Empirical Optimization of Software”. It makes it possible to produce optimal code for a target platform automatically, with benefits comparable to those of intricate hand tuning.