Writing QUDA Kernels - lattice/quda GitHub Wiki
This page describes the QUDA convention in how to add new kernels to QUDA. This serves as both best-practice advice for kernel performance, a style guide to ensure uniformity across QUDA, and also to promote an abstracted programming style that emphasises readability and code portability.
This documentation will be split into two relatively separate parts,
- Part 1: A colloquial, somewhat informal, example driven explanation of how to write or modify a QUDA kernel
- Part 2: A more formal, abstract, and general explanation of how QUDA kernels are written.
Users new to QUDA and/or users new to GPU computing are advised to read Part 1 before tackling Part 2 so that they can familiarise themselves with the vernacular and common structures used in QUDA. Experienced programmers may find reading Part 1 useful, but may also find that they can write a QUDA kernel immediately after reading Part 2.
Broadly speaking, there are three types of QUDA kernel one can write:
- Pure fermion
- Pure gauge
- Both fermion and gauge
For example, the contraction kernel routines are pure fermion. Here, one need only load fermion fields and an array to store the results of a contraction. These are the simplest types of QUDA kernel there are as they do not involve any form of communication: all the data one needs for a contraction is already on the MPI node.
As a second example, consider the stout-smearing kernel. This is a pure-gauge routine, and loading gauge fields to QUDA kernels is a touch more involved. Furthermore, gauge fields routines generally involve using 'staples' made up of gauge field links that reside between adjacent lattice points. This naturally involves communication for lattice spaces at the MPI boundary, which QUDA handles using extended fields (as opposed to explicit ghost zone).
Thirdly, we will look at the Wilson dslash kernel. This is a kernel that uses both fermion and gauge fields (though the gauge fields will remain unchanged during the execution of the Wilson kernel.) We will see how QUDA uses ghost zones to overlap communications and computations of the fermion field.
Let us then follow the code path from the interface_quda.cpp
routine void contractQuda(...)
to see what is involved in writing a pure fermion kernel. This will be an 'stream of consciousness` introduction to a QUDA kernel, with a linear narrative following the code path. For a more abstract explanation, please refer to the summary.
The contractQuda
routine in defined lib/interface_quda.cpp
and declared in include/quda.h
accepts three pointers and some meta data. The pointers are the two fermion fields to be contracted and an array that will store the result of the contraction, which is the 16 \mu \nu contractions associated with the operation \Sum_{a} \phi_{a,\mu, X} \psi_{a,\nu, X} per lattice site.
void contractQuda(const void *hp_x, const void *hp_y, void *h_result, const QudaContractType cType,
QudaInvertParam *param, const int *X)
{
This is a host function. It is passed pointers to arrays defined on the host. Its function is to move the data from the host to the device, call the GPU functions that will perform a contraction, then move the results from the device back to the host. We will go through it line my line so as to leave no stone unturned.
profileContract.TPSTART(QUDA_PROFILE_TOTAL);
profileContract.TPSTART(QUDA_PROFILE_INIT);
These objects are defined in the interface_quda.cpp
file. They will collect timing data for the various aspects of the function, such as time spent computing, moving data, etc. We start the profiling with TPSTART(QUDA_PROFILE_TOTAL)
and then measure specific aspects with TPSTART(QUDA_PROFILE_<aspect>)
. This is the initialisation.
// wrap CPU host side pointers
ColorSpinorParam cpuParam((void *)hp_x, *param, X, false, param->input_location);
ColorSpinorField *h_x = ColorSpinorField::Create(cpuParam);
We begin by creating a QUDA object ColorSpinorParam
that will collect the data it needs to make a ColorSpinorField
object. It is given, as one of its arguments, a pointer to the host data hp_x
. Thus, the resulting ColorSpinorField
pointer *h_x
points to the data associated with hp_x
. We change the ColorSpinorParam
pointer to the other host pointer
cpuParam.v = (void *)hp_y;
ColorSpinorField *h_y = ColorSpinorField::Create(cpuParam);
in order to create a ColorSpinorField
wrapper for hp_y
. This is just done so that we may wrap the raw lattice data in a convenient QUDA object. Next, we create a ColorSpinorParam
object that will describe ColorSpinorField
objects on the GPU.
// Create device parameter
ColorSpinorParam cudaParam(cpuParam);
cudaParam.location = QUDA_CUDA_FIELD_LOCATION;
cudaParam.create = QUDA_NULL_FIELD_CREATE;
// Quda uses Degrand-Rossi gamma basis for contractions and will
// automatically reorder data if necessary.
cudaParam.gammaBasis = QUDA_DEGRAND_ROSSI_GAMMA_BASIS;
cudaParam.setPrecision(cpuParam.Precision(), cpuParam.Precision(), true);
Notice some important aspects are changed. After cloning the cpuParam
object, we change the location to QUDA_CUDA_FIELD_LOCATION
. This means that ColorSpinorField
objects created using cudaParam
will live on device memory. cudaParam.create = QUDA_NULL_FIELD_CREATE;
simply means that the data does not need to be initialised. We declare that the GPU data should be in QUDA_DEGRAND_ROSSI_GAMMA_BASIS
order, and then we choose the GPU precision. Finally, we create the GPU ColorSpinorField
objects.
std::vector<ColorSpinorField *> x, y;
x.push_back(ColorSpinorField::Create(cudaParam));
y.push_back(ColorSpinorField::Create(cudaParam));
We use std::vector
so that future developers may more easily extend this code to handle multiple vectors. We next create some device memory to store the result of the contraction.
size_t data_bytes = x[0]->Volume() * x[0]->Nspin() * x[0]->Nspin() * 2 * x[0]->Precision();
void *d_result = pool_device_malloc(data_bytes);
profileContract.TPSTOP(QUDA_PROFILE_INIT);
This competes the initialisation. Next, we transfer the host lattice data to the device.
profileContract.TPSTART(QUDA_PROFILE_H2D);
*x[0] = *h_x;
*y[0] = *h_y;
profileContract.TPSTOP(QUDA_PROFILE_H2D);
Yes... that's right. No... I didn't miss out any code. QUDA's ColorSpinorField
objects are very smart indeed! The objects on the right are on the host. The objects on the left are on the device. If the =
operator is used for ColorSpinorField
objects that are located on different memory spaces, it is implied that that a data transfer must take place, and it is done automatically for the user. We next perform the GPU computation which is explained in the next section,
profileContract.TPSTART(QUDA_PROFILE_COMPUTE);
contractQuda(*x[0], *y[0], d_result, cType);
profileContract.TPSTOP(QUDA_PROFILE_COMPUTE);
and finally we transfer (manually this time!) the device result to the host:
profileContract.TPSTART(QUDA_PROFILE_D2H);
qudaMemcpy(h_result, d_result, data_bytes, cudaMemcpyDeviceToHost);
profileContract.TPSTOP(QUDA_PROFILE_D2H);
Last, we be good programmers and clean up memory,
profileContract.TPSTART(QUDA_PROFILE_FREE);
pool_device_free(d_result);
delete x[0];
delete y[0];
delete h_y;
delete h_x;
profileContract.TPSTOP(QUDA_PROFILE_FREE);
profileContract.TPSTOP(QUDA_PROFILE_TOTAL);
}
That is about as simple as a QUDA function can get, but all the main points are there:
- Wrap raw host data with a convenient QUDA object.
- Transfer host data to the device
- Perform the GPU compute
- Transfer the result bask to the host
- Clean up memory.
We will see more complicated functions later, but for now this is enough. Let us then move on to the GPU function.
The interface function has called the GPU function contractQuda
. Here is is in full:
void contractQuda(const ColorSpinorField &x, const ColorSpinorField &y, void *result, const QudaContractType cType)
{
#ifdef GPU_CONTRACT
checkPrecision(x, y);
if (x.GammaBasis() != QUDA_DEGRAND_ROSSI_GAMMA_BASIS || y.GammaBasis() != QUDA_DEGRAND_ROSSI_GAMMA_BASIS)
errorQuda("Unexpected gamma basis x=%d y=%d", x.GammaBasis(), y.GammaBasis());
if (x.Ncolor() != 3 || y.Ncolor() != 3) errorQuda("Unexpected number of colors x=%d y=%d", x.Ncolor(), y.Ncolor());
if (x.Nspin() != 4 || y.Nspin() != 4) errorQuda("Unexpected number of spins x=%d y=%d", x.Nspin(), y.Nspin());
if (x.Precision() == QUDA_SINGLE_PRECISION) {
contract_quda<float>(x, y, (complex<float> *)result, cType);
} else if (x.Precision() == QUDA_DOUBLE_PRECISION) {
contract_quda<double>(x, y, (complex<double> *)result, cType);
} else {
errorQuda("Precision %d not supported", x.Precision());
}
#else
errorQuda("Contraction code has not been built");
#endif
}
The first thing to notice is that the body of the code is encapsulated in a preprocessor definition, which one can adjust at CMake time. If the QUDA_CONTRACT
CMake option is not set, none of the following code will be compiled. This is an important feature to include as it keeps compile time to a minimum. After some checks, we see that the contract_quda
function is instantiated on the desired precision:
template <typename real>
void contract_quda(const ColorSpinorField &x, const ColorSpinorField &y, complex<real> *result,
const QudaContractType cType)
{
ContractionArg<real> arg(x, y, result);
Contraction<real, ContractionArg<real>> contraction(arg, x, y, cType);
contraction.apply(0);
qudaDeviceSynchronize();
}
This deceptively simple function executes the entire contraction kernel. We must study it line by line.
ContractionArg<real> arg(x, y, result);
This line is little more than an object that will store the pointers to data on which we will work. You can see that it accepts the x
and y
fields (the two fermions to be contracted) and the result. It is templated on the precision. Here is the full contractionArg structure,
which is located in quda/include/kernels/contraction.cuh
template <typename real> struct ContractionArg {
int threads; // number of active threads required
int X[4]; // grid dimensions
static constexpr int nSpin = 4;
static constexpr int nColor = 3;
static constexpr bool spin_project = true;
static constexpr bool spinor_direct_load = false; // false means texture load
// Create a typename F for the ColorSpinorField (F for fermion)
typedef typename colorspinor_mapper<real, nSpin, nColor, spin_project, spinor_direct_load>::type F;
F x;
F y;
matrix_field<complex<real>, nSpin> s;
ContractionArg(const ColorSpinorField &x, const ColorSpinorField &y, complex<real> *s) :
threads(x.VolumeCB()),
x(x),
y(y),
s(s, x.VolumeCB())
{
for (int dir = 0; dir < 4; dir++) X[dir] = x.X()[dir];
}
};
This structure is fairly complex, but none of the elements are beyond the comprehension of a student of lattice gauge theory with a rudimentary knowledge of GPU computing. We first define the variables in the argument structure:
int threads; // number of active threads required
Naturally, this is the number of threads in the block
int X[4]; // grid dimensions
This is the number of lattice sites on the MPI node
static constexpr int nSpin = 4;
static constexpr int nColor = 3;
static constexpr bool spin_project = true;
static constexpr bool spinor_direct_load = false; // false means texture load
These vales are hardcoded to prevent template explosion. This particular kernel will only work for spin=4, color=3 fermion fields. The spin_project
boolean is a neat feature of QUDA that will automatically convert any gamma matrix ordered fermion into the desired gamma matrix order as it is loaded into the kernel. (KATE: do we need spinor_direct_load now that textures are eliminated?)
// Create a typename F for the ColorSpinorField (F for fermion)
typedef typename colorspinor_mapper<real, nSpin, nColor, spin_project, spinor_direct_load>::type F;
F x;
F y;
This object is a QUDA data type that will handle the way in which fermions are passed to the kernel. As you can see, it needs to know the precision, color, and spin of the fermion, and well as any operations (such as spin_project
) that need to be performed during the load. We give it a name F
.
matrix_field<complex<real>, nSpin> s;
This is a QUDA data type that will store the contraction. An nSpin X nSpin
matrix of type complex<real>
We now come to the constructor of this argument structure:
ContractionArg(const ColorSpinorField &x, const ColorSpinorField &y, complex<real> *s) :
threads(x.VolumeCB()),
x(x),
y(y),
s(s, x.VolumeCB())
{
for (int dir = 0; dir < 4; dir++) X[dir] = x.X()[dir];
}
This is what was called by the ContractionArg<real> arg(x, y, result);
line. threads
in is initialised with the checkerboarded sub-volume of a fermion field, the two fermion fields are, naturally, x, and y, and the result array is initialised with the address and the number of threads. The actual body of the argument structure is to then populate the X
array with the MPI lattice dimensions.
This argument structure provides the kernel with the pointers it needs to locate data, and the parameters it needs to perform it's computation.
Now that all the parameters and pointers have been established, we construct the kernel that will perform the computation:
Contraction<real, ContractionArg<real>> contraction(arg, x, y, cType);
This is a very subtle but powerful step. As you can see, the function named contraction
is a Contraction
type object, templetised on the specialised object ContractionArg
. This allows us to Bake-In data types such as
typedef typename colorspinor_mapper<real, nSpin, nColor, spin_project, spinor_direct_load>::type F;
and parameter values
static constexpr int nSpin = 4;
static constexpr int nColor = 3;
which make life easier for the NVCC compiler. This is a common strategy utilised by the majority of QUDA kernels.
The next thing to look at is the Contraction<real, ContractionArg<real>>
class itself. This is a fair chunk of abstract code to understand, so we will do so very carefully.
template <typename real, typename Arg> class Contraction : TunableVectorY
{
The first thing to notice is that this class inherits from TunableVectorY
. In fact, all of QUDA's kernel classes inherit from some form of Tunable
class. [see tuning section]. This allows QUDA to automagically experiment with block and thread dimensions to discover the optimal set-up for a given problem. The protected members of the class:
protected:
Arg &arg;
const ColorSpinorField &x;
const ColorSpinorField &y;
const QudaContractType cType;
are the argument structure we have already seen, the memory addresses of the two fermion fields to be contracted, and the type of contraction to compute. The private members:
private:
bool tuneGridDim() const { return false; } // Don't tune the grid dimensions.
unsigned int minThreads() const { return arg.threads; }
are explicit definitions of helper functions defined virtually in the Tunable
class. We must define them explicitly whenever we create a class that inherits from Tunable
. In this particular case, we instruct the autotuner to not tune the grid dimensions, and we also request that there be a minimum number of threads, given by arg.threads
. This value was set when we created the ContractionArg
structure. It was inferred from the checkerboard volume of one of the lattice fermions. Make sure you can identify in the body of the ContractionArg
code where this happened.
Next is the all important public interface of the class. The first method is the constructor:
public:
Contraction(Arg &arg, const ColorSpinorField &x, const ColorSpinorField &y, const QudaContractType cType) :
TunableVectorY(2),
arg(arg),
x(x),
y(y),
cType(cType)
{
switch (cType) {
case QUDA_CONTRACT_TYPE_OPEN: strcat(aux, "open,"); break;
case QUDA_CONTRACT_TYPE_DR: strcat(aux, "degrand-rossi,"); break;
default: errorQuda("Unexpected contraction type %d", cType);
}
strcat(aux, x.AuxString());
#ifdef JITIFY
create_jitify_program("kernels/contraction.cuh");
#endif
}
It is fairly standard for a constructor. We initialise the class from which we inherit TunableVectorY(2)
, then initialise the arg, vectors, and contraction type. The object aux
is from the Tunable
class. It is simply a string that defines the kernel for the autotuner. One the kernel has tuned, the autotuner will save the optimal set-up to a file and, if the same kernel (as defined by the aux
string) is called again, the autotuner will use the optimal setup immediately. The part that is encapsulated by the JITIFY
define is an instruction to the compiler that the kernel is to be compiled at run time [See section on JITIFY]
The destructor is about as boiler plate as it gets:
virtual ~Contraction() {}
Now we arrive at the kernel launch member function. This function is actually a member of the Tunable
class. This is how the autotuner is able to repeatedly launch the kernel with various parameters. When we call the apply
method we are not strictly asking the Contraction
object to apply the kernel, rather we are asking the Tunable
parent class to do it for us.
void apply(const qudaStream_t &stream)
{
if (x.Location() == QUDA_CUDA_FIELD_LOCATION) {
TuneParam tp = tuneLaunch(*this, getTuning(), getVerbosity());
QUDA is evolving so that the C++
code can be compiled by both CUDA and possibly a host compiler such as gcc
. Therefore, we place the CUDA specific code under a boolean x.Location() == QUDA_CUDA_FIELD_LOCATION
so we know we are using a GPU. We also define a TuneParam
object that is used by the Tunable
class.
#ifdef JITIFY
std::string function_name;
switch (cType) {
case QUDA_CONTRACT_TYPE_OPEN: function_name = "quda::computeColorContraction"; break;
case QUDA_CONTRACT_TYPE_DR: function_name = "quda::computeDegrandRossiContraction"; break;
default: errorQuda("Unexpected contraction type %d", cType);
}
using namespace jitify::reflection;
jitify_error = program->kernel(function_name)
.instantiate(Type<real>(), Type<Arg>())
.configure(tp.grid, tp.block, tp.shared_bytes, stream)
.launch(arg);
#else
This, again, is for when Just In Time compilation is used. [See section on JITIFY]
switch (cType) {
case QUDA_CONTRACT_TYPE_OPEN: computeColorContraction<real><<<tp.grid, tp.block, tp.shared_bytes>>>(arg); break;
case QUDA_CONTRACT_TYPE_DR:
computeDegrandRossiContraction<real><<<tp.grid, tp.block, tp.shared_bytes>>>(arg);
break;
default: errorQuda("Unexpected contraction type %d", cType);
}
Here we now, at long last, see the familiar triple chevron <<<...>>>
syntax of a CUDA kernel! Two in fact, one for each type of contraction. Each of these kernels contracts the fermions in slightly different ways. computeColorContraction
will contract only the colour indices, giving 16 complex numbers per lattice site, one for each \mu, and \nu combination. computeDegrandRossiContraction
will insert all 16 Gamma matrices into the contraction, and also giving 16 complex numbers, one for each unique gamma matrix combination. Here we also see how the autotuner works. The standard CUDA kernel arguments of blocks, threads, and shared memory size are controlled by the TuneParam tp
object.
#endif
} else {
errorQuda("CPU not supported yet\n");
}
}
Lastly, we see that even though QUDA is moving towards CPU compilation, not all kernels are as yet up to spec. These last few methods are fairly self explanatory:
TuneKey tuneKey() const { return TuneKey(x.VolString(), typeid(*this).name(), aux); }
void preTune() {}
void postTune() {}
long long flops() const
{
if (cType == QUDA_CONTRACT_TYPE_OPEN)
return 16 * 3 * 6ll * x.Volume();
else
return ((16 * 3 * 6ll) + (16 * (4 + 12))) * x.Volume();
}
long long bytes() const
{
return x.Bytes() + y.Bytes() + x.Nspin() * x.Nspin() * x.Volume() * sizeof(complex<real>);
}
};
The TuneKey tuneKey()
objects simply defines the kernel and it's launch parameters, the rest are used for bookkeeping so that QUDA can properly report on performance.
So, in terms of the deceptively simple function:
template <typename real>
void contract_quda(const ColorSpinorField &x, const ColorSpinorField &y, complex<real> *result,
const QudaContractType cType)
{
ContractionArg<real> arg(x, y, result);
Contraction<real, ContractionArg<real>> contraction(arg, x, y, cType);
contraction.apply(0);
qudaDeviceSynchronize();
}
have broadly covered the construction and application of the compute kernel. Let us now turn to the compute kernel itself to finish the story.
There are two compute kernels in this class, we shall explore computeColorContraction
as an example. The kernel is templetised on the data type (real
can be float
or double
) and the ContractionArg
which is given the typename Arg
. This is done to tha, for example, one can create a different type of ContractionArg
with say 4 colours and 1 spin.
template <typename real, typename Arg> __global__ void computeColorContraction(Arg arg)
{
int x_cb = threadIdx.x + blockIdx.x * blockDim.x;
int parity = threadIdx.y + blockIdx.y * blockDim.y;
The above are some familiar CUDA kernel variables denoting the thread and block indices. The .x
and .y
aspects are purely a convenience. Lattice spacetime indices can be split into ODD and EVEN, which are independent of each other for certain operations. When accessing the fermion data, we refer to a lattice point x
using the checkerboard index x_cb
and its parity (EVEN or ODD).
if (x_cb >= arg.threads) return;
The streaming multiprocessors must launch warps of size 32. If the lattice dimensions are not evenly divided by 32, then some threads will have x_cb
values that are larger than the lattice volume. If allowed to continue, these threads will attempt to access out-of-bounds memory, causing a segfault, hence the above (very important) line.
constexpr int nSpin = Arg::nSpin;
constexpr int nColor = Arg::nColor;
typedef ColorSpinor<real, nColor, nSpin> Vector;
Vector x = arg.x(x_cb, parity);
Vector y = arg.y(x_cb, parity);
Matrix<complex<real>, nSpin> A;
These are some variables and data types used in the computation. Notice two different ways in which data from the arg
structure is accessed. static constexpr
variables use the Arg::
method, while pointers to data use arg.
typedef ColorSpinor<real, nColor, nSpin> Vector;
is a convenience for a struct called ColorSpinor<real, nColor, nSpin>
. As its name implies, it is a QUDA object that contains the nColor x nSpin
complex components of precision real
. This object is populated using the x
and y
fermion vectors cast as colorspinor_mapper
objects defined in ContrcationArg
. This is how we instruct the kernel to load data from global memory space. The object A is a QUDA object, an nSpin X nSpin
matrix of complex. It will store our results.
#pragma unroll
for (int mu = 0; mu < nSpin; mu++) {
#pragma unroll
for (int nu = 0; nu < nSpin; nu++) {
// Color inner product: <\phi(x)_{\mu} | \phi(y)_{\nu}>
// The Bra is conjugated
A(mu, nu) = innerProduct(x, y, mu, nu);
}
}
arg.s.save(A, x_cb, parity);
}
This is the body of the compute. We have now pulled in data from the two fermion fields x
and y
. We populate the A array using a function called innerProduct
which is defined here. Once the loop is complete, we save the data to the output array, using a handy member function of the matrix_field
object s
defined in the argument structure. This will simply save the data in A
at the point specified by x_cb
and parity
.
This completes the exposition of the pure fermion contraction kernel. You have seen how one can take raw host data, move it to the GPU, perform a computation, then move results back to the host.