Writing QUDA Kernels - lattice/quda GitHub Wiki

This page describes the QUDA convention in how to add new kernels to QUDA. This serves as both best-practice advice for kernel performance, a style guide to ensure uniformity across QUDA, and also to promote an abstracted programming style that emphasises readability and code portability.

Overview

This documentation will be split into two relatively separate parts,

  1. Part 1: A colloquial, somewhat informal, example driven explanation of how to write or modify a QUDA kernel
  2. Part 2: A more formal, abstract, and general explanation of how QUDA kernels are written.

Users new to QUDA and/or users new to GPU computing are advised to read Part 1 before tackling Part 2 so that they can familiarise themselves with the vernacular and common structures used in QUDA. Experienced programmers may find reading Part 1 useful, but may also find that they can write a QUDA kernel immediately after reading Part 2.

Writing QUDA Kernels Part 1:

Broadly speaking, there are three types of QUDA kernel one can write:

  1. Pure fermion
  2. Pure gauge
  3. Both fermion and gauge

For example, the contraction kernel routines are pure fermion. Here, one need only load fermion fields and an array to store the results of a contraction. These are the simplest types of QUDA kernel there are as they do not involve any form of communication: all the data one needs for a contraction is already on the MPI node.

As a second example, consider the stout-smearing kernel. This is a pure-gauge routine, and loading gauge fields to QUDA kernels is a touch more involved. Furthermore, gauge fields routines generally involve using 'staples' made up of gauge field links that reside between adjacent lattice points. This naturally involves communication for lattice spaces at the MPI boundary, which QUDA handles using extended fields (as opposed to explicit ghost zone).

Thirdly, we will look at the Wilson dslash kernel. This is a kernel that uses both fermion and gauge fields (though the gauge fields will remain unchanged during the execution of the Wilson kernel.) We will see how QUDA uses ghost zones to overlap communications and computations of the fermion field.

Let us then follow the code path from the interface_quda.cpp routine void contractQuda(...) to see what is involved in writing a pure fermion kernel. This will be an 'stream of consciousness` introduction to a QUDA kernel, with a linear narrative following the code path. For a more abstract explanation, please refer to the summary.

contractQuda: A pure fermion field CPU function

The contractQuda routine in defined lib/interface_quda.cpp and declared in include/quda.h accepts three pointers and some meta data. The pointers are the two fermion fields to be contracted and an array that will store the result of the contraction, which is the 16 \mu \nu contractions associated with the operation \Sum_{a} \phi_{a,\mu, X} \psi_{a,\nu, X} per lattice site.

void contractQuda(const void *hp_x, const void *hp_y, void *h_result, const QudaContractType cType,
                  QudaInvertParam *param, const int *X)
{

This is a host function. It is passed pointers to arrays defined on the host. Its function is to move the data from the host to the device, call the GPU functions that will perform a contraction, then move the results from the device back to the host. We will go through it line my line so as to leave no stone unturned.

  profileContract.TPSTART(QUDA_PROFILE_TOTAL);
  profileContract.TPSTART(QUDA_PROFILE_INIT);

These objects are defined in the interface_quda.cpp file. They will collect timing data for the various aspects of the function, such as time spent computing, moving data, etc. We start the profiling with TPSTART(QUDA_PROFILE_TOTAL) and then measure specific aspects with TPSTART(QUDA_PROFILE_<aspect>). This is the initialisation.

  // wrap CPU host side pointers
  ColorSpinorParam cpuParam((void *)hp_x, *param, X, false, param->input_location);
  ColorSpinorField *h_x = ColorSpinorField::Create(cpuParam);

We begin by creating a QUDA object ColorSpinorParam that will collect the data it needs to make a ColorSpinorField object. It is given, as one of its arguments, a pointer to the host data hp_x. Thus, the resulting ColorSpinorField pointer *h_x points to the data associated with hp_x. We change the ColorSpinorParam pointer to the other host pointer

  cpuParam.v = (void *)hp_y;
  ColorSpinorField *h_y = ColorSpinorField::Create(cpuParam);

in order to create a ColorSpinorField wrapper for hp_y. This is just done so that we may wrap the raw lattice data in a convenient QUDA object. Next, we create a ColorSpinorParam object that will describe ColorSpinorField objects on the GPU.

  // Create device parameter
  ColorSpinorParam cudaParam(cpuParam);
  cudaParam.location = QUDA_CUDA_FIELD_LOCATION;
  cudaParam.create = QUDA_NULL_FIELD_CREATE;
  // Quda uses Degrand-Rossi gamma basis for contractions and will
  // automatically reorder data if necessary.
  cudaParam.gammaBasis = QUDA_DEGRAND_ROSSI_GAMMA_BASIS;
  cudaParam.setPrecision(cpuParam.Precision(), cpuParam.Precision(), true);

Notice some important aspects are changed. After cloning the cpuParam object, we change the location to QUDA_CUDA_FIELD_LOCATION. This means that ColorSpinorField objects created using cudaParam will live on device memory. cudaParam.create = QUDA_NULL_FIELD_CREATE; simply means that the data does not need to be initialised. We declare that the GPU data should be in QUDA_DEGRAND_ROSSI_GAMMA_BASIS order, and then we choose the GPU precision. Finally, we create the GPU ColorSpinorField objects.

  std::vector<ColorSpinorField *> x, y;
  x.push_back(ColorSpinorField::Create(cudaParam));
  y.push_back(ColorSpinorField::Create(cudaParam));

We use std::vector so that future developers may more easily extend this code to handle multiple vectors. We next create some device memory to store the result of the contraction.

  size_t data_bytes = x[0]->Volume() * x[0]->Nspin() * x[0]->Nspin() * 2 * x[0]->Precision();
  void *d_result = pool_device_malloc(data_bytes);
  profileContract.TPSTOP(QUDA_PROFILE_INIT);

This competes the initialisation. Next, we transfer the host lattice data to the device.

  profileContract.TPSTART(QUDA_PROFILE_H2D);
  *x[0] = *h_x;
  *y[0] = *h_y;
  profileContract.TPSTOP(QUDA_PROFILE_H2D);

Yes... that's right. No... I didn't miss out any code. QUDA's ColorSpinorField objects are very smart indeed! The objects on the right are on the host. The objects on the left are on the device. If the = operator is used for ColorSpinorField objects that are located on different memory spaces, it is implied that that a data transfer must take place, and it is done automatically for the user. We next perform the GPU computation which is explained in the next section,

  profileContract.TPSTART(QUDA_PROFILE_COMPUTE);
  contractQuda(*x[0], *y[0], d_result, cType);
  profileContract.TPSTOP(QUDA_PROFILE_COMPUTE);

and finally we transfer (manually this time!) the device result to the host:

  profileContract.TPSTART(QUDA_PROFILE_D2H);
  qudaMemcpy(h_result, d_result, data_bytes, cudaMemcpyDeviceToHost);
  profileContract.TPSTOP(QUDA_PROFILE_D2H);

Last, we be good programmers and clean up memory,

  profileContract.TPSTART(QUDA_PROFILE_FREE);
  pool_device_free(d_result);
  delete x[0];
  delete y[0];
  delete h_y;
  delete h_x;
  profileContract.TPSTOP(QUDA_PROFILE_FREE);
  profileContract.TPSTOP(QUDA_PROFILE_TOTAL);
}

That is about as simple as a QUDA function can get, but all the main points are there:

  1. Wrap raw host data with a convenient QUDA object.
  2. Transfer host data to the device
  3. Perform the GPU compute
  4. Transfer the result bask to the host
  5. Clean up memory.

We will see more complicated functions later, but for now this is enough. Let us then move on to the GPU function.

contractQuda: A pure fermion field GPU function.

The interface function has called the GPU function contractQuda. Here is is in full:

void contractQuda(const ColorSpinorField &x, const ColorSpinorField &y, void *result, const QudaContractType cType)                                                                                                                         
  {
#ifdef GPU_CONTRACT
    checkPrecision(x, y);

    if (x.GammaBasis() != QUDA_DEGRAND_ROSSI_GAMMA_BASIS || y.GammaBasis() != QUDA_DEGRAND_ROSSI_GAMMA_BASIS)
      errorQuda("Unexpected gamma basis x=%d y=%d", x.GammaBasis(), y.GammaBasis());
    if (x.Ncolor() != 3 || y.Ncolor() != 3) errorQuda("Unexpected number of colors x=%d y=%d", x.Ncolor(), y.Ncolor());
    if (x.Nspin() != 4 || y.Nspin() != 4) errorQuda("Unexpected number of spins x=%d y=%d", x.Nspin(), y.Nspin());

    if (x.Precision() == QUDA_SINGLE_PRECISION) {
      contract_quda<float>(x, y, (complex<float> *)result, cType);
    } else if (x.Precision() == QUDA_DOUBLE_PRECISION) {
      contract_quda<double>(x, y, (complex<double> *)result, cType);
    } else {
      errorQuda("Precision %d not supported", x.Precision());
    }

#else
    errorQuda("Contraction code has not been built");
#endif
  }

The first thing to notice is that the body of the code is encapsulated in a preprocessor definition, which one can adjust at CMake time. If the QUDA_CONTRACT CMake option is not set, none of the following code will be compiled. This is an important feature to include as it keeps compile time to a minimum. After some checks, we see that the contract_quda function is instantiated on the desired precision:

template <typename real>
  void contract_quda(const ColorSpinorField &x, const ColorSpinorField &y, complex<real> *result,
                     const QudaContractType cType)
  {
    ContractionArg<real> arg(x, y, result);
    Contraction<real, ContractionArg<real>> contraction(arg, x, y, cType);
    contraction.apply(0);
    qudaDeviceSynchronize();
  }

This deceptively simple function executes the entire contraction kernel. We must study it line by line.

Parameter argument struct

ContractionArg<real> arg(x, y, result);

This line is little more than an object that will store the pointers to data on which we will work. You can see that it accepts the x and y fields (the two fermions to be contracted) and the result. It is templated on the precision. Here is the full contractionArg structure, which is located in quda/include/kernels/contraction.cuh

template <typename real> struct ContractionArg {
    int threads; // number of active threads required                                                                                                                                                                                         
    int X[4];    // grid dimensions                                                                                                                                                                                                           

    static constexpr int nSpin = 4;
    static constexpr int nColor = 3;
    static constexpr bool spin_project = true;
    static constexpr bool spinor_direct_load = false; // false means texture load                                                                                                                                                             

    // Create a typename F for the ColorSpinorField (F for fermion)                                                                                                                                                                           
    typedef typename colorspinor_mapper<real, nSpin, nColor, spin_project, spinor_direct_load>::type F;

    F x;
    F y;
    matrix_field<complex<real>, nSpin> s;

    ContractionArg(const ColorSpinorField &x, const ColorSpinorField &y, complex<real> *s) :
      threads(x.VolumeCB()),
      x(x),
      y(y),
      s(s, x.VolumeCB())
    {
      for (int dir = 0; dir < 4; dir++) X[dir] = x.X()[dir];
    }
  };

This structure is fairly complex, but none of the elements are beyond the comprehension of a student of lattice gauge theory with a rudimentary knowledge of GPU computing. We first define the variables in the argument structure:

    int threads; // number of active threads required

Naturally, this is the number of threads in the block

    int X[4];    // grid dimensions                                                                                                                                                                                                           

This is the number of lattice sites on the MPI node

    static constexpr int nSpin = 4;
    static constexpr int nColor = 3;
    static constexpr bool spin_project = true;
    static constexpr bool spinor_direct_load = false; // false means texture load                                                                                                                                                             

These vales are hardcoded to prevent template explosion. This particular kernel will only work for spin=4, color=3 fermion fields. The spin_project boolean is a neat feature of QUDA that will automatically convert any gamma matrix ordered fermion into the desired gamma matrix order as it is loaded into the kernel. (KATE: do we need spinor_direct_load now that textures are eliminated?)

    // Create a typename F for the ColorSpinorField (F for fermion)                                                                                                                                                                           
    typedef typename colorspinor_mapper<real, nSpin, nColor, spin_project, spinor_direct_load>::type F;
    F x;
    F y;

This object is a QUDA data type that will handle the way in which fermions are passed to the kernel. As you can see, it needs to know the precision, color, and spin of the fermion, and well as any operations (such as spin_project) that need to be performed during the load. We give it a name F.

    matrix_field<complex<real>, nSpin> s;

This is a QUDA data type that will store the contraction. An nSpin X nSpin matrix of type complex<real>

We now come to the constructor of this argument structure:

ContractionArg(const ColorSpinorField &x, const ColorSpinorField &y, complex<real> *s) :
      threads(x.VolumeCB()),
      x(x),
      y(y),
      s(s, x.VolumeCB())
    {
      for (int dir = 0; dir < 4; dir++) X[dir] = x.X()[dir];
    }

This is what was called by the ContractionArg<real> arg(x, y, result); line. threads in is initialised with the checkerboarded sub-volume of a fermion field, the two fermion fields are, naturally, x, and y, and the result array is initialised with the address and the number of threads. The actual body of the argument structure is to then populate the X array with the MPI lattice dimensions.

This argument structure provides the kernel with the pointers it needs to locate data, and the parameters it needs to perform it's computation.

Instantiation of the kernel

Now that all the parameters and pointers have been established, we construct the kernel that will perform the computation:

Contraction<real, ContractionArg<real>> contraction(arg, x, y, cType);

This is a very subtle but powerful step. As you can see, the function named contraction is a Contraction type object, templetised on the specialised object ContractionArg. This allows us to Bake-In data types such as

typedef typename colorspinor_mapper<real, nSpin, nColor, spin_project, spinor_direct_load>::type F;

and parameter values

static constexpr int nSpin = 4;
static constexpr int nColor = 3;

which make life easier for the NVCC compiler. This is a common strategy utilised by the majority of QUDA kernels.

The next thing to look at is the Contraction<real, ContractionArg<real>> class itself. This is a fair chunk of abstract code to understand, so we will do so very carefully.

template <typename real, typename Arg> class Contraction : TunableVectorY
  {

The first thing to notice is that this class inherits from TunableVectorY. In fact, all of QUDA's kernel classes inherit from some form of Tunable class. [see tuning section]. This allows QUDA to automagically experiment with block and thread dimensions to discover the optimal set-up for a given problem. The protected members of the class:

protected:
    Arg &arg;
    const ColorSpinorField &x;
    const ColorSpinorField &y;
    const QudaContractType cType;

are the argument structure we have already seen, the memory addresses of the two fermion fields to be contracted, and the type of contraction to compute. The private members:

private:
    bool tuneGridDim() const { return false; } // Don't tune the grid dimensions.
    unsigned int minThreads() const { return arg.threads; }

are explicit definitions of helper functions defined virtually in the Tunable class. We must define them explicitly whenever we create a class that inherits from Tunable. In this particular case, we instruct the autotuner to not tune the grid dimensions, and we also request that there be a minimum number of threads, given by arg.threads. This value was set when we created the ContractionArg structure. It was inferred from the checkerboard volume of one of the lattice fermions. Make sure you can identify in the body of the ContractionArg code where this happened. Next is the all important public interface of the class. The first method is the constructor:

public:
    Contraction(Arg &arg, const ColorSpinorField &x, const ColorSpinorField &y, const QudaContractType cType) :
      TunableVectorY(2),
      arg(arg),
      x(x),
      y(y),
      cType(cType)
    {
      switch (cType) {
      case QUDA_CONTRACT_TYPE_OPEN: strcat(aux, "open,"); break;
      case QUDA_CONTRACT_TYPE_DR: strcat(aux, "degrand-rossi,"); break;
      default: errorQuda("Unexpected contraction type %d", cType);
      }
      strcat(aux, x.AuxString());
#ifdef JITIFY
      create_jitify_program("kernels/contraction.cuh");
#endif
    }

It is fairly standard for a constructor. We initialise the class from which we inherit TunableVectorY(2), then initialise the arg, vectors, and contraction type. The object aux is from the Tunable class. It is simply a string that defines the kernel for the autotuner. One the kernel has tuned, the autotuner will save the optimal set-up to a file and, if the same kernel (as defined by the aux string) is called again, the autotuner will use the optimal setup immediately. The part that is encapsulated by the JITIFY define is an instruction to the compiler that the kernel is to be compiled at run time [See section on JITIFY] The destructor is about as boiler plate as it gets:

    virtual ~Contraction() {}

Now we arrive at the kernel launch member function. This function is actually a member of the Tunable class. This is how the autotuner is able to repeatedly launch the kernel with various parameters. When we call the apply method we are not strictly asking the Contraction object to apply the kernel, rather we are asking the Tunable parent class to do it for us.

    void apply(const qudaStream_t &stream)
    {
      if (x.Location() == QUDA_CUDA_FIELD_LOCATION) {
        TuneParam tp = tuneLaunch(*this, getTuning(), getVerbosity());

QUDA is evolving so that the C++ code can be compiled by both CUDA and possibly a host compiler such as gcc. Therefore, we place the CUDA specific code under a boolean x.Location() == QUDA_CUDA_FIELD_LOCATION so we know we are using a GPU. We also define a TuneParam object that is used by the Tunable class.

#ifdef JITIFY
        std::string function_name;
        switch (cType) {
        case QUDA_CONTRACT_TYPE_OPEN: function_name = "quda::computeColorContraction"; break;
        case QUDA_CONTRACT_TYPE_DR: function_name = "quda::computeDegrandRossiContraction"; break;
        default: errorQuda("Unexpected contraction type %d", cType);
        }

        using namespace jitify::reflection;
        jitify_error = program->kernel(function_name)
                         .instantiate(Type<real>(), Type<Arg>())
                         .configure(tp.grid, tp.block, tp.shared_bytes, stream)
                         .launch(arg);
#else

This, again, is for when Just In Time compilation is used. [See section on JITIFY]

        switch (cType) {
        case QUDA_CONTRACT_TYPE_OPEN: computeColorContraction<real><<<tp.grid, tp.block, tp.shared_bytes>>>(arg); break;
        case QUDA_CONTRACT_TYPE_DR:
          computeDegrandRossiContraction<real><<<tp.grid, tp.block, tp.shared_bytes>>>(arg);
          break;
        default: errorQuda("Unexpected contraction type %d", cType);
        }

Here we now, at long last, see the familiar triple chevron <<<...>>> syntax of a CUDA kernel! Two in fact, one for each type of contraction. Each of these kernels contracts the fermions in slightly different ways. computeColorContraction will contract only the colour indices, giving 16 complex numbers per lattice site, one for each \mu, and \nu combination. computeDegrandRossiContraction will insert all 16 Gamma matrices into the contraction, and also giving 16 complex numbers, one for each unique gamma matrix combination. Here we also see how the autotuner works. The standard CUDA kernel arguments of blocks, threads, and shared memory size are controlled by the TuneParam tp object.

#endif
      } else {
        errorQuda("CPU not supported yet\n");
      }
    }

Lastly, we see that even though QUDA is moving towards CPU compilation, not all kernels are as yet up to spec. These last few methods are fairly self explanatory:

    TuneKey tuneKey() const { return TuneKey(x.VolString(), typeid(*this).name(), aux); }

    void preTune() {}
    void postTune() {}

    long long flops() const
    {
      if (cType == QUDA_CONTRACT_TYPE_OPEN)
        return 16 * 3 * 6ll * x.Volume();
      else
        return ((16 * 3 * 6ll) + (16 * (4 + 12))) * x.Volume();
    }

    long long bytes() const
    {
      return x.Bytes() + y.Bytes() + x.Nspin() * x.Nspin() * x.Volume() * sizeof(complex<real>);
    }
  };

The TuneKey tuneKey() objects simply defines the kernel and it's launch parameters, the rest are used for bookkeeping so that QUDA can properly report on performance.

So, in terms of the deceptively simple function:

template <typename real>
  void contract_quda(const ColorSpinorField &x, const ColorSpinorField &y, complex<real> *result,
                     const QudaContractType cType)
  {
    ContractionArg<real> arg(x, y, result);
    Contraction<real, ContractionArg<real>> contraction(arg, x, y, cType);
    contraction.apply(0);
    qudaDeviceSynchronize();
  }

have broadly covered the construction and application of the compute kernel. Let us now turn to the compute kernel itself to finish the story.

Kernel structure

There are two compute kernels in this class, we shall explore computeColorContraction as an example. The kernel is templetised on the data type (real can be float or double) and the ContractionArg which is given the typename Arg. This is done to tha, for example, one can create a different type of ContractionArg with say 4 colours and 1 spin.

template <typename real, typename Arg> __global__ void computeColorContraction(Arg arg)
  {
    int x_cb = threadIdx.x + blockIdx.x * blockDim.x;
    int parity = threadIdx.y + blockIdx.y * blockDim.y;

The above are some familiar CUDA kernel variables denoting the thread and block indices. The .x and .y aspects are purely a convenience. Lattice spacetime indices can be split into ODD and EVEN, which are independent of each other for certain operations. When accessing the fermion data, we refer to a lattice point x using the checkerboard index x_cb and its parity (EVEN or ODD).

if (x_cb >= arg.threads) return;

The streaming multiprocessors must launch warps of size 32. If the lattice dimensions are not evenly divided by 32, then some threads will have x_cb values that are larger than the lattice volume. If allowed to continue, these threads will attempt to access out-of-bounds memory, causing a segfault, hence the above (very important) line.

    constexpr int nSpin = Arg::nSpin;
    constexpr int nColor = Arg::nColor;
    typedef ColorSpinor<real, nColor, nSpin> Vector;

    Vector x = arg.x(x_cb, parity);
    Vector y = arg.y(x_cb, parity);

    Matrix<complex<real>, nSpin> A;

These are some variables and data types used in the computation. Notice two different ways in which data from the arg structure is accessed. static constexpr variables use the Arg:: method, while pointers to data use arg. typedef ColorSpinor<real, nColor, nSpin> Vector; is a convenience for a struct called ColorSpinor<real, nColor, nSpin>. As its name implies, it is a QUDA object that contains the nColor x nSpin complex components of precision real. This object is populated using the x and y fermion vectors cast as colorspinor_mapper objects defined in ContrcationArg. This is how we instruct the kernel to load data from global memory space. The object A is a QUDA object, an nSpin X nSpin matrix of complex. It will store our results.

#pragma unroll
    for (int mu = 0; mu < nSpin; mu++) {
#pragma unroll
      for (int nu = 0; nu < nSpin; nu++) {
        // Color inner product: <\phi(x)_{\mu} | \phi(y)_{\nu}>
        // The Bra is conjugated
        A(mu, nu) = innerProduct(x, y, mu, nu);
      }
    }

    arg.s.save(A, x_cb, parity);
  }

This is the body of the compute. We have now pulled in data from the two fermion fields x and y. We populate the A array using a function called innerProduct which is defined here. Once the loop is complete, we save the data to the output array, using a handy member function of the matrix_field object s defined in the argument structure. This will simply save the data in A at the point specified by x_cb and parity.

This completes the exposition of the pure fermion contraction kernel. You have seen how one can take raw host data, move it to the GPU, perform a computation, then move results back to the host.

⚠️ **GitHub.com Fallback** ⚠️