Creating graphs explicitly - rshipley160/learn-cuda GitHub Wiki

Previous: Creating graphs with stream capture

In the last article we built our first graph using stream capture, a mechanism NVIDIA created to let developers quickly transition their stream-based workflows to graphs. In doing so, we discovered a couple of quirks of the process and learned that, while a graph can be built by capturing a stream workflow, that workflow must be laid out carefully in order for it to be recorded properly.

So what do we do if we don't already have an existing workflow to capture, or if we want finer-grained control over the structure and contents of our graph?
In these cases, we can use the graph portion of the CUDA runtime API to define and build our graph from the ground up. To do this, we create each node in the graph individually; a node represents a single instance of a certain type of operation - a kernel launch, a memory copy, a host function call, an event, and so on. Almost every operation that can be done in a stream can also be done in a graph.

Creating Nodes

The creation process for every node that can be added to a graph looks largely the same. Let's take a look at the steps required to create the b squared node for what will become the third iteration of our quadratic formula solver, this time built using the explicit graph API.
The b squared node creation process looks like this:

    cudaGraphNode_t bSquaredNode;
    cudaKernelNodeParams bSquaredParams = {0};
    bSquaredParams.blockDim = BLOCK_SIZE;
    bSquaredParams.gridDim = gridSize;
    bSquaredParams.func = (void *)elementwiseProduct;
    void *bSquaredfunc_params[4] = {(void *)&b, (void *)&b, (void *)&sol1, (void *) &NUM_ELEMENTS};
    bSquaredParams.kernelParams = (void **)bSquaredfunc_params;

Taking this a chunk at a time, we first have

    cudaGraphNode_t bSquaredNode;
    cudaKernelNodeParams bSquaredParams = {0};

This declares the node itself, which is the object that will be added to the graph, and also declares and zero-initializes the parameter struct that defines the node's properties.
Every node, regardless of its type, has the data type cudaGraphNode_t; however, each type of node has a different parameter struct with different members. These parameter structs are documented in the CUDA Graph Node Glossary and in the Graph Management section of the CUDA Runtime API documentation.
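As an example of a different node type, a 1D host-to-device memory copy node is described by a cudaMemcpy3DParms struct (the 1D copy is expressed in 3D terms, with the width given in bytes) and added with cudaGraphAddMemcpyNode. A sketch, assuming a host array h_a, a device pointer d_a, and a graph someGraph already exist:

    cudaGraphNode_t copyNode;
    cudaMemcpy3DParms copyParams = {0};
    copyParams.srcPtr = make_cudaPitchedPtr(h_a, NUM_ELEMENTS * sizeof(float), NUM_ELEMENTS, 1);
    copyParams.dstPtr = make_cudaPitchedPtr(d_a, NUM_ELEMENTS * sizeof(float), NUM_ELEMENTS, 1);
    copyParams.extent = make_cudaExtent(NUM_ELEMENTS * sizeof(float), 1, 1);
    copyParams.kind = cudaMemcpyHostToDevice;
    cudaGraphAddMemcpyNode(&copyNode, someGraph, NULL, 0, &copyParams);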

    bSquaredParams.blockDim = BLOCK_SIZE;
    bSquaredParams.gridDim = gridSize;
    bSquaredParams.func = (void *)elementwiseProduct;
    void *bSquaredfunc_params[4] = {(void *)&b, (void *)&b, (void *)&sol1, (void *) &NUM_ELEMENTS};
    bSquaredParams.kernelParams = (void **)bSquaredfunc_params;

After initializing the params struct, we then fill it. Here, we are just setting the grid and block sizes as well as the kernel that we actually want the node to run. This is analogous to the kernel launches we used in the last two versions of this algorithm:

    elementwiseProduct<<<gridSize, BLOCK_SIZE>>>(b, b, sol1, NUM_ELEMENTS);

You might be asking why there are so many void pointers in this segment of code. The answer is that void pointers give the graph API the flexibility to run virtually any kernel you throw at it: casting to a void pointer temporarily erases the type information that would otherwise cause errors, so the function and its arguments can be passed to the node without issue.

As an aside, I would like to explicitly emphasize that the kernel you wish to run must be cast to a void pointer before it can be assigned to the .func member of the parameter struct. Similarly, the kernel arguments must be placed in an array of void pointers, each element being the address of one kernel argument cast to a void pointer. That array is then cast to void **, which allows the graph to accept kernel parameter lists of varying length.

Following the same process, we can create the seven other nodes needed to implement the quadratic solver algorithm. Note that if the exact same operation were needed twice, with all parameters identical, the same parameter struct could be used to add a node to the graph twice; there would be no need to create a second definition with the same information.

Unfortunately for us, though, all of our nodes are slightly different, so we will need a total of 8 different node definitions.

    cudaGraphNode_t bSquaredNode;
    cudaKernelNodeParams bSquaredParams = {0};
    bSquaredParams.blockDim = BLOCK_SIZE;
    bSquaredParams.gridDim = gridSize;
    bSquaredParams.func = (void *)elementwiseProduct;
    void *bSquaredfunc_params[4] = {(void *)&b, (void *)&b, (void *)&sol1, (void *) &NUM_ELEMENTS};
    bSquaredParams.kernelParams = (void **)bSquaredfunc_params;

    cudaGraphNode_t neg4acNode;
    cudaKernelNodeParams neg4acParams = {0};
    neg4acParams.blockDim = BLOCK_SIZE;
    neg4acParams.gridDim = gridSize;
    neg4acParams.func = (void *)elementScalarProduct;
    const float NEG_FOUR = -4.0;
    void *neg4acfunc_params[5] = {(void *)&a, (void *)&c, (void *)&NEG_FOUR, (void *)&sol2, (void *) &NUM_ELEMENTS};
    neg4acParams.kernelParams = (void **)neg4acfunc_params;

    cudaGraphNode_t determinantSumNode;
    cudaKernelNodeParams determinantSumParams = {0};
    determinantSumParams.blockDim = BLOCK_SIZE;
    determinantSumParams.gridDim = gridSize;
    determinantSumParams.func = (void *)elementwiseSum;
    void *determinantSumfunc_params[4] = {(void *)&sol1, (void *)&sol2, (void *)&sol1, (void *) &NUM_ELEMENTS};
    determinantSumParams.kernelParams = (void **)determinantSumfunc_params;

    cudaGraphNode_t determinantSqrtNode;
    cudaKernelNodeParams determinantSqrtParams = {0};
    determinantSqrtParams.blockDim = BLOCK_SIZE;
    determinantSqrtParams.gridDim = gridSize;
    determinantSqrtParams.func = (void *)elementwiseSqrt;
    void *determinantSqrtfunc_params[3] = {(void *)&sol1, (void *)&tmp, (void *) &NUM_ELEMENTS};
    determinantSqrtParams.kernelParams = (void **)determinantSqrtfunc_params;
    
    cudaGraphNode_t bPlusNode;
    cudaKernelNodeParams bPlusParams = {0};
    bPlusParams.blockDim = BLOCK_SIZE;
    bPlusParams.gridDim = gridSize;
    bPlusParams.func = (void *)elementwiseSum;
    void *bPlusfunc_params[4] = {(void *)&tmp, (void *)&b, (void *)&sol1, (void *) &NUM_ELEMENTS};
    bPlusParams.kernelParams = (void **)bPlusfunc_params;
    
    cudaGraphNode_t bMinusNode;
    cudaKernelNodeParams bMinusParams = {0};
    bMinusParams.blockDim = BLOCK_SIZE;
    bMinusParams.gridDim = gridSize;
    bMinusParams.func = (void *)elementwiseDifference;
    void *bMinusfunc_params[4] = {(void *)&b, (void *)&tmp, (void *)&sol2, (void *) &NUM_ELEMENTS};
    bMinusParams.kernelParams = (void **)bMinusfunc_params;

    cudaGraphNode_t bPlusQuotientNode;
    cudaKernelNodeParams bPlusQuotientParams = {0};
    bPlusQuotientParams.blockDim = BLOCK_SIZE;
    bPlusQuotientParams.gridDim = gridSize;
    bPlusQuotientParams.func = (void *)elementwiseQuotient;
    const float ONE_HALF = 0.5;
    void *bPlusQuotientfunc_params[5] = {(void *)&sol1, (void *)&a, (void *)&ONE_HALF, (void *)&sol1, (void *) &NUM_ELEMENTS};
    bPlusQuotientParams.kernelParams = (void **)bPlusQuotientfunc_params;

    cudaGraphNode_t bMinusQuotientNode;
    cudaKernelNodeParams bMinusQuotientParams = {0};
    bMinusQuotientParams.blockDim = BLOCK_SIZE;
    bMinusQuotientParams.gridDim = gridSize;
    bMinusQuotientParams.func = (void *)elementwiseQuotient;
    void *bMinusQuotientfunc_params[5] = {(void *)&sol2, (void *)&a, (void *)&ONE_HALF, (void *)&sol2, (void *) &NUM_ELEMENTS};
    bMinusQuotientParams.kernelParams = (void **)bMinusQuotientfunc_params;

Assembling the Graph

Now that all of our nodes and their parameter data structures have been defined, we can create the graph itself and add the nodes to it in the order that they should be executed. First we create an empty graph, then add the initial concurrent operations of calculating b squared and -4ac.

    cudaGraph_t quadraticGraph;
    cudaGraphCreate(&quadraticGraph, 0);

    cudaGraphAddKernelNode(&bSquaredNode, quadraticGraph, NULL, 0, &bSquaredParams);

    cudaGraphAddKernelNode(&neg4acNode, quadraticGraph, NULL, 0, &neg4acParams);

As you can see, to add the node to the graph, we use the cudaGraphAddKernelNode function, which is passed a total of 5 parameters:

  • the address of the node we want to add
  • the graph we are adding it to
  • the nodes that this node is dependent on
  • the number of dependencies
  • the parameters for the kernel to be run

These two calls both pass NULL and 0 for their dependencies and number of dependencies, respectively. This tells CUDA that both of these nodes are at the top of the graph (they are root nodes) and that both are free to run at the same time as soon as the graph begins executing.

Tracking Dependencies in Graphs

With our two root nodes added, let's add the next node, which is supposed to add the two prior values together. It will look a bit like this:

    cudaGraphAddKernelNode(&determinantSumNode, quadraticGraph, ???, 2, &determinantSumParams);

Ignoring the elephant in the code, you can see we once again provide the add kernel node function with a node, the graph, and the kernel parameters, as well as the number 2 so that it knows that the determinant sum node is dependent upon two prior nodes in the graph. But how do we tell it which two nodes it is dependent on?

Well, if it were only dependent on one node, we could simply pass the address of that node as the third parameter and be done, but as there are two, we have to be a bit more creative.
To add more than one node as a dependency, we need a single variable that can hold multiple items, and since different nodes have different dependencies and different numbers of dependencies, variable length would be good as well, so we will use a std::vector to hold our node dependencies.

If we revise the graph assembly code we have so far to include the dependencies vector, it will look something like this:

    std::vector<cudaGraphNode_t> nodeDependencies;

    cudaGraphAddKernelNode(&bSquaredNode, quadraticGraph, NULL, 0, &bSquaredParams);
    nodeDependencies.push_back(bSquaredNode);

    cudaGraphAddKernelNode(&neg4acNode, quadraticGraph, NULL, 0, &neg4acParams);
    nodeDependencies.push_back(neg4acNode);

    cudaGraphAddKernelNode(&determinantSumNode, quadraticGraph, nodeDependencies.data(), 2, &determinantSumParams);

    nodeDependencies.clear();
    nodeDependencies.push_back(determinantSumNode);

So we initialize our vector of dependencies and add nodes as before, except we push_back each dependency onto the vector so it can be retrieved later on. This lets us add our determinant sum node using the array of nodes returned by nodeDependencies.data(). After the determinant sum has been added, we no longer need to track the b squared and -4ac nodes as dependencies, because everything else proceeds from the determinant sum, so we clear our dependency list and restart it with the determinant sum node as the only entry.
We can continue following this pattern to build the remainder of the graph:

    cudaGraphAddKernelNode(&determinantSqrtNode, quadraticGraph, nodeDependencies.data(), 1, &determinantSqrtParams);

    nodeDependencies.clear();
    nodeDependencies.push_back(determinantSqrtNode);


    cudaGraphAddKernelNode(&bPlusNode, quadraticGraph, nodeDependencies.data(), 1, &bPlusParams);
    cudaGraphAddKernelNode(&bMinusNode, quadraticGraph, nodeDependencies.data(), 1, &bMinusParams);

    nodeDependencies.clear();
    nodeDependencies.push_back(bPlusNode);

    cudaGraphAddKernelNode(&bPlusQuotientNode, quadraticGraph, nodeDependencies.data(), 1, &bPlusQuotientParams);

    nodeDependencies.clear();
    nodeDependencies.push_back(bMinusNode);

    cudaGraphAddKernelNode(&bMinusQuotientNode, quadraticGraph, nodeDependencies.data(), 1, &bMinusQuotientParams);

There are two things to note here: the first is that bPlusNode and bMinusNode are both added without altering the dependency chain. This establishes that they are both dependent on the same prior node and can both start working concurrently as soon as the node they are dependent on finishes.
The other thing to note is that the dependencies are cleared and altered between the additions of bPlusQuotientNode and bMinusQuotientNode. This is important because it maintains the separation of dependencies and thus the concurrency of the two quotient nodes. If the dependencies were not cleared and both quotients were dependent upon both the addition and subtraction operations, the concurrency we established in our initial algorithm would be lost.
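As an aside, the clear-and-push_back bookkeeping above gets repetitive in larger graphs. It can be hidden behind a small convenience wrapper (a hypothetical helper, not part of the CUDA API):

    // Hypothetical helper: add a kernel node whose dependencies are
    // exactly the nodes in deps, and return its handle.
    cudaGraphNode_t addKernelNode(cudaGraph_t graph, cudaKernelNodeParams &params,
                                  const std::vector<cudaGraphNode_t> &deps)
    {
        cudaGraphNode_t node;
        cudaGraphAddKernelNode(&node, graph, deps.data(), deps.size(), &params);
        return node;
    }

With this, each addition becomes a one-liner such as addKernelNode(quadraticGraph, bPlusParams, {determinantSqrtNode}).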

Running the Graph

Running a graph built using the explicit API is no different from running a graph made using stream capture; all that is needed is the instantiation and launch of the graph, followed by synchronization of the stream running the graph before destroying it. Here is the rest of our explicit graph program:

    cudaGraphExec_t graphExecutable;
    cudaGraphInstantiate(&graphExecutable, quadraticGraph, NULL, NULL, 0);

    cudaGraphLaunch(graphExecutable, 0);

    cudaStreamSynchronize(0);

    cudaGraphDestroy(quadraticGraph);
    cudaGraphExecDestroy(graphExecutable);

    cudaFree(a);
    cudaFree(b);
    cudaFree(c);
    cudaFree(tmp);
    cudaFree(sol1);
    cudaFree(sol2);
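One caveat: none of the calls above check for errors. Every CUDA runtime call, including the graph functions, returns a cudaError_t that is worth inspecting in real code. A common pattern is a small checking macro (the macro name here is arbitrary), applied to each call:

    #define CHECK_CUDA(call)                                              \
        do {                                                              \
            cudaError_t err_ = (call);                                    \
            if (err_ != cudaSuccess) {                                    \
                fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                        cudaGetErrorString(err_), __FILE__, __LINE__);    \
                exit(EXIT_FAILURE);                                       \
            }                                                             \
        } while (0)

    CHECK_CUDA(cudaGraphInstantiate(&graphExecutable, quadraticGraph, NULL, NULL, 0));
    CHECK_CUDA(cudaGraphLaunch(graphExecutable, 0));
    CHECK_CUDA(cudaStreamSynchronize(0));

This is especially useful with graphs, since a mistake in a node's parameters often only surfaces as an error code at instantiation or launch time.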

Now that you have built your first graphs using both stream capture and the explicit API, you are ready to move on to more advanced graph knowledge and start implementing your algorithms efficiently using graphs.

Next: Using host functions with graphs and streams