Creating graphs with stream capture - rshipley160/learn-cuda GitHub Wiki

Previous: CUDA Graphs

Since we've now covered the basics of what a graph is and the motivation for using a graph over other asynchronous methods, let's take a look at the first way we can actually create and launch a graph in CUDA: stream capture.

The stream capture method was implemented so that existing workflows could be converted to graphs with minimal effort on the programmer's part. The main idea is that we can tell CUDA to capture the work a stream is doing, (rather than actually doing the work) and at the end of the capture CUDA will transform the captured workflow into a fully functional graph.

Graph Creation

To get started, we'll copy the quadratic_solver program from the Events and Dependencies tutorial into a new file that will become a graph implementation of the quadratic formula solver that we created in that article.

The first thing we'll need to add is the graph object itself. This follows mostly the same pattern as stream and event creation.

cudaGraph_t quadraticGraph;
cudaGraphCreate(&quadraticGraph, 0);

cudaGraphExec_t graphExecutable;

We also went ahead and defined the executable graph that will be used to evaluate our quadratic solver. The graph and executable graph are analogous to source code and compiled code; the graph object serves to define the workflow and can be easily modified to fit its needs, while the executable graph can only execute the same workflow once it has been built, and is able to perform optimizations based on the workflow in much the same way that a compiler is.
This "compilation" step of the executable graph can only happen once the entire graph has been defined - after all, you can only compile your source code after it's been written, not before.

Stream Capture

Next we need to capture the contents of our workflow for the graph, which we can do by using the graph API to begin and end the stream capture.

    cudaStreamBeginCapture(bMinus, cudaStreamCaptureModeGlobal);
        elementwiseProduct<<<gridSize, BLOCK_SIZE, 0, bMinus>>>(b, b, sol1, NUM_ELEMENTS);
        elementscalarProduct<<<gridSize, BLOCK_SIZE, 0, bPlus>>>(a, c, -4, sol2, NUM_ELEMENTS);
        
        //...

        elementwiseQuotient<<<gridSize, BLOCK_SIZE, 0, bMinus>>>(sol1, a, 0.5, sol1, NUM_ELEMENTS);
        elementwiseQuotient<<<gridSize, BLOCK_SIZE, 0, bPlus>>>(sol2, a, 0.5, sol2, NUM_ELEMENTS);
    cudaStreamEndCapture(bMinus, &quadraticGraph);

In case you don't recall from the events tutorial, bMinus and bPlus are two streams that are being used to solve the quadratic equation. The second parameter to cudaStreamBeginCapture, cudaStreamCaptureModeGlobal, is one of three possible capture modes that determine the programming paradigm being used for the workflow. Other options include stream-per-thread capture, which ties into the larger per-thread stream model introduced in CUDA 7.0, and a "relaxed" mode which allows for potentially unsafe calls to the CUDA API that may be required for some applications.
For our purposes, global capture mode is sufficient because we are not using the per-thread default stream model. When we end the capture, we once again supply the stream that we want to capture on in addition to the pointer of our graph object. This allows the graph to be updated to reflect the captured workflow.

Capturing Multiple Streams in a Graph

If we were converting a single-stream workflow into a graph, we would already be done - the work done by the stream would be directly captured and transformed into a graph that can then be "compiled" into an executable workflow. However, because we have multiple streams, we have to somehow signal to the stream capture that it needs to capture multiple streams, and we also have to ensure that all of the streams captured are able to complete before the capture ends.

Currently, only bMinus is being captured, so to make the stream capture aware that bMinus interacts bPlus, we are going to record an event on the bMinus stream that we will then have the bPlus stream wait on. This process of "forking" bMinus into bPlus is crucial to ensuring that the whole series of events is captured accurately.

    // Fork into bPlus to make stream capture record bPlus activity
    cudaEventRecord(bMinusComplete, bMinus);
    cudaStreamWaitEvent(bPlus, bMinusComplete);

    // Start graph activities
    elementwiseProduct<<<gridSize, BLOCK_SIZE, 0, bMinus>>>(b, b, sol1, NUM_ELEMENTS);
    elementScalarProduct<<<gridSize, BLOCK_SIZE, 0, bPlus>>>(a, c, -4, sol2, NUM_ELEMENTS);

If we did not add these lines that establishes the dependency between bPlus and bMinus before the first kernel in our quadratic solver, the kernels on the bPlus stream would simply run normally when they are encountered instead of being captured as part of the graph.

Similarly, the stream capture also looks for the latest dependency bMinus has on bPlus so that it can stop recording bPlus activity as soon as possible, so to ensure all of bPlus's activities are captured, we have bMinus wait on bPlus's complete event before ending the stream capture.

    elementwiseQuotient<<<gridSize, BLOCK_SIZE, 0, bMinus>>>(sol1, a, 0.5, sol1, NUM_ELEMENTS);
    elementwiseQuotient<<<gridSize, BLOCK_SIZE, 0, bPlus>>>(sol2, a, 0.5, sol2, NUM_ELEMENTS);
    cudaEventRecord(bPlusComplete, bPlus);
    cudaStreamWaitEvent(bMinus, bPlusComplete);

    cudaStreamEndCapture(bMinus, &quadraticGraph);

This ensures that all of the tasks given to bPlus are captured within the stream. If we were to remove the beginning or ending event synchronization calls, then some of bPlus's tasks would not get captured and instead would just be run normally at the same time bMinus is being captured, meaning that the graph doesn't actually execute the entire workflow.

Creating a Graph Executable

After we have captured the workflow that our graph should contain, we can "compile" our graph template into an executable version via

cudaGraphInstantiate(&graphExecutable, quadraticGraph, NULL, NULL, 0);

As you will probably recognize, the first two parameters are our executable graph object and the graph template. The remaining three parameters are reserved for a graph error node, a diagnostic character buffer, and the size of that buffer. To keep this example simple, those elements have been omitted, but in a production environment it is always recommended to include them. You can find out more about them from the cudaGraphInstantiate documentation.

Launching a Graph

With our executable graph in hand, we can now launch this workflow a limitless amount of times. Executable graphs are not limited to running on the streams that were part of the initial capture. In fact, we are going to destroy the original bPlus and bMinus streams just to ensure that they no longer exist, and then we'll launch our graph on a brand new stream.

cudaStreamDestroy(bMinus);
cudaStreamDestroy(bPlus);

cudaStream_t newStream;
cudaStreamCreate(&newStream);

cudaGraphLaunch(graphExecutable, newStream);

Finally, we must clean up our graph objects just like a stream or any other CUDA type, using their respective destroy methods. Note that we synchronize the primary stream the graph is operating on before destroying the executable graph to ensure that the graph has completed its work first.

    cudaGraphDestroy(quadraticGraph);
    
    cudaStreamSynchronize(newStream);
    cudaGraphExecDestroy(graphExecutable);

You can check out the full example in the repository, otherwise this tutorial is complete.

Next: Creating graphs explicitly