Performance Experiment: Graphs vs Streams vs Synchronous Kernels
In this experiment, we benchmark synchronous kernels, streams, and graphs on our quadratic equation solver workflow (introduced here) to get an idea of how much performance asynchronous workflows like streams and graphs can gain over a fully synchronous GPU workflow. In the end we will be comparing four configurations: a stream-based workflow synchronized with events (found to be the most efficient form of stream-based execution in an earlier experiment), a graph created using stream capture, a graph created explicitly, and a fully synchronous workflow as a control.
To do this, we will start by copying the graph-building code for each type of graph into two separate functions, each of which accepts parameters specifying the memory locations to use and the size of the problem. The first is the explicit graph built in this tutorial.
cudaGraph_t buildQuadraticExpGraph(float *a, float *b, float *c, float *sol1, float *sol2, float *tmp, int numElements, int blockSize) {
    int gridSize = numElements / blockSize + 1;

    cudaGraph_t quadraticGraph;
    cudaGraphCreate(&quadraticGraph, 0);

    // Build the nodes that are part of the graph
    cudaGraphNode_t bSquaredNode;
    cudaKernelNodeParams bSquaredParams = {0};
    // ...

    // Complete the graph and return it
    cudaGraphAddKernelNode(&bMinusQuotientNode, quadraticGraph, nodeDependencies.data(), 1, &bMinusQuotientParams);

    return quadraticGraph;
}
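The node-building code is elided above because it is covered in the explicit graph tutorial, but as a rough reminder, configuring the first node might look something like the sketch below. This is not the tutorial's exact code: the argument array name bSquaredArgs is made up here, and we only assume that elementwiseProduct takes two input arrays, an output array, and an element count, as its launch in the capture version suggests.

// Rough sketch (not the tutorial's exact code) of one elided node setup:
// compute b*b into sol1 using the elementwiseProduct kernel.
void *bSquaredArgs[4] = { &b, &b, &sol1, &numElements };

bSquaredParams.func = (void *)elementwiseProduct;
bSquaredParams.gridDim = dim3(gridSize, 1, 1);
bSquaredParams.blockDim = dim3(blockSize, 1, 1);
bSquaredParams.sharedMemBytes = 0;
bSquaredParams.kernelParams = bSquaredArgs;
bSquaredParams.extra = NULL;

// The first node has no dependencies, so the dependency list is empty
cudaGraphAddKernelNode(&bSquaredNode, quadraticGraph, NULL, 0, &bSquaredParams);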
In the stream capture version, we have to create and destroy the streams and events used by the capture inside the function, but otherwise it is the same code as in the stream capture graph tutorial.
cudaGraph_t buildQuadraticCapGraph(float *a, float *b, float *c, float *sol1, float *sol2, float *tmp, int numElements, int blockSize) {
    int gridSize = numElements / blockSize + 1;

    cudaStream_t bMinus;
    cudaStream_t bPlus;
    cudaStreamCreate(&bMinus);
    cudaStreamCreate(&bPlus);

    cudaEvent_t bPlusComplete;
    cudaEvent_t bMinusComplete;
    cudaEventCreate(&bPlusComplete);
    cudaEventCreate(&bMinusComplete);

    // cudaStreamEndCapture allocates the captured graph for us
    cudaGraph_t quadraticGraph;

    cudaStreamBeginCapture(bMinus, cudaStreamCaptureModeGlobal);

    // Fork into bPlus to make stream capture record bPlus activity
    cudaEventRecord(bMinusComplete, bMinus);
    cudaStreamWaitEvent(bPlus, bMinusComplete);

    // Start graph activities
    elementwiseProduct<<<gridSize, blockSize, 0, bMinus>>>(b, b, sol1, numElements);

    // Continue recording kernels...

    // Join the bPlus stream back into bMinus
    cudaEventRecord(bPlusComplete, bPlus);
    cudaStreamWaitEvent(bMinus, bPlusComplete);

    cudaStreamEndCapture(bMinus, &quadraticGraph);

    cudaEventDestroy(bPlusComplete);
    cudaEventDestroy(bMinusComplete);
    cudaStreamDestroy(bMinus);
    cudaStreamDestroy(bPlus);

    return quadraticGraph;
}
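Before timing, it can be reassuring to confirm that the two builders produce graphs of comparable size. The helper below is not part of the original program, just an optional sanity check; note that the captured graph may contain a few extra empty nodes from the fork/join pattern, so the two counts need not match exactly.

// Optional sanity check (not in the original program): print how many nodes a graph contains.
void printNodeCount(const char *label, cudaGraph_t graph) {
    size_t numNodes = 0;
    cudaGraphGetNodes(graph, NULL, &numNodes);   // passing NULL for the node array just queries the count
    printf("%s: %zu nodes\n", label, numNodes);
}

Calling this on the explicit and captured graphs right after they are built in the driver below gives a quick structural comparison before any timing is done.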
Next, let's build a simple function to time our graphs. It accepts the graph that should be instantiated and launched, along with the number of times we want to launch the graph for each recorded measurement.
float timeGraph(cudaGraph_t graph, int iterations) {
    cudaEvent_t clockStart, clockStop;
    cudaEventCreate(&clockStart);
    cudaEventCreate(&clockStop);

    // Instantiate an executable version of the graph before the clock starts
    cudaGraphExec_t executable;
    cudaGraphInstantiate(&executable, graph, NULL, NULL, 0);

    cudaEventRecord(clockStart);

    for (int i = 0; i < iterations; i++)
        cudaGraphLaunch(executable, 0);

    cudaEventRecord(clockStop);
    cudaEventSynchronize(clockStop);

    float timeElapsed;
    cudaEventElapsedTime(&timeElapsed, clockStart, clockStop);

    // Clean up the executable graph and timing events
    cudaGraphExecDestroy(executable);
    cudaEventDestroy(clockStart);
    cudaEventDestroy(clockStop);

    return timeElapsed;
}
Finally, we need two functions that run and time the stream-based and fully synchronous versions of the quadratic solver. We won't show them here because of their length; the code is nearly identical to the original event-based execution code, minus the events and streams in the synchronous case (a rough sketch of the synchronous version follows the driver code below). Then we just need a simple driver to run the timing tests and print the results.
const int NUM_ELEMENTS = 64;
const int BLOCK_SIZE = 32;
const int NUM_TRIALS = 20;
const int ITERATIONS = 131072;

int gridSize = NUM_ELEMENTS / BLOCK_SIZE + 1;

float *a, *b, *c, *sol1, *sol2, *tmp;
cudaMalloc(&a, sizeof(float)*NUM_ELEMENTS);
cudaMalloc(&b, sizeof(float)*NUM_ELEMENTS);
cudaMalloc(&c, sizeof(float)*NUM_ELEMENTS);
cudaMalloc(&sol1, sizeof(float)*NUM_ELEMENTS);
cudaMalloc(&sol2, sizeof(float)*NUM_ELEMENTS);
cudaMalloc(&tmp, sizeof(float)*NUM_ELEMENTS);

fillArray<<<gridSize, BLOCK_SIZE>>>(a, NUM_ELEMENTS);
fillArray<<<gridSize, BLOCK_SIZE>>>(b, NUM_ELEMENTS);
fillArray<<<gridSize, BLOCK_SIZE>>>(c, NUM_ELEMENTS);

cudaGraph_t expGraph = buildQuadraticExpGraph(a, b, c, sol1, sol2, tmp, NUM_ELEMENTS, BLOCK_SIZE);
cudaGraph_t capGraph = buildQuadraticCapGraph(a, b, c, sol1, sol2, tmp, NUM_ELEMENTS, BLOCK_SIZE);

printf("Synchronous,Stream,Captured Graph,Explicit Graph\n");
for (int i = 0; i < NUM_TRIALS; i++) {
    float syncTime = synchronousQuadratic(a, b, c, sol1, sol2, tmp, NUM_ELEMENTS, BLOCK_SIZE, ITERATIONS);
    printf("%.4f,", syncTime);

    float streamTime = quadraticUsingEvents(a, b, c, sol1, sol2, tmp, NUM_ELEMENTS, BLOCK_SIZE, ITERATIONS);
    printf("%.4f,", streamTime);

    float capGraphTime = timeGraph(capGraph, ITERATIONS);
    printf("%.4f,", capGraphTime);

    float expGraphTime = timeGraph(expGraph, ITERATIONS);
    printf("%.4f\n", expGraphTime);
}

cudaGraphDestroy(expGraph);
cudaGraphDestroy(capGraph);

cudaFree(a);
cudaFree(b);
cudaFree(c);
cudaFree(tmp);
cudaFree(sol1);
cudaFree(sol2);
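For reference, the synchronous timing function might look roughly like the sketch below. This is not the exact code from the program: the kernel sequence is abbreviated, and only the signature (taken from the driver above) and the event-based timing pattern are assumed.

// Rough sketch of the synchronous timing function; the full kernel sequence
// of the quadratic solver is abbreviated here.
float synchronousQuadratic(float *a, float *b, float *c, float *sol1, float *sol2,
                           float *tmp, int numElements, int blockSize, int iterations) {
    int gridSize = numElements / blockSize + 1;

    cudaEvent_t clockStart, clockStop;
    cudaEventCreate(&clockStart);
    cudaEventCreate(&clockStop);

    cudaEventRecord(clockStart);

    for (int i = 0; i < iterations; i++) {
        // Every step runs in the default stream, one kernel after another
        elementwiseProduct<<<gridSize, blockSize>>>(b, b, sol1, numElements);   // b^2
        // ...remaining solver kernels, launched back to back...
    }

    cudaEventRecord(clockStop);
    cudaEventSynchronize(clockStop);

    float timeElapsed;
    cudaEventElapsedTime(&timeElapsed, clockStart, clockStop);

    cudaEventDestroy(clockStart);
    cudaEventDestroy(clockStop);

    return timeElapsed;
}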
The full program is available here in case you need a refresher on any of the parts that were skipped over in the interest of brevity.
Analysis
As we have done for all of the previous experiments, we want to visualize the data rather than discuss raw numbers. For this experiment a box plot once again seemed like the right choice, both because the timing results for each type of execution show some variance and because, for the most part, the results for the different types are hard to compare on a single shared axis.
After writing this script to process the experiment results and produce the image above, we can start discussing what might be happening in this experiment.
Insights
First, let's acknowledge the obvious: graphs came out on top, regardless of which way they were created. The stream-captured version of the graph seems to give a more consistent execution time, which would make sense: when using stream capture, the graph is computed automatically from the recorded stream activity, so it is likely better optimized than the hand-built version.
In second place (or possibly third, depending on how you count it), we have synchronous execution. How was it able to come in second? After all, it had to complete all of its tasks sequentially, while the other approaches have the benefit of concurrency to reduce their completion times.
The answer likely lies in the same place as the motivation for using graphs: kernel launch and synchronization overheads. Every synchronous kernel is launched separately, so its per-launch overhead is higher than with graphs, yet the synchronous version still beats the multi-stream version because the stream version requires even more API calls (event records and waits) to execute the algorithm correctly. Without a graph to mitigate those overheads, they quickly add up, and the event-synchronized version ends up costing more than simply doing everything synchronously.