Home - Tibalt/learn-cuda GitHub Wiki

Welcome to the learn-cuda wiki!

A typical pipeline in Jetson applications:

  1. copy an image from CPU (host) memory to GPU (device) memory;
  2. launch a kernel to process the image;
  3. copy the result from GPU back to CPU memory.
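
The three steps above can be sketched as follows; the kernel (`invert`) and the frame size are made up for illustration, not taken from main.cu:

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

// Hypothetical per-pixel kernel standing in for the real image processing.
__global__ void invert(unsigned char *img, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        img[i] = 255 - img[i];
}

int main(void)
{
    const int n = 1920 * 1080;                 // one grayscale frame (assumed size)
    unsigned char *h_img = (unsigned char *)malloc(n);
    unsigned char *d_img;
    cudaMalloc(&d_img, n);

    cudaMemcpy(d_img, h_img, n, cudaMemcpyHostToDevice);   // step 1: CPU -> GPU
    invert<<<(n + 255) / 256, 256>>>(d_img, n);            // step 2: kernel
    cudaMemcpy(h_img, d_img, n, cudaMemcpyDeviceToHost);   // step 3: GPU -> CPU
    cudaDeviceSynchronize();

    cudaFree(d_img);
    free(h_img);
    return 0;
}
```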

In a real application the pipeline runs continuously, so the goal is to improve performance and handle more camera images per second. We will apply tricks to the "original" code and test them on the Jetson Nano and Xavier. Three profiling tools are used: the old-fashioned way (clock_gettime), CUDA events, and nvprof. Since clock_gettime and CUDA events report essentially the same numbers, we only record one set of them; nvprof measures quite differently, so its results are analyzed in a separate part. Let's test main.cu. Build the binaries with `make` and `make pin`.
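The two wall-clock methods mentioned above can be combined in one program to confirm they agree; this is a minimal sketch, with the actual memcpy/kernel work elided:

```cuda
#include <cuda_runtime.h>
#include <time.h>
#include <stdio.h>

int main(void)
{
    struct timespec t0, t1;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    clock_gettime(CLOCK_MONOTONIC, &t0);  // old-fashioned CPU clock
    cudaEventRecord(start);               // CUDA event on the default stream

    // ... memcpy + kernel + memcpy would go here ...

    cudaEventRecord(stop);
    cudaEventSynchronize(stop);           // wait until all GPU work is done
    clock_gettime(CLOCK_MONOTONIC, &t1);

    float ev_ms;
    cudaEventElapsedTime(&ev_ms, start, stop);
    double wall_ms = (t1.tv_sec - t0.tv_sec) * 1e3 +
                     (t1.tv_nsec - t0.tv_nsec) / 1e6;
    printf("event: %.3f ms, clock_gettime: %.3f ms\n", ev_ms, wall_ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

Because the region being timed is synchronous, the two numbers should be nearly identical, which is why the tables below record only one set.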

Xavier (milliseconds)

| run | normal, 1 stream | normal, streamed | pinned, 1 stream | pinned, multiple streams |
|-----|------------------|------------------|------------------|--------------------------|
| 1   |                  |                  |                  |                          |
| 2   |                  |                  |                  |                          |
| 3   |                  |                  |                  |                          |
| 4   |                  |                  |                  |                          |
| 5   |                  |                  |                  |                          |
| 6   |                  |                  |                  |                          |
| 7   |                  |                  |                  |                          |
| 8   |                  |                  |                  |                          |
| 9   |                  |                  |                  |                          |
| 10  |                  |                  |                  |                          |

Nano (milliseconds)

| run | normal, multiple streams | normal, 1 stream | pinned, multiple streams | pinned, 1 stream |
|-----|--------------------------|------------------|--------------------------|------------------|
| 1   | 1467                     | 2318             | 438                      | 1107             |
| 2   | 1386                     | 3049             | 454                      | 774              |
| 3   | 1399                     | 2408             | 387                      | 886              |
| 4   | 1337                     | 2475             | 447                      | 873              |
| 5   | 1675                     | 2696             | 450                      | 852              |
| 6   | 1481                     | 2590             | 426                      | 942              |
| 7   | 1557                     | 2389             | 507                      | 987              |
| 8   | 1630                     | 3120             | 418                      | 775              |
| 9   | 1466                     | 2492             | 441                      | 890              |
| 10  | 1682                     | 2658             | 417                      | 784              |

Let's check the profile result:

(nvprof screenshots: pinned vs. non-pinned profiles)

As you can see, the top three entries in the pinned profile take about 1/3 of the time of the corresponding non-pinned ones. The improvement from stream parallelism is not counted in this comparison.
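The pinned build differs from the normal one only in how the host buffer is allocated; a minimal sketch (buffer size is assumed):

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

int main(void)
{
    const size_t n = 1920 * 1080;

    // Normal build: pageable memory; the driver must stage each copy
    // through an internal pinned buffer, which slows down memcpy.
    unsigned char *pageable = (unsigned char *)malloc(n);

    // "pin" build: page-locked memory; the DMA engine can transfer
    // directly, which is why the memcpy entries shrink in the profile.
    unsigned char *pinned;
    cudaMallocHost(&pinned, n);

    // ... the same memcpy/kernel pipeline runs on either buffer ...

    cudaFreeHost(pinned);
    free(pageable);
    return 0;
}
```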

Another comparison: (nvprof screenshots: 1 stream vs. multiple streams)

As you can see, the binary that uses multiple streams is much faster in both the memcpy and the kernel computation entries.
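The streamed variant splits the frame into chunks so that copies and kernel work from different streams can overlap; a sketch assuming a hypothetical `process` kernel, 4 streams, and a made-up frame size:

```cuda
#include <cuda_runtime.h>

// Placeholder per-pixel kernel.
__global__ void process(unsigned char *p, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        p[i] = 255 - p[i];
}

int main(void)
{
    const int n = 1920 * 1080, nstreams = 4, chunk = n / nstreams;
    unsigned char *h, *d;
    cudaMallocHost(&h, n);   // pinned host memory: required for real async overlap
    cudaMalloc(&d, n);

    cudaStream_t s[nstreams];
    for (int i = 0; i < nstreams; i++)
        cudaStreamCreate(&s[i]);

    // Each stream copies in, processes, and copies out its own chunk;
    // copies in one stream overlap with kernels in another.
    for (int i = 0; i < nstreams; i++) {
        int off = i * chunk;
        cudaMemcpyAsync(d + off, h + off, chunk,
                        cudaMemcpyHostToDevice, s[i]);
        process<<<(chunk + 255) / 256, 256, 0, s[i]>>>(d + off, chunk);
        cudaMemcpyAsync(h + off, d + off, chunk,
                        cudaMemcpyDeviceToHost, s[i]);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < nstreams; i++)
        cudaStreamDestroy(s[i]);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```

This overlap is what nvprof shows as the streamed binary spending less total time on memcpy and computation.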