Welcome to the learn-cuda wiki!
A typical pipeline in Jetson applications (see the sketch after this list):
- copy an image from CPU (host) memory to GPU (device) memory;
- run a kernel to process the image;
- copy the processed image back from the GPU to the CPU.
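A minimal sketch of this baseline pipeline, assuming a simple grayscale image and a hypothetical `process_image` kernel (the image size and kernel are illustrative, not taken from main.cu):

```cuda
#include <cuda_runtime.h>
#include <cstdlib>

// Hypothetical kernel: invert an 8-bit grayscale image.
__global__ void process_image(unsigned char *img, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) img[i] = 255 - img[i];
}

int main() {
    const int n = 1920 * 1080;                           // assumed image size
    unsigned char *h_img = (unsigned char *)malloc(n);   // pageable host buffer
    unsigned char *d_img;
    cudaMalloc(&d_img, n);

    cudaMemcpy(d_img, h_img, n, cudaMemcpyHostToDevice); // CPU -> GPU
    process_image<<<(n + 255) / 256, 256>>>(d_img, n);   // kernel
    cudaMemcpy(h_img, d_img, n, cudaMemcpyDeviceToHost); // GPU -> CPU

    cudaFree(d_img);
    free(h_img);
    return 0;
}
```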
In a real application the pipeline runs continuously, and we will try to improve its throughput so it can handle more images per second from the camera. Several tricks will be applied to the "original" code and tested on a Jetson Nano and a Xavier. The old-fashioned way (clock_gettime), CUDA events, and nvprof are used as the profiling tools. Since clock_gettime and CUDA events report essentially the same numbers, we only record one set of them; nvprof output is quite different, so its results are analyzed in a separate part. Let's test main.cu: build the binaries with `make` and `make pin`.
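A sketch of the two host-side timing approaches mentioned above, measuring the same piece of GPU work both ways (the buffer size and the work being timed are assumptions):

```cuda
#include <cuda_runtime.h>
#include <ctime>
#include <cstdio>
#include <cstdlib>

int main() {
    const int n = 1 << 24;                 // assumed buffer size (16 MB)
    unsigned char *h_buf = (unsigned char *)malloc(n);
    unsigned char *d_buf;
    cudaMalloc(&d_buf, n);

    struct timespec t0, t1;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    cudaEventRecord(start);
    cudaMemcpy(d_buf, h_buf, n, cudaMemcpyHostToDevice);  // the work being timed
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);            // wait for the GPU side to finish
    clock_gettime(CLOCK_MONOTONIC, &t1);

    float ev_ms = 0.0f;
    cudaEventElapsedTime(&ev_ms, start, stop);
    double cg_ms = (t1.tv_sec - t0.tv_sec) * 1e3 +
                   (t1.tv_nsec - t0.tv_nsec) / 1e6;
    // The two numbers should agree closely, which is why only one set is recorded.
    printf("clock_gettime: %.3f ms, CUDA events: %.3f ms\n", cg_ms, ev_ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_buf);
    free(h_buf);
    return 0;
}
```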
Xavier (milliseconds)

Run | normal 1 stream | normal streamed | pinned 1 stream | pinned multiple streams |
---|---|---|---|---|
1 | | | | |
2 | | | | |
3 | | | | |
4 | | | | |
5 | | | | |
6 | | | | |
7 | | | | |
8 | | | | |
9 | | | | |
10 | | | | |
Nano (milliseconds)

Run | normal multiple streams | normal 1 stream | pinned multiple streams | pinned 1 stream |
---|---|---|---|---|
1 | 1467 | 2318 | 438 | 1107 |
2 | 1386 | 3049 | 454 | 774 |
3 | 1399 | 2408 | 387 | 886 |
4 | 1337 | 2475 | 447 | 873 |
5 | 1675 | 2696 | 450 | 852 |
6 | 1481 | 2590 | 426 | 942 |
7 | 1557 | 2389 | 507 | 987 |
8 | 1630 | 3120 | 418 | 775 |
9 | 1466 | 2492 | 441 | 890 |
10 | 1682 | 2658 | 417 | 784 |
Let's check the profiling result:
As you can see, the times for the first three pinned items are about one third of the non-pinned ones. This does not yet count the improvement from running streams in parallel.
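The pinned variant differs mainly in how the host buffer is allocated; a minimal sketch, with the image size assumed as before:

```cuda
#include <cuda_runtime.h>

int main() {
    const int n = 1920 * 1080;     // assumed image size
    unsigned char *h_img, *d_img;

    // Pinned (page-locked) host allocation: the copy engine can DMA from it
    // directly, which is why the memcpy time drops so sharply.
    cudaMallocHost(&h_img, n);     // instead of malloc(n)
    cudaMalloc(&d_img, n);

    cudaMemcpy(d_img, h_img, n, cudaMemcpyHostToDevice);
    cudaMemcpy(h_img, d_img, n, cudaMemcpyDeviceToHost);

    cudaFree(d_img);
    cudaFreeHost(h_img);           // instead of free(h_img)
    return 0;
}
```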
Another comparison:
As you can see, the binary that uses streams is much faster in both the memcpy and the kernel computation.
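A sketch of a multi-stream version, assuming the image can be split into independent chunks (the stream count, chunk layout, and `process_image` kernel are assumptions for illustration):

```cuda
#include <cuda_runtime.h>

__global__ void process_image(unsigned char *img, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) img[i] = 255 - img[i];
}

int main() {
    const int n = 1920 * 1080;            // assumed image size
    const int nstreams = 4;               // assumed stream count
    const int chunk = n / nstreams;
    unsigned char *h_img, *d_img;
    cudaMallocHost(&h_img, n);            // pinned memory is required for
    cudaMalloc(&d_img, n);                // truly asynchronous copies

    cudaStream_t streams[nstreams];
    for (int s = 0; s < nstreams; ++s) cudaStreamCreate(&streams[s]);

    // Each stream copies, processes, and copies back its own chunk, so the
    // copies in one stream can overlap with the kernel in another.
    for (int s = 0; s < nstreams; ++s) {
        int off = s * chunk;
        cudaMemcpyAsync(d_img + off, h_img + off, chunk,
                        cudaMemcpyHostToDevice, streams[s]);
        process_image<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_img + off, chunk);
        cudaMemcpyAsync(h_img + off, d_img + off, chunk,
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < nstreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d_img);
    cudaFreeHost(h_img);
    return 0;
}
```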