Home - Tibalt/learn-cuda GitHub Wiki

Welcome to the learn-cuda wiki!

A typical pipeline in Jetson applications:

  1. copy an image from CPU (host) memory to GPU (device) memory;
  2. launch a kernel to process the image;
  3. copy the result from GPU back to CPU memory.
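
The three steps above can be sketched as follows; the kernel (`invert`) and the frame size are made up for illustration, not taken from main.cu:

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

// Hypothetical per-pixel kernel standing in for the real image processing.
__global__ void invert(unsigned char *img, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        img[i] = 255 - img[i];
}

int main(void)
{
    const int n = 1920 * 1080;                 // one grayscale frame (assumed size)
    unsigned char *h_img = (unsigned char *)malloc(n);
    unsigned char *d_img;
    cudaMalloc(&d_img, n);

    cudaMemcpy(d_img, h_img, n, cudaMemcpyHostToDevice);   // step 1: CPU -> GPU
    invert<<<(n + 255) / 256, 256>>>(d_img, n);            // step 2: kernel
    cudaMemcpy(h_img, d_img, n, cudaMemcpyDeviceToHost);   // step 3: GPU -> CPU
    cudaDeviceSynchronize();

    cudaFree(d_img);
    free(h_img);
    return 0;
}
```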

In a real application the pipeline runs continuously, so the goal is to improve performance and handle more camera images per second. We will apply tricks to the "original" code and test them on the Jetson Nano and Xavier. Three profiling tools are used: the old-fashioned way (clock_gettime), CUDA events, and nvprof. Since clock_gettime and CUDA events report essentially the same numbers, we only record one set of them; nvprof measures quite differently, so its results are analyzed in a separate part. Let's test main.cu. Build the binaries with `make` and `make pin`.
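The two wall-clock methods mentioned above can be combined in one program to confirm they agree; this is a minimal sketch, with the actual memcpy/kernel work elided:

```cuda
#include <cuda_runtime.h>
#include <time.h>
#include <stdio.h>

int main(void)
{
    struct timespec t0, t1;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    clock_gettime(CLOCK_MONOTONIC, &t0);  // old-fashioned CPU clock
    cudaEventRecord(start);               // CUDA event on the default stream

    // ... memcpy + kernel + memcpy would go here ...

    cudaEventRecord(stop);
    cudaEventSynchronize(stop);           // wait until all GPU work is done
    clock_gettime(CLOCK_MONOTONIC, &t1);

    float ev_ms;
    cudaEventElapsedTime(&ev_ms, start, stop);
    double wall_ms = (t1.tv_sec - t0.tv_sec) * 1e3 +
                     (t1.tv_nsec - t0.tv_nsec) / 1e6;
    printf("event: %.3f ms, clock_gettime: %.3f ms\n", ev_ms, wall_ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

Because the region being timed is synchronous, the two numbers should be nearly identical, which is why the tables below record only one set.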

Xavier (milliseconds)

| run | normal, 1 stream | normal, streamed | pinned, 1 stream | pinned, multiple streams |
|-----|------------------|------------------|------------------|--------------------------|
| 1   |                  |                  |                  |                          |
| 2   |                  |                  |                  |                          |
| 3   |                  |                  |                  |                          |
| 4   |                  |                  |                  |                          |
| 5   |                  |                  |                  |                          |
| 6   |                  |                  |                  |                          |
| 7   |                  |                  |                  |                          |
| 8   |                  |                  |                  |                          |
| 9   |                  |                  |                  |                          |
| 10  |                  |                  |                  |                          |

Nano (milliseconds)

| run | normal, multiple streams | normal, 1 stream | pinned, multiple streams | pinned, 1 stream |
|-----|--------------------------|------------------|--------------------------|------------------|
| 1   | 1467                     | 2318             | 438                      | 1107             |
| 2   | 1386                     | 3049             | 454                      | 774              |
| 3   | 1399                     | 2408             | 387                      | 886              |
| 4   | 1337                     | 2475             | 447                      | 873              |
| 5   | 1675                     | 2696             | 450                      | 852              |
| 6   | 1481                     | 2590             | 426                      | 942              |
| 7   | 1557                     | 2389             | 507                      | 987              |
| 8   | 1630                     | 3120             | 418                      | 775              |
| 9   | 1466                     | 2492             | 441                      | 890              |
| 10  | 1682                     | 2658             | 417                      | 784              |

Let's check the profile result:

(nvprof screenshots: pinned vs. non-pinned profiles)

As you can see, the top three entries in the pinned profile take about 1/3 of the time of the corresponding non-pinned ones. The improvement from stream parallelism is not counted in this comparison.
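The pinned build differs from the normal one only in how the host buffer is allocated; a minimal sketch (buffer size is assumed):

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

int main(void)
{
    const size_t n = 1920 * 1080;

    // Normal build: pageable memory; the driver must stage each copy
    // through an internal pinned buffer, which slows down memcpy.
    unsigned char *pageable = (unsigned char *)malloc(n);

    // "pin" build: page-locked memory; the DMA engine can transfer
    // directly, which is why the memcpy entries shrink in the profile.
    unsigned char *pinned;
    cudaMallocHost(&pinned, n);

    // ... the same memcpy/kernel pipeline runs on either buffer ...

    cudaFreeHost(pinned);
    free(pageable);
    return 0;
}
```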

Another comparison: (nvprof screenshots: 1 stream vs. multiple streams)

As you can see, the binary that uses multiple streams is much faster in both the memcpy and the kernel computation entries.
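The streamed variant splits the frame into chunks so that copies and kernel work from different streams can overlap; a sketch assuming a hypothetical `process` kernel, 4 streams, and a made-up frame size:

```cuda
#include <cuda_runtime.h>

// Placeholder per-pixel kernel.
__global__ void process(unsigned char *p, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        p[i] = 255 - p[i];
}

int main(void)
{
    const int n = 1920 * 1080, nstreams = 4, chunk = n / nstreams;
    unsigned char *h, *d;
    cudaMallocHost(&h, n);   // pinned host memory: required for real async overlap
    cudaMalloc(&d, n);

    cudaStream_t s[nstreams];
    for (int i = 0; i < nstreams; i++)
        cudaStreamCreate(&s[i]);

    // Each stream copies in, processes, and copies out its own chunk;
    // copies in one stream overlap with kernels in another.
    for (int i = 0; i < nstreams; i++) {
        int off = i * chunk;
        cudaMemcpyAsync(d + off, h + off, chunk,
                        cudaMemcpyHostToDevice, s[i]);
        process<<<(chunk + 255) / 256, 256, 0, s[i]>>>(d + off, chunk);
        cudaMemcpyAsync(h + off, d + off, chunk,
                        cudaMemcpyDeviceToHost, s[i]);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < nstreams; i++)
        cudaStreamDestroy(s[i]);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```

This overlap is what nvprof shows as the streamed binary spending less total time on memcpy and computation.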