Threadblock sizing, occupancy, ILP and TLP - OrangeOwlSolutions/General-CUDA-programming GitHub Wiki

Latency, throughput and occupancy

Latency: the time required to complete an operation

Typical latencies are ≈20 cycles for an arithmetic instruction and 400+ cycles for a global memory access

Throughput: number of operations that complete per unit time

Occupancy: ratio of active warps to the maximum number of warps supported on a streaming multiprocessor
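The occupancy achieved by a particular kernel and block size can be queried at runtime with the CUDA occupancy API. Below is a minimal sketch: the `saxpy` kernel and the block size of 256 are illustrative choices, not values taken from this page.

```cuda
#include <cstdio>

// Illustrative kernel: y = a*x + y
__global__ void saxpy(float a, const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main()
{
    const int blockSize = 256;   // assumed threadblock size

    // Resident blocks per SM for this kernel at this block size
    int numBlocks;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, saxpy, blockSize, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // occupancy = active warps / maximum warps supported on an SM
    int activeWarps = numBlocks * blockSize / prop.warpSize;
    int maxWarps    = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("Occupancy: %.0f%%\n", 100.0 * activeWarps / maxWarps);

    return 0;
}
```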

Giving the hardware enough work to do, through a careful choice of the threadblock size, helps hide latency and improve throughput. Typically, this means maximizing occupancy.
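Since CUDA 6.5, the runtime can suggest a block size that maximizes occupancy for a given kernel via `cudaOccupancyMaxPotentialBlockSize`. A sketch of this approach, assuming a kernel `saxpy` and device arrays `d_x`, `d_y` of `n` elements already allocated:

```cuda
// Let the runtime suggest an occupancy-maximizing block size for saxpy
int minGridSize;   // minimum grid size needed to fully occupy the device
int blockSize;     // suggested threadblock size
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, saxpy, 0, 0);

// Round the grid size up so that all n elements are covered
int gridSize = (n + blockSize - 1) / blockSize;
saxpy<<<gridSize, blockSize>>>(2.0f, d_x, d_y, n);
```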

Pizza delivery example

Suppose that you own a pizzeria and are its only employee. In GPU terms, the pizzeria is served by a single thread (yourself). This single pizzeria thread performs the following operations:

  1. he receives the order;
  2. he prepares the pizza;
  3. he delivers the pizza;
  4. he is ready to receive the next order.
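The steps above map onto thread-level parallelism (TLP): instead of one employee serving orders sequentially, a kernel launches many threads, one per order, and the SM hides the long wait on each order behind work from other warps. A minimal sketch, with the `deliver` kernel and its buffers being illustrative names:

```cuda
// Each thread is one "employee" handling exactly one order.
__global__ void deliver(const int *orders, int *pizzas, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        pizzas[i] = orders[i] + 1;   // "prepare" the pizza: a cheap op fed by a slow load

    // While one warp waits 400+ cycles on its load of orders[i],
    // the SM's scheduler issues instructions from other resident warps,
    // so the memory latency is hidden as long as enough warps are active.
}
```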