Threadblock sizing, occupancy, ILP and TLP
Latency, throughput and occupancy
- Latency: the time required to complete an operation. Typical latencies are ≈20 cycles for arithmetic and 400+ cycles for memory.
- Throughput: the number of operations that complete per unit time.
- Occupancy: the ratio of active warps to the maximum number of warps supported on a streaming multiprocessor.
Giving the hardware a proper amount of work to do, through a careful choice of the threadblock size, helps hide latency and improve throughput. Typically, this requires maximizing the occupancy.
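As a practical sketch of this idea, the CUDA occupancy API can suggest a block size and report the resulting theoretical occupancy. The kernel below (`saxpyKernel`) is only a placeholder used for the query; any `__global__` function of your own could take its place.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel: occupancy depends on its register and shared memory usage.
__global__ void saxpyKernel(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Ask the runtime for a block size that maximizes occupancy for this kernel.
    int minGridSize = 0, blockSize = 0;
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, saxpyKernel, 0, 0);

    // Number of blocks of that size that can be resident on one SM.
    int numBlocks = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, saxpyKernel, blockSize, 0);

    // Occupancy = active warps / maximum warps supported by one SM.
    int activeWarps = numBlocks * (blockSize / prop.warpSize);
    int maxWarps    = prop.maxThreadsPerMultiProcessor / prop.warpSize;

    printf("Suggested block size: %d\n", blockSize);
    printf("Theoretical occupancy: %.1f%%\n", 100.0 * activeWarps / maxWarps);
    return 0;
}
```

The suggested block size is only a starting point: the block size that maximizes occupancy is not always the one that maximizes performance, as discussed in the following.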
Pizza delivery example
Suppose that you own, and are the only employee of, a pizzeria. In GPU terms, the pizzeria is served by a single thread (yourself). The single pizzeria thread performs the following operations:
1. he receives the order;
2. he prepares the pizza;
3. he delivers the pizza;
4. he is ready to receive the next order.

![Pizza_order](http://www.crustandcrumble.com/s/cc_images/cache_946189525.jpg?t=1447296741)