Performance Analysis - yszheda/wiki GitHub Wiki
Roofline Performance Model
- https://crd.lbl.gov/departments/computer-science/par/research/roofline/introduction/
- [One Paper a Week][ISCA] In-Datacenter Performance Analysis of a Tensor Processing Unit
- Accelerating HPC Applications with NVIDIA Nsight Compute Roofline Analysis
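The roofline model bounds attainable performance by the lower of peak compute throughput and arithmetic intensity times peak memory bandwidth. A minimal sketch of that bound follows; the function names and the machine numbers (1000 GFLOP/s, 100 GB/s) are hypothetical, chosen only for illustration.

```python
def attainable_gflops(peak_gflops: float, peak_bandwidth_gbs: float,
                      arithmetic_intensity: float) -> float:
    """Roofline bound: min(peak compute, arithmetic intensity * peak bandwidth).

    arithmetic_intensity is in FLOPs per byte moved to/from memory.
    """
    return min(peak_gflops, arithmetic_intensity * peak_bandwidth_gbs)


def ridge_point(peak_gflops: float, peak_bandwidth_gbs: float) -> float:
    """Arithmetic intensity at which a kernel stops being memory-bound."""
    return peak_gflops / peak_bandwidth_gbs


# Hypothetical machine, not a real device.
PEAK_GFLOPS = 1000.0
PEAK_BW_GBS = 100.0

if __name__ == "__main__":
    ridge = ridge_point(PEAK_GFLOPS, PEAK_BW_GBS)
    for ai in (0.25, 1.0, 10.0, 50.0):
        bound = attainable_gflops(PEAK_GFLOPS, PEAK_BW_GBS, ai)
        regime = "memory-bound" if ai < ridge else "compute-bound"
        print(f"AI={ai:6.2f} FLOP/byte -> at most {bound:7.1f} GFLOP/s ({regime})")
```

Kernels to the left of the ridge point benefit from reducing memory traffic; kernels to the right benefit from better use of the compute units.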
Profiling
- https://en.wikipedia.org/wiki/Profiling_(computer_programming)
- https://www.jetbrains.com/help/profiler/Profiling_Guidelines__Choosing_the_Right_Profiling_Mode.html
Profilers
- https://software.intel.com/content/www/us/en/develop/articles/intel-performance-counter-monitor.html
Latency
- Everything You Know About Latency Is Wrong
- Why Averages Suck and Percentiles are Great
- Who moved my 99th percentile latency?
- https://stackoverflow.com/questions/12808934/what-is-p99-latency
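The articles above argue that averages hide tail latency while percentiles expose it. A small sketch of that effect, using a nearest-rank percentile; the `percentile` helper and the sample data (98 fast requests, 2 slow ones) are made up for illustration.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies in milliseconds: 98 fast requests and 2 slow ones.
latencies_ms = [10.0] * 98 + [1000.0] * 2

if __name__ == "__main__":
    print("mean:", statistics.mean(latencies_ms))
    print("p50 :", percentile(latencies_ms, 50))
    print("p99 :", percentile(latencies_ms, 99))
```

In this data set the mean sits between the two populations and describes neither, while p50 and p99 report what a typical request and a tail request actually experienced.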
Clock Cycle
TMA (Top-down Microarchitecture Analysis) Method
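- Pipeline Slots
- The CPU frontend/backend split is usually drawn at the micro-op queue: the micro-op queue and everything before it form the frontend, and everything after it forms the backend. The bandwidth out of the micro-op queue is the CPU's issue width.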
- Frontend Latency
- ICache Misses
- ITLB Misses
- Branch Resteers
- DSB Switches
- LCP
- MS Switches
- Frontend Latency
- Decoded Stream Buffer (DSB): stores uops that have already been decoded
- Length Changing Prefixes (LCP)
- Microcode Sequencer (MS)
- Frontend Bandwidth
- Micro-instruction Translation Engine (MITE)
- Decoded Stream Buffer (DSB)
- Loop Stream Detector (LSD)
- Frontend Bandwidth
- Branch Mispredicts
- Machine Clears: e.g. memory ordering violations, self-modifying code, loads from illegal address ranges
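- Modern x86 CPUs internally translate CISC instructions into RISC-like micro-ops (uops) for execution. Some complex CISC instructions, however, are not turned into uops by the frontend decoders; their uops are generated by the Microcode Sequencer (MS) unit instead.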
- Base
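- Microcode Sequencer: the MS handles CISC instructions that the default decoders do not support, such as instructions that repeat a move over a string and the CPUID instruction; the uops for these instructions are all generated by the MS.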
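The pipeline-slot accounting above can be sketched as a level-1 top-down breakdown: total slots are issue width times cycles, and each slot is attributed to one of four categories. This is only an illustration of the commonly presented level-1 formulas. The parameter names are hypothetical stand-ins for hardware counter values, the default width of 4 is an assumption that varies by microarchitecture, and the sample counts are invented.

```python
def tma_level1(cycles, uops_not_delivered, uops_issued,
               uops_retired_slots, recovery_cycles, width=4):
    """Split pipeline slots into the four top-level TMA categories.

    All arguments are raw counts over the same measurement interval.
    width is the assumed issue width (slots per cycle); 4 is an
    assumption, not a property of every CPU.
    """
    slots = width * cycles
    frontend_bound = uops_not_delivered / slots
    bad_speculation = (uops_issued - uops_retired_slots
                       + width * recovery_cycles) / slots
    retiring = uops_retired_slots / slots
    # Whatever is left over is attributed to the backend.
    backend_bound = 1.0 - (frontend_bound + bad_speculation + retiring)
    return {
        "frontend_bound": frontend_bound,
        "bad_speculation": bad_speculation,
        "retiring": retiring,
        "backend_bound": backend_bound,
    }


if __name__ == "__main__":
    # Invented counter values, for illustration only.
    breakdown = tma_level1(cycles=1000, uops_not_delivered=800,
                           uops_issued=2200, uops_retired_slots=2000,
                           recovery_cycles=50)
    for name, fraction in breakdown.items():
        print(f"{name:16s} {fraction:6.1%}")
```

The four fractions sum to 1 by construction, so the category with the largest share indicates where to drill down next (for example into Frontend Latency versus Frontend Bandwidth).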