DNN Accelerator
by Professor Jaewook Lee (이재욱), Seoul National University
Performance Analysis
Performance Metrics
Time
- Wall-clock time, response time, elapsed time: time from start to completion
- CPU time: time spent executing a given program
= seconds/program = (cycles/program) x (seconds/cycle)
= (instructions/program) x (cycles/instruction) x (seconds/cycle)
-> Iron Law of CPU Performance: CPU time = instruction count x CPI (cycles per instruction) x cycle time (see the sketch after this list)
- Factors involved in CPU time
ISA: RISC (small, simple instruction set) vs. CISC (complex instruction set)
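A minimal sketch of the Iron Law in code; the instruction count, CPI, and clock frequency below are made-up numbers for illustration, not figures from the lecture.

```python
# Iron Law: CPU time = (instructions/program) x (cycles/instruction) x (seconds/cycle)
def cpu_time(instruction_count, cpi, clock_hz):
    """Return CPU time in seconds for one program run."""
    cycle_time = 1.0 / clock_hz          # seconds per cycle
    return instruction_count * cpi * cycle_time

# Hypothetical program: 2 billion instructions, CPI of 1.5, 3 GHz clock
print(cpu_time(2e9, 1.5, 3e9))           # -> 1.0 second
```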
Rate (throughput)
- MIPS (million instructions per second): a higher MIPS means a faster CPU
- MFLOPS (million floating-point operations per second)
Mean
- Arithmetic Mean (performance expressed in times): ∑(execution time) / n
  assumes each benchmark is run the same number of times
  -> Weighted Arithmetic Mean: ∑(W x execution time) / ∑W
- Harmonic Mean (performance expressed in rates): n / ∑(1/R)
  -> Weighted Harmonic Mean: ∑W / ∑(W/R)
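A small sketch of these four means in Python, using made-up benchmark times and rates for illustration:

```python
# Means over benchmark results: arithmetic/weighted-arithmetic for times,
# harmonic/weighted-harmonic for rates (e.g., MFLOPS).
def arithmetic_mean(times):
    return sum(times) / len(times)

def weighted_arithmetic_mean(times, weights):
    return sum(w * t for w, t in zip(weights, times)) / sum(weights)

def harmonic_mean(rates):
    return len(rates) / sum(1.0 / r for r in rates)

def weighted_harmonic_mean(rates, weights):
    return sum(weights) / sum(w / r for w, r in zip(weights, rates))

times   = [2.0, 4.0, 8.0]        # hypothetical execution times (s)
rates   = [500.0, 250.0, 125.0]  # hypothetical rates (MFLOPS)
weights = [1, 1, 2]              # the third benchmark runs twice as often

print(arithmetic_mean(times))                    # 4.67
print(weighted_arithmetic_mean(times, weights))  # 5.5
print(harmonic_mean(rates))                      # ~214.3
print(weighted_harmonic_mean(rates, weights))    # ~181.8
```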
Roofline Model
Performance of accelerators
Computation
Floating-point performance (GFLOP/sec) is the main target
- peak performance requires: ample parallelism, a balanced mix of floating-point instructions, rare branch mispredictions, no thread divergence
Communication
DRAM bandwidth (GB/sec) is the main concern
- peak bandwidth requires: few unit-stride access streams, locality-aware NUMA allocation and usage, SW prefetching, memory coalescing
Locality
Maximize cache locality to minimize communication overhead
- hardware changes: larger cache capacity (capacity misses), higher cache associativity (conflict misses), non-allocating caches (compulsory traffic)
- SW optimizations: padding avoids conflict misses, blocking avoids capacity misses, non-allocating stores minimize compulsory traffic (see the blocking sketch below)
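A minimal sketch of blocking (tiling) for matrix multiply, one of the SW optimizations above; the tile size is an assumed tuning parameter that should be chosen to fit the target cache, not a value from the lecture.

```python
import numpy as np

# Cache blocking (tiling): compute C = A @ B one tile at a time so each tile's
# working set stays resident in cache, reducing capacity misses.
def blocked_matmul(A, Bmat, tile=32):
    n = A.shape[0]
    C = np.zeros((n, n), dtype=A.dtype)
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                # Each tile of A and Bmat is reused many times while it is resident
                C[i0:i0+tile, j0:j0+tile] += (
                    A[i0:i0+tile, k0:k0+tile] @ Bmat[k0:k0+tile, j0:j0+tile]
                )
    return C

A = np.random.rand(128, 128)
B = np.random.rand(128, 128)
assert np.allclose(blocked_matmul(A, B), A @ B)
```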
Arithmetic Intensity (AI)
= total floating-point operations / total DRAM bytes accessed (FLOP/byte)
determined by the algorithm and the on-chip memory hierarchy
Roofline model
- the region around the ridge point is the ideal operating region
- provides a guideline for program optimization
- by identifying whether a kernel is compute-bound or memory-bound, you can choose between HW improvements and SW optimizations (sketch below)
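A minimal sketch of the roofline bound, assuming hypothetical peak compute and memory bandwidth figures (not values from the lecture):

```python
# Roofline model: attainable performance is limited by the compute roof or the
# memory bandwidth roof, depending on arithmetic intensity (AI).
PEAK_GFLOPS = 1000.0   # hypothetical compute roof (GFLOP/s)
PEAK_GBS    = 100.0    # hypothetical memory bandwidth roof (GB/s)

def attainable_gflops(ai):
    """ai: arithmetic intensity in FLOP/byte."""
    return min(PEAK_GFLOPS, ai * PEAK_GBS)

ridge_point = PEAK_GFLOPS / PEAK_GBS   # AI where the two roofs meet (10 FLOP/byte here)

for ai in [1, 5, 10, 20, 100]:
    bound = "memory-bound" if ai < ridge_point else "compute-bound"
    print(f"AI={ai:>3} FLOP/byte -> {attainable_gflops(ai):7.1f} GFLOP/s ({bound})")
```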
Benchmark Metrics
Metrics of DNN algorithms
- Accuracy
- Network Architecture: # of layers, filters, channels, filter size
- # of weights (storage capacity)
- # of MACs (operations), as counted in the sketch below
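A small sketch of how the # of weights and # of MACs follow from a convolutional layer's shape; the layer dimensions are hypothetical.

```python
# Count weights (storage) and MACs (work) for one conv layer with KxK filters.
def conv_layer_cost(c_in, c_out, k, h_out, w_out):
    weights = c_out * c_in * k * k      # filter parameters to store
    macs    = weights * h_out * w_out   # every output pixel uses each filter weight once
    return weights, macs

# e.g., 64 -> 128 channels, 3x3 filters, 56x56 output feature map
w, m = conv_layer_cost(64, 128, 3, 56, 56)
print(f"weights: {w:,}  MACs: {m:,}")   # weights: 73,728  MACs: 231,211,008
```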
Metrics for DNN hardware
- Energy efficiency
- External memory bandwidth
- Area efficiency
MLPerf Benchmark
A broad ML benchmark suite for measuring performance of ML frameworks, ML hardware accelerators, and ML cloud platforms
provides open-source implementations so that time, accuracy, and cost can be compared fairly on real-world data
DNN Accelerator Architectures
Technology Trends
- Moore's Law has already broken down
- CPU to xPU: hardware accelerators for specific computations are increasingly useful (e.g., Google TPU); parallelism, memory bandwidth, etc. are optimized for a specific domain
- Rise of AI and Big Data: workloads are compute/memory-intensive; parameter sizes, FLOPs, training cost, and memory capacity/bandwidth keep growing
- Platform war (HW + SW): PyTorch/TensorFlow/MXNet, M1+macOS, NVIDIA+CUDA, TPU+TensorFlow
GPU (NVIDIA)
A processor originally built for graphics work such as diffuse shading and 3D modeling
Uses many cores for data parallelism (the fetch and decode stages are shared across cores)
For DNNs, GPU models optimized for TensorFlow operations and matrix multiplication have appeared
- Hard DPU (Google TPU)
- Soft DPU (Microsoft BrainWave): an FPGA-based DNN serving platform
※ Self-attention carries a very large compute overhead (more than 30% of total operations)
-> ELSA: a self-attention HW accelerator IP (a design that accelerates only the self-attention computation)
Challenges in AI Computing Platform Design
- Cost of Data Movement: most energy is spent on data movement rather than on computation
  -> processing-in-memory (PIM), in-storage processing (ISP)
- Memory/Storage Bandwidth and Capacity
- Addressing the Bandwidth/Capacity Bottleneck
Hard DPUs
Highly parallel compute paradigms
Memory access is the bottleneck
Accessing DRAM for every operation is inefficient
-> use the memory hierarchy to reuse data (the reuse factors are counted in the sketch after this list)
- Convolutional Reuse (sliding one filter across the input): reuses filter weights and activations
- Fmap Reuse (applying multiple filters to the same data): reuses activations
- Filter Reuse (applying the same filter across multiple batches): reuses filter weights
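A rough sketch of the three reuse factors for a single stride-1 convolutional layer; the shapes are hypothetical and edge effects are ignored.

```python
# Per-datum reuse factors matching the three reuse types above.
def reuse_factors(batch, c_out, h_out, w_out):
    conv_reuse   = h_out * w_out   # each filter weight is applied at every output position
    fmap_reuse   = c_out           # each input activation is read by every output filter
    filter_reuse = batch           # each filter is applied to every input in the batch
    return conv_reuse, fmap_reuse, filter_reuse

# e.g., batch of 8, 128 output channels, 56x56 output feature map
print(reuse_factors(batch=8, c_out=128, h_out=56, w_out=56))
# -> (3136, 128, 8): without on-chip reuse, each of these would become a separate DRAM access
```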
Google TPU (Tensor Processing Unit)
TPU v1
Performs inference only
Connected to the host server over PCIe (matrix accelerator on the I/O bus)
The host server sends it instructions (like a floating-point unit)
※ Systolic Execution: control and data are pipelined for energy/time efficiency
Computations sharing the same weights are carried out in parallel (a minimal model is sketched below)
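A minimal cycle-by-cycle model of a weight-stationary systolic array computing one matrix-vector product; this is an illustrative sketch only, not the actual TPU microarchitecture.

```python
import numpy as np

# Weight-stationary systolic array computing y[j] = sum_i x[i] * W[i, j].
# PE (i, j) holds W[i, j]; activations enter row i with a one-cycle skew and move
# right, partial sums move down and accumulate, one hop per cycle.
def systolic_matvec(x, W):
    n = W.shape[0]
    act  = np.zeros((n, n))   # activation register at each PE
    psum = np.zeros((n, n))   # partial-sum register at each PE
    y = np.zeros(n)

    for cycle in range(2 * n):                 # enough cycles to drain the array
        new_act  = np.zeros((n, n))
        new_psum = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                # Activation comes from the left neighbor, or is injected at
                # column 0 with a skew of one cycle per row.
                if j == 0:
                    a = x[i] if cycle == i else 0.0
                else:
                    a = act[i, j - 1]
                new_act[i, j] = a
                # Accumulate the local MAC onto the partial sum from above.
                above = psum[i - 1, j] if i > 0 else 0.0
                new_psum[i, j] = above + a * W[i, j]
        act, psum = new_act, new_psum
        # A finished column sum leaves the bottom row at cycle (n - 1) + j.
        for j in range(n):
            if cycle == (n - 1) + j:
                y[j] = psum[n - 1, j]
    return y

W = np.arange(9, dtype=float).reshape(3, 3)
x = np.array([1.0, 2.0, 3.0])
assert np.allclose(systolic_matvec(x, W), x @ W)
```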
TPU v2
Targets training, which is far more demanding than inference
Because forward propagation, backward propagation, and weight updates must all be performed, tensor lifetimes are long
Soft DPU
Microsoft BrainWave
A scalable FPGA-powered DNN serving platform
Fast (ultra-low latency, high throughput), Flexible (adaptive numerical precision, custom operators), Friendly (turnkey deployment of CNTK/Caffe/TensorFlow)
SW (compiler and scheduler) plays a key role in achieving high utilization