[Paper Review] EfficientDet: Scalable and Efficient Object Detection - penny4860/study-note GitHub Wiki

1. Summary

Overview

Builds on RetinaNet as the base architecture and improves the following points.
Improvements
- Backbone
  - Uses EfficientNet
- Feature fusion method: BiFPN
- Uses compound scaling
  - Backbone
  - Input resolution
  - FPN width/depth
  - Subnet (cls, reg) width/depth

Questions

Is the weight used in feature fusion a scalar?
- Yes, it is a scalar.

2. Details

1. Introduction

Previous work
- Accurate detectors: very slow
  - AmoebaNet-based FPN detector
- Efficient detectors: efficient but less accurate
  - 1-stage
  - anchor-free
    - CornerNet
    - FCOS
    - Objects as Points
EfficientDet
- This work proposes a detector that pursues accuracy and efficiency at the same time
Contributions
1. Proposes a feature fusion method: BiFPN
  - weighted bidirectional feature pyramid network
2. Proposes a compound scaling method for the object detection task
  - Resolution
  - Backbone network
  - BiFPN width, depth
  - Subnetwork (cls, reg) width, depth
3. Achieves SOTA with EfficientDet

2. Related Work

1-stage detectors
- This work follows the 1-stage detector design
multi-scale feature representations
model scaling

3. BiFPN

BiFPN: the multi-scale feature fusion method used in the paper

3.2. Bidirectional cross scale connections

Original FPN
- Only a top-down flow exists
- P7 -> P6 -> P5 -> ...
PANet
- Top-down + Bottom-up
- Adds a bottom-up path after the top-down path
BiFPN
- Simplified PANet
- Adds extra edges
- Repeats the bidirectional path multiple times
  - The number of repetitions is a scaling factor: the BiFPN depth
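
The connectivity described above can be sketched as a small graph builder. This is only an illustration: the five levels P3 to P7, the node names (`P4_in`, `P4_td`, `P4_out`), and the helper name `bifpn_layer_edges` are assumptions based on the paper's figure, not something this note specifies.

```python
def bifpn_layer_edges(levels=(3, 4, 5, 6, 7)):
    """Return {node: [input nodes]} for one bidirectional layer (hypothetical helper)."""
    lo, hi = levels[0], levels[-1]
    edges = {}

    # Top-down path: intermediate "td" nodes exist only for the middle levels.
    # The top and bottom levels have no intermediate node (the simplification).
    for lvl in reversed(levels[1:-1]):
        upper = f"P{lvl + 1}_in" if lvl + 1 == hi else f"P{lvl + 1}_td"
        edges[f"P{lvl}_td"] = [f"P{lvl}_in", upper]

    # Bottom-up path: the lowest level fuses its input with the td node above it.
    edges[f"P{lo}_out"] = [f"P{lo}_in", f"P{lo + 1}_td"]
    for lvl in levels[1:-1]:
        # Extra edge: the original input feeds the output node directly.
        edges[f"P{lvl}_out"] = [f"P{lvl}_in", f"P{lvl}_td", f"P{lvl - 1}_out"]
    edges[f"P{hi}_out"] = [f"P{hi}_in", f"P{hi - 1}_out"]
    return edges


for node, inputs in bifpn_layer_edges().items():
    print(node, "<-", inputs)
```

Stacking this layer several times, with each layer's `_out` nodes feeding the next layer's `_in` nodes, corresponds to the repetition controlled by the BiFPN depth.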

3.3. Weighted Feature Fusion

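import numpy as np

def feature_fusion(w, input_nodes, eps=1e-4):
    # Fast normalized fusion: O = sum_i (w_i / (eps + sum_j w_j)) * I_i
    w = np.maximum(np.asarray(w, dtype=float), 0.0)  # ReLU keeps every weight >= 0
    normalized_w = w / (w.sum() + eps)                # each weight ends up in [0, 1]
    # Weighted sum of the input feature maps (all inputs must share one shape)
    return sum(nw * x for nw, x in zip(normalized_w, input_nodes))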

Learns the contribution of each input node: weighted
For stable training, the weights are normalized so that each one falls in the range [0, 1]
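
Written as a formula, the weighted fusion reads roughly as follows. Here $I_i$ are the input feature maps, $w_i$ the learned scalar weights, and $\epsilon$ a small constant for numerical stability. The constraint $w_i \ge 0$ (enforced with a ReLU) is taken from the paper's description and is an assumption relative to this note.

```latex
O = \sum_i \frac{w_i}{\epsilon + \sum_j w_j} \cdot I_i, \qquad w_i \ge 0
```

Because every $w_i$ is non-negative, each normalized coefficient lies in $[0, 1]$ and the coefficients sum to slightly less than 1.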

4. EfficientDet

4.1. EfficientDet Architecture

Structure similar to the 1-stage RetinaNet
Differences from RetinaNet
- Backbone: uses EfficientNet
- FPN method: uses BiFPN
- Width/depth of the BiFPN and the subnetworks
  - RetinaNet: fixed
  - EfficientDet: compound scaling
- Adds BN to the subnets

4.2. Compound Scaling

BiFPN depth/width
- w_bifpn = 64 * 1.35**theta
- d_bifpn = 2 + theta
Box/class subnet depth/width
- w = w_bifpn
- depth = 3 + floor(theta/3)
Resolution
- 512 + theta*128
Backbone
- EfficientNet B0 ~ B6
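
The scaling rules above can be collected into one small function. This is a minimal sketch: the function name `efficientdet_config`, the dictionary keys, and the mapping of `theta` directly to `EfficientNet-B{theta}` are assumptions for illustration. The width is returned as the raw value of the formula, without any rounding to a convenient channel count.

```python
def efficientdet_config(theta):
    """Apply the compound-scaling formulas for a given coefficient theta."""
    bifpn_width = 64 * 1.35 ** theta           # w_bifpn (raw, unrounded)
    return {
        "backbone": f"EfficientNet-B{theta}",  # assumed mapping, valid for theta in 0..6
        "resolution": 512 + theta * 128,
        "bifpn_width": bifpn_width,
        "bifpn_depth": 2 + theta,              # d_bifpn
        "head_width": bifpn_width,             # box/class subnets share the BiFPN width
        "head_depth": 3 + theta // 3,          # floor(theta / 3)
    }


for theta in range(7):
    print(theta, efficientdet_config(theta))
```

A single coefficient therefore grows the backbone, the resolution, the BiFPN, and the prediction subnets together, instead of tuning each one separately.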

5. Experiments

Training parameters
- SGD + momentum (0.9)
- weight decay 4e-5
- learning rate
  - warmup stage: the first 5% of all epochs
    - increases linearly over [0, 0.08]
  - regular stage
    - decreased following the cosine decay rule
- BatchNorm
  - bn decay : 0.997
  - eps : 1e-4
  - exponential moving average with decay : 0.9998
- focal loss
  - alpha : 0.25
  - gamma : 1.5
- uses auto-augmentation
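
The learning-rate schedule and the focal loss settings listed above can be sketched in plain Python. Several details are assumptions made for illustration: the schedule is expressed in steps rather than epochs, the cosine stage is assumed to decay all the way to 0, and the function names `learning_rate` and `focal_loss` are hypothetical. The focal loss is the standard binary form for a single prediction.

```python
import math


def learning_rate(step, total_steps, peak_lr=0.08, warmup_frac=0.05):
    """Linear warmup from 0 to peak_lr, then cosine decay (assumed to end at 0)."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


def focal_loss(p, y, alpha=0.25, gamma=1.5):
    """Binary focal loss for one predicted probability p and a label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t) ** gamma down-weights examples that are already classified well.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


for step in (0, 25, 50, 500, 1000):
    print(step, learning_rate(step, total_steps=1000))
print(focal_loss(0.9, 1), focal_loss(0.5, 1))
```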

6. Ablation Study

Backbone network and BiFPN
- (Baseline setting: RetinaNet) ResNet50 + FPN
  - ResNet50 => EfficientNet-B3
    - AP: 37% -> 40%
  - FPN => BiFPN
    - AP: 40% -> 44%
(BiFPN cross-scale connections) Comparison of feature fusion methods
- Accuracy
  - BiFPN > PANet (repeated) > BiFPN without weights
- Efficiency
  - BiFPN > BiFPN without weights > PANet (repeated)
  - More efficient than PANet because nodes with only a single input edge were removed
Softmax vs. Fast Normalized Fusion
- Softmax is slightly more accurate, but fast normalized fusion is about 1.3x faster at GPU inference
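
The two normalization schemes compared above can be put side by side. This is a minimal NumPy sketch with hypothetical function names. It only shows how the fusion weights are computed and says nothing about the measured speed difference.

```python
import numpy as np


def softmax_weights(w):
    """Softmax-based fusion weights: exponentiate, then normalize."""
    e = np.exp(w - np.max(w))  # subtract the max for numerical stability
    return e / e.sum()


def fast_normalized_weights(w, eps=1e-4):
    """Fast normalized fusion weights: ReLU, then divide by the sum (no exp)."""
    w = np.maximum(w, 0.0)
    return w / (w.sum() + eps)


w = np.array([1.0, 2.0, 0.5])
print(softmax_weights(w))
print(fast_normalized_weights(w))
```

Both variants map the raw weights into [0, 1]. The fast variant avoids the exponential, which is the operation the softmax version has to compute for every fusion node.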
Compound scaling
- Scaling all factors jointly (compound scaling) is more efficient than scaling any single factor