[Paper Review] Focal Loss for Dense Object Detection (RetinaNet) - penny4860/study-note GitHub Wiki

1. Overview

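Summary

The paper diagnoses class imbalance as the main problem of 1-stage detectors.
- 2-stage detectors handle it by removing easy negatives in the RPN.
Focal loss is proposed as the remedy.
- modulating term : (1-pt)**gamma
  - Weights each sample by how hard it is to classify.
  - Scales down the loss of samples with large pt (easy samples).
The RetinaNet model was SOTA at the time.
- First time a 1-stage detector reached SOTA
- Sources of the performance gain
  - FPN
  - focal loss + alpha balancing
    - gamma=2, alpha=0.25
  - Initialization of the last layer of the cls subnet
    - bias = -log((1-pi) / pi)
    - pi = 0.01
      - Background samples dominate, so the initial prediction is given a prior toward background (every anchor starts with a foreground probability of about pi).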

Questions

2. Contents

1. Introduction

In 1-stage detectors, class imbalance during training degrades detector accuracy.
- The imbalance comes from dense sampling of possible object locations.
How do 2-stage detectors deal with this problem?
- The proposal stage reduces the number of candidate regions.
- The classification stage applies sampling heuristics.
  - Train with a controlled foreground / background ratio, or use OHEM.
Can the same sampling heuristics be applied to 1-stage detectors?
- They can, but they are inefficient because training is still dominated by easily classified background examples.
  - (In 2-stage detectors, the proposal stage already removes most of the easily classified background.)

2. Related Work

3. Focal Loss

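CE loss
- Concept
  - Cross entropy is the negative log of the probability assigned to the target class.
- Formula
  - ce(pt) = -log(pt)
    - pt = p (when y=1)
    - pt = 1-p (when y=-1)
- Problems
  - The loss of an easy example is not negligible.
    - The closer pt is to 1.0, the easier the example (a sample that is easy to classify).
  - When there are too many easy examples, their summed loss overwhelms the loss of the rare class.
    - In 1-stage OD
      - the number of easy negative samples is very large, and
      - positive samples are very rare.
      - Training spends its effort lowering the loss of easy negatives, so the loss of the positive samples has little influence.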
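
To make the "easy negatives overwhelm the rare class" point concrete, here is a minimal sketch of ce(pt) in plain Python. The sample counts and probabilities are hypothetical values chosen for illustration, not numbers from the paper.

```python
import math

def cross_entropy(p, y):
    """Binary cross entropy for one sample.

    p : predicted probability of the positive class (0 < p < 1)
    y : label, +1 or -1
    """
    pt = p if y == 1 else 1.0 - p
    return -math.log(pt)

# Hypothetical batch: 100,000 easy negatives (pt = 0.99) and 100 hard positives (pt = 0.1).
easy_neg_total = 100_000 * cross_entropy(0.01, -1)
hard_pos_total = 100 * cross_entropy(0.1, 1)

# Each easy negative contributes little, but their sum is several times larger
# than the summed loss of the hard positives.
print(easy_neg_total, hard_pos_total)
```

Under these assumed counts, the gradient is mostly driven by samples the model already classifies correctly.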

3.1. Balanced Cross Entropy

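Concept
- Assign a different weight to each class.
- In 1-stage OD there are few positive samples, so the positive loss is given its own weight.
Formula
- ce(pt) = -log(pt) * alpha_t
  - alpha_t = alpha (y=1)
  - alpha_t = 1-alpha (y=-1)
- alpha controls the relative importance of positives and negatives.
Problem
- It can adjust the importance of each class, but
- it cannot adjust the importance of hard vs. easy samples.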
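
A small sketch of the alpha-balanced cross entropy, assuming the same p / y convention as the formula above. The function name and the alpha value in the usage lines are illustrative choices.

```python
import math

def balanced_cross_entropy(p, y, alpha):
    """Alpha-balanced cross entropy for one sample (y is +1 or -1)."""
    pt = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * math.log(pt)

# The weight depends only on the class, not on how easy the sample is:
# an easy positive (pt = 0.9) and a hard positive (pt = 0.1) are both scaled by alpha.
ratio_easy = balanced_cross_entropy(0.9, 1, alpha=0.75) / -math.log(0.9)
ratio_hard = balanced_cross_entropy(0.1, 1, alpha=0.75) / -math.log(0.1)
```

Both ratios equal alpha, which is exactly the limitation noted above: easy and hard samples of the same class cannot be separated.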

3.2. Focal Loss Definition

Concept
- Keep the loss of hard examples, while
- lowering the loss of easy examples.
Formula
- FL(pt) = ce(pt) * (1-pt)**gamma
  - (1-pt)**gamma : modulating term
    - pt ~ 1.0 : easy example
      - modulating term ~ 0 : the loss of easy samples is scaled down.
    - pt ~ 0.0 : hard example
      - modulating term ~ 1 : the loss stays close to ce(pt).
  - gamma = 0 recovers ce(pt).
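
The formula can be sketched in a few lines of plain Python. This is a per-sample illustration rather than a training-ready implementation; the optional `alpha` argument is an addition here that combines the focal term with the alpha balancing from 3.1.

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=None):
    """Focal loss for one sample.

    p     : predicted probability of the positive class (0 < p < 1)
    y     : label, +1 or -1
    gamma : focusing parameter; gamma = 0 gives plain cross entropy
    alpha : optional class weight for y = +1 (1 - alpha is used for y = -1)
    """
    pt = p if y == 1 else 1.0 - p
    loss = -((1.0 - pt) ** gamma) * math.log(pt)
    if alpha is not None:
        alpha_t = alpha if y == 1 else 1.0 - alpha
        loss *= alpha_t
    return loss

# Ratio of focal loss to cross entropy is the modulating term (1 - pt)**gamma.
easy = focal_loss(0.9, 1) / focal_loss(0.9, 1, gamma=0.0)  # about 0.01
hard = focal_loss(0.1, 1) / focal_loss(0.1, 1, gamma=0.0)  # about 0.81
```

With gamma = 2, the easy sample keeps roughly 1% of its cross-entropy loss while the hard sample keeps roughly 81%.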

3.3. Class Imbalance and Model Initialization

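An ordinary binary classifier is initialized so that pos and neg are equally likely.
For an imbalanced task, training is more stable when the rare class starts with a low probability.
- bias = -log((1-pi) / pi)
  - pi : probability assigned to the rare (foreground) class at the start of training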
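
A quick check of what this bias does, assuming the last layer's weights are near zero at initialization so that the logit is approximately the bias. The helper names are illustrative.

```python
import math

def prior_bias(pi):
    """Bias b such that sigmoid(b) == pi."""
    return -math.log((1.0 - pi) / pi)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

b = prior_bias(0.01)        # about -4.595
p_foreground = sigmoid(b)   # recovers pi = 0.01

# Initial cross entropy of one background sample under each initialization:
loss_equal_init = -math.log(1.0 - 0.5)           # about 0.693 when p starts at 0.5
loss_prior_init = -math.log(1.0 - p_foreground)  # about 0.010 when p starts at pi
```

Because almost all anchors are background, starting at p = pi keeps the initial total loss from being dominated by the negative class.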

3.4. Class Imbalance and 2-stage

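How 2-stage detectors handle the imbalance problem

2-stage cascade
- Narrows the candidate object locations down to a small number (about 1-2k).
- The RPN outputs only object-like regions, so most easy negatives are filtered out.
mini-batch sampling
- In the 2nd stage, the fg:bg ratio is fixed at 1:3.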

4. RetinaNet

Model structure
- Backbone
  - in : image
  - out : [C3, C4, C5]
- FPN
  - out : [P3, P4, P5, P6, P7]
    - shape of each : (W, H, 256), where W and H differ per level
- subnet
  - clsout : [P3_cls, P4_cls, P5_cls, P6_cls, P7_cls]
    - shape of each : (W, H, K*A)
    - RetinaNet final output 1: concat : (R, K)
      - Predicts K classes for each of the R possible regions (anchors)
  - regout : [P3_reg, P4_reg, P5_reg, P6_reg, P7_reg]
    - shape of each : (W, H, 4*A)
    - RetinaNet final output 2: concat : (R, 4)
      - Predicts 4 scalars (box offsets) for each of the R possible regions
anchors
- 9 anchors per grid cell
  - 3 aspect ratios
    - {1:2, 1:1, 2:1}
  - 3 scales
    - {2**0, 2**(1/3), 2**(2/3)}
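
The 3 x 3 anchor layout can be sketched as follows. The aspect ratios {1:2, 1:1, 2:1} and the per-level base sizes (32 for P3 up to 512 for P7) are taken from the paper rather than from these notes, and treating a ratio as height / width is an assumption of this sketch.

```python
import math

def anchor_shapes(base_size,
                  scales=(2 ** 0, 2 ** (1 / 3), 2 ** (2 / 3)),
                  ratios=(0.5, 1.0, 2.0)):
    """Return (width, height) for the A = len(scales) * len(ratios) anchors of one level."""
    shapes = []
    for scale in scales:
        area = (base_size * scale) ** 2
        for ratio in ratios:  # ratio is interpreted as height / width
            w = math.sqrt(area / ratio)
            h = w * ratio
            shapes.append((w, h))
    return shapes

p3_anchors = anchor_shapes(32)  # 9 shapes for level P3
```

Each scale keeps its area fixed while the aspect ratio changes, so the nine anchors cover three areas with three shapes each.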
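Anchor assignment rule during training
- Every anchor is assigned a target in each of the 2 output tensors.
  - concat : (R, K)
  - concat : (R, 4)
- assignment
  - IOU >= 0.5 : GT (foreground)
    - Set the entry of the matching class in the length-K vector
    - Set the coordinate offsets in the length-4 vector
  - IOU < 0.4 : BG
    - The length-K vector is all zeros
    - The length-4 vector is not set
  - 0.4 <= IOU < 0.5 : ignored during training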
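
A minimal sketch of the assignment rule for a single anchor, assuming boxes are given as (x1, y1, x2, y2). The function names and the returned tuple layout are hypothetical, and the "ignore" band for IoU in [0.4, 0.5) follows the paper.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def assign_anchor(anchor, gt_boxes, gt_classes, num_classes,
                  fg_thresh=0.5, bg_thresh=0.4):
    """Return (state, cls_target, matched_gt_index) for one anchor.

    state is 'fg', 'bg', or 'ignore'; cls_target is the length-K vector.
    The box regression target is only defined when state == 'fg'.
    """
    cls_target = [0] * num_classes
    if not gt_boxes:
        return "bg", cls_target, None
    ious = [iou(anchor, g) for g in gt_boxes]
    best = max(range(len(gt_boxes)), key=lambda i: ious[i])
    if ious[best] >= fg_thresh:
        cls_target[gt_classes[best]] = 1
        return "fg", cls_target, best
    if ious[best] < bg_thresh:
        return "bg", cls_target, None
    return "ignore", cls_target, None
```

For example, an anchor that exactly overlaps a ground-truth box of class 2 is marked 'fg' with a one-hot target at index 2, while an anchor far from every box gets an all-zero target.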
classification subnet
- input : Pn
  - [W, H, 256]
- Operations
  - (3, 3, 256)-conv repeated 4 times
  - followed by one (3, 3, K*A)-conv
- output : Pn_cls_out
  - [W, H, K*A]
    - K : number of classes
    - A : number of anchors
  - Outputs a K-vector for every Grid*Anchor
regression subnet
- input : Pn
  - [W, H, 256]
- Operations
  - (3, 3, 256)-conv repeated 4 times
  - followed by one (3, 3, 4*A)-conv
- output : Pn_reg_out
  - [W, H, 4*A]
    - 4 : box coordinates
    - A : number of anchors
  - Outputs the box coordinates for every Grid*Anchor
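
To get a feel for how large R becomes, the sketch below counts Grid*Anchor over all levels. It assumes strides of 8, 16, 32, 64, 128 for P3 to P7 and that each feature map size is the input size divided by the stride, rounded up; actual implementations may round differently.

```python
import math

def total_anchors(height, width, strides=(8, 16, 32, 64, 128), anchors_per_cell=9):
    """Number of anchors R over all pyramid levels for one input image."""
    total = 0
    for stride in strides:
        grid_h = math.ceil(height / stride)
        grid_w = math.ceil(width / stride)
        total += grid_h * grid_w * anchors_per_cell
    return total

r = total_anchors(640, 640)  # (80*80 + 40*40 + 20*20 + 10*10 + 5*5) * 9 = 76725
```

With tens of thousands of anchors per image and only a handful of objects, nearly all rows of the (R, K) classification output are background.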

4.1. Inference / Training

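Inference
- input : image
- output
  - output of the cls subnet : [R, K]
    - (P3, P4, P5, P6, P7)-clsout
    - The 3d-tensors of the 5 levels are reshaped to 2d-tensors and concatenated
      - [R, K] (R : total number of anchors over all levels)
  - output of the reg subnet : [R, 4]
    - (P3, P4, P5, P6, P7)-regout
    - The 3d-tensors of the 5 levels are reshaped to 2d-tensors and concatenated
      - [R, 4]
- post-processing
  - nms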
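
The nms step can be sketched as a greedy loop over score-sorted boxes. This is a simplified single-class version in plain Python; the 0.5 IoU threshold is the value used in the paper, and the function names are illustrative.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression; returns indices of kept boxes, best score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any already-kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # [0, 2]: the second box overlaps the first too much
```

In a multi-class detector this would typically be run once per class on the decoded boxes.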
Initialization
- Backbone : ImageNet-pretrained model
- Last layer of the cls subnet
  - bias = -log((1-pi) / pi)
  - Gives a prior toward background.
- Remaining layers
  - Gaussian weights with sigma = 0.01, bias = 0
Optimization
- SGD + momentum (0.9)
- lr = [0.01, 0.001, 0.0001] (starts at 0.01 and is divided by 10 twice)
- weight decay = 0.0001
- 2 images per GPU
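
The three learning-rate values correspond to a step schedule. The sketch below assumes the drop points reported in the paper (60k and 80k iterations); these are not stated in the notes, so treat them as an assumption.

```python
def learning_rate(iteration, base_lr=0.01, drops=(60_000, 80_000), factor=0.1):
    """Step schedule: multiply the learning rate by `factor` at each drop point."""
    lr = base_lr
    for step in drops:
        if iteration >= step:
            lr *= factor
    return lr

lrs = [learning_rate(i) for i in (0, 60_000, 85_000)]  # approximately 0.01, 0.001, 0.0001
```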

5. Experiments

Training data : trained on the COCO dataset
- Training : train(80k) + val(35k)
- Validation : val(5k)

5.1. Training Dense Detection

Experiments on loss and init
- Cross Entropy + Normal Init : training did not work well.
- Cross Entropy + Prior Init : 30.2 AP
- Cross Entropy + Prior Init + Alpha Balanced : 30.2 AP + 0.9
- Cross Entropy + Prior Init + Alpha Balanced + Focal Loss : 30.2 AP + 0.9 + 2.9
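Analysis of the Focal Loss
- Method
  - Compute the focal loss of a large number of samples with a trained RetinaNet
  - Sort the loss values of positives / negatives separately and plot the CDF
- Result
  - Most of the loss is concentrated on the hard samples with large loss.
    - more so as gamma grows
    - more so for background than for foreground
- Interpretation
  - Background samples are mostly easy negatives whose loss is scaled down, so the loss concentrates even more on the hard samples.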
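
The CDF analysis can be imitated on synthetic numbers. The sketch below uses a hypothetical set of negatives (990 easy ones with pt = 0.99 and 10 hard ones with pt = 0.2) instead of outputs from a trained model, so only the trend is meaningful, not the values.

```python
import math

def cumulative_loss_share(pts, gamma):
    """Sort per-sample focal losses ascending and return their normalized cumulative sum."""
    losses = sorted(-((1.0 - pt) ** gamma) * math.log(pt) for pt in pts)
    total = sum(losses)
    cdf, running = [], 0.0
    for loss in losses:
        running += loss
        cdf.append(running / total)
    return cdf

# Hypothetical negatives: 990 easy (pt = 0.99) and 10 hard (pt = 0.2).
pts = [0.99] * 990 + [0.2] * 10

def hardest_share(gamma, n_hard=10):
    """Fraction of the total loss carried by the n_hard highest-loss samples."""
    cdf = cumulative_loss_share(pts, gamma)
    return 1.0 - cdf[len(cdf) - n_hard - 1]

share_ce = hardest_share(gamma=0.0)  # plain cross entropy
share_fl = hardest_share(gamma=2.0)  # focal loss with gamma = 2
```

On this toy set the 10 hardest samples carry well over half of the loss under cross entropy and nearly all of it under gamma = 2, which matches the direction of the result described above.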
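Focal Loss vs. OHEM
- OHEM
  - Builds the mini-batch from the samples with the largest loss
  - Mainly used in the 2nd stage of 2-stage detectors
- Comparison with Focal Loss
  - In common : both focus on high-loss samples.
  - Difference : OHEM discards easy samples entirely, whereas focal loss only down-weights them.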