Faster RCNN

While Fast R-CNN [2] has dramatically reduced the running time of the training procedure and achieves near real-time rates using very deep networks, region proposal computation at test time has remained a bottleneck for real-time object detection. To overcome this issue, Faster R-CNN was proposed. Compared to previous approaches, Faster R-CNN is much more efficient and accurate at generating region proposals. Moreover, because it shares convolutional features with the downstream detection network, region proposals come at almost no extra cost, which enables Faster R-CNN to perform detection with high accuracy.

Faster R-CNN has two components: the first is a deep fully convolutional network used to propose regions, and the second is the Fast R-CNN detector. The first component, the Region Proposal Network (RPN), takes an image as input and generates rectangular proposals that could potentially contain objects, each with a score indicating the confidence that an object is present. To generate region proposals, a small network is slid over the convolutional feature map produced by the last shared convolutional layer, mapping each n×n spatial window to a lower-dimensional feature vector. Two sibling fully-connected layers, a box-regression layer and a box-classification layer, then compute the final proposed rectangles and their confidence scores. At each sliding-window location, multiple region proposals are predicted simultaneously; the proposals are parameterized relative to k fixed reference boxes, called anchors, as sketched below.
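To make the sliding-window description concrete, the sketch below shows what such an RPN head could look like in PyTorch. The channel widths and the choice of k = 9 anchors follow the VGG-16 setting of the paper, but the class and layer names are illustrative assumptions, not the code actually used in this project.

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Illustrative RPN head: a small network slid over the shared feature map."""

    def __init__(self, in_channels=512, mid_channels=512, num_anchors=9):
        super().__init__()
        # 3x3 conv implements the n-by-n sliding window over the feature map.
        self.conv = nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1)
        # Sibling 1x1 convs: box-classification and box-regression layers.
        self.cls_layer = nn.Conv2d(mid_channels, num_anchors * 2, kernel_size=1)
        self.reg_layer = nn.Conv2d(mid_channels, num_anchors * 4, kernel_size=1)

    def forward(self, feature_map):
        x = torch.relu(self.conv(feature_map))
        objectness = self.cls_layer(x)  # (N, 2k, H, W): object vs. background scores
        box_deltas = self.reg_layer(x)  # (N, 4k, H, W): offsets relative to the k anchors
        return objectness, box_deltas
```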

The RPN is trained by Stochastic Gradient Descent (SGD) with the standard back-propagation algorithm. The weights of the new layers are initialized from a zero-mean Gaussian distribution with standard deviation 0.01, while the remaining layers are initialized from a model pre-trained on ImageNet classification. The learning rate is set to 0.0001, and a momentum of 0.9 is used to speed up convergence.
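As a minimal sketch of this setup, assuming PyTorch and the hypothetical RPNHead above, the new layers could be initialized and the optimizer configured as follows (the backbone is assumed to already carry the ImageNet-pretrained weights):

```python
import torch
import torch.nn as nn

def init_new_layers(module):
    """Initialize newly added layers from a zero-mean Gaussian with std 0.01."""
    for m in module.modules():
        if isinstance(m, nn.Conv2d):
            nn.init.normal_(m.weight, mean=0.0, std=0.01)
            nn.init.constant_(m.bias, 0.0)

rpn_head = RPNHead()  # illustrative module from the sketch above
init_new_layers(rpn_head)

# SGD with the learning rate and momentum quoted in this section.
optimizer = torch.optim.SGD(rpn_head.parameters(), lr=1e-4, momentum=0.9)
```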

One of the most important characteristics of Faster R-CNN is that it allows convolutional layers to be shared between the RPN and the Fast R-CNN network. There are three ways to train with shared convolutional layers: alternating training, approximate joint training, and non-approximate joint training. In alternating training, the RPN is trained first and its proposals are used to train Fast R-CNN; the tuned network is then used to re-initialize the RPN, and this procedure is iterated. In approximate joint training, the RPN and Fast R-CNN are merged into one network. In each SGD iteration, the region proposals computed in the forward pass are treated as fixed, pre-computed proposals for the Fast R-CNN detector, and for back-propagation the losses of the two networks are combined (see the sketch below). Unlike approximate joint training, non-approximate joint training also propagates gradients with respect to the proposal box coordinates themselves, which requires an RoI pooling layer that is differentiable with respect to the box coordinates.
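The sketch below illustrates one SGD step of approximate joint training under these assumptions; `backbone`, `rpn`, and `detector` are hypothetical callables standing in for the shared convolutional layers, the RPN, and the Fast R-CNN head, and their signatures are assumptions made for illustration only:

```python
import torch

def approximate_joint_step(images, targets, backbone, rpn, detector, optimizer):
    """One SGD iteration of approximate joint training (illustrative only)."""
    features = backbone(images)                    # shared convolutional features
    proposals, rpn_loss = rpn(features, targets)   # proposals + RPN cls/reg losses

    # Treat the proposals as fixed, pre-computed inputs to the detector:
    # no gradient flows through the proposal box coordinates.
    proposals = [p.detach() for p in proposals]

    det_loss = detector(features, proposals, targets)  # Fast R-CNN cls/reg losses

    loss = rpn_loss + det_loss    # combined loss of the two networks
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```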

Compared to other state-of-the-art algorithms, Faster R-CNN is much faster at real-time detection and recognition. This speed comes from the fact that the two networks share the same convolutional features (and, during joint training, a combined loss function), which reduces computation; in addition, region proposals are not cropped directly from the image. However, despite being much faster, according to [4] it processes only 7 frames per second (FPS) at 73.2% mAP, so there is still considerable room for improvement.