Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Authors: Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun

NeurIPS 2015

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluate our method on the PASCAL VOC detection benchmarks [4], where RPNs with Fast R-CNNs produce detection accuracy better than the strong baseline of Selective Search with Fast R-CNNs. Meanwhile, our method waives nearly all computational burdens of SS at test-time; the effective running time for proposals is just 10 milliseconds. Using the expensive very deep models of [19], our detection method still has a frame rate of 5fps (including all steps) on a GPU, and thus is a practical object detection system in terms of both speed and accuracy (73.2% mAP on PASCAL VOC 2007 and 70.4% mAP on 2012).
Researcher Affiliation Collaboration Shaoqing Ren Kaiming He Ross Girshick Jian Sun Microsoft Research {v-shren, kahe, rbg, jiansun}@microsoft.com. Shaoqing Ren is with the University of Science and Technology of China. This work was done when he was an intern at Microsoft Research.
Pseudocode No The paper describes the algorithms and training scheme in paragraph text and figures, but does not include formal pseudocode blocks or algorithms.
Open Source Code Yes Code is available at https://github.com/ShaoqingRen/faster_rcnn.
Open Datasets Yes We comprehensively evaluate our method on the PASCAL VOC 2007 detection benchmark [4]. This dataset consists of about 5k trainval images and 5k test images over 20 object categories. We also provide results in the PASCAL VOC 2012 benchmark for a few models. [4] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results, 2007.
Dataset Splits Yes This dataset consists of about 5k trainval images and 5k test images over 20 object categories. We also provide results in the PASCAL VOC 2012 benchmark for a few models. For the ImageNet pre-trained network, we use the fast version of ZF net [23] that has 5 conv layers and 3 fc layers, and the public VGG-16 model [19] that has 13 conv layers and 3 fc layers.
Hardware Specification Yes Table 4: Timing (ms) on a K40 GPU, except SS proposal is evaluated in a CPU.
Software Dependencies No Our implementation uses Caffe [10]. The paper does not specify the version of Caffe or any other software dependencies with version numbers.
Experiment Setup Yes We use a learning rate of 0.001 for 60k mini-batches, and 0.0001 for the next 20k mini-batches on the PASCAL dataset. We also use a momentum of 0.9 and a weight decay of 0.0005 [11]. Our implementation uses Caffe [10]. We randomly initialize all new layers by drawing weights from a zero-mean Gaussian distribution with standard deviation 0.01. For anchors, we use 3 scales with box areas of 128^2, 256^2, and 512^2 pixels, and 3 aspect ratios of 1:1, 1:2, and 2:1.
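The anchor and learning-rate settings quoted in the Experiment Setup row can be made concrete with a small sketch. This is an illustration under stated assumptions, not the authors' Caffe implementation: the names `make_anchors`, `learning_rate`, and `LR_SCHEDULE` are hypothetical, anchors are centered at the origin rather than tiled over a feature map, and each aspect ratio is read as height:width, which the quoted text does not specify.

```python
import math

# Step schedule from the quoted setup: (number of mini-batches, learning rate).
LR_SCHEDULE = ((60_000, 0.001), (20_000, 0.0001))


def make_anchors(scales=(128, 256, 512), ratios=(1.0, 0.5, 2.0)):
    """Return 3 x 3 = 9 anchors as (x1, y1, x2, y2), centered at the origin.

    Each anchor has area scale**2. The ratio is treated as height / width,
    which is an assumption made for this sketch.
    """
    anchors = []
    for scale in scales:
        for ratio in ratios:
            width = scale / math.sqrt(ratio)
            height = scale * math.sqrt(ratio)
            anchors.append((-width / 2, -height / 2, width / 2, height / 2))
    return anchors


def learning_rate(iteration):
    """Return the learning rate for a 0-based mini-batch index."""
    start = 0
    for length, lr in LR_SCHEDULE:
        if start <= iteration < start + length:
            return lr
        start += length
    raise ValueError("iteration is outside the 80k mini-batch schedule")


if __name__ == "__main__":
    for box in make_anchors():
        print(tuple(round(v, 1) for v in box))
    print(learning_rate(0), learning_rate(60_000))
```

In a full detector these nine reference boxes would be shifted to every sliding-window position; the sketch only covers the shapes and the two-stage learning-rate schedule.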