Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
Authors: Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun
NeurIPS 2015
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our method on the PASCAL VOC detection benchmarks [4], where RPNs with Fast R-CNNs produce detection accuracy better than the strong baseline of Selective Search with Fast R-CNNs. Meanwhile, our method waives nearly all computational burdens of SS at test-time; the effective running time for proposals is just 10 milliseconds. Using the expensive very deep models of [19], our detection method still has a frame rate of 5 fps (including all steps) on a GPU, and thus is a practical object detection system in terms of both speed and accuracy (73.2% mAP on PASCAL VOC 2007 and 70.4% mAP on 2012). |
| Researcher Affiliation | Collaboration | Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. Microsoft Research. {v-shren, kahe, rbg, jiansun}@microsoft.com. Shaoqing Ren is with the University of Science and Technology of China. This work was done when he was an intern at Microsoft Research. |
| Pseudocode | No | The paper describes the algorithms and training scheme in paragraph text and figures, but does not include formal pseudocode blocks or algorithms. |
| Open Source Code | Yes | Code is available at https://github.com/ShaoqingRen/faster_rcnn. |
| Open Datasets | Yes | We comprehensively evaluate our method on the PASCAL VOC 2007 detection benchmark [4]. This dataset consists of about 5k trainval images and 5k test images over 20 object categories. We also provide results in the PASCAL VOC 2012 benchmark for a few models. [4] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results, 2007. |
| Dataset Splits | Yes | This dataset consists of about 5k trainval images and 5k test images over 20 object categories. We also provide results in the PASCAL VOC 2012 benchmark for a few models. For the ImageNet pre-trained network, we use the fast version of ZF net [23] that has 5 conv layers and 3 fc layers, and the public VGG-16 model [19] that has 13 conv layers and 3 fc layers. |
| Hardware Specification | Yes | Table 4: Timing (ms) on a K40 GPU, except SS proposal is evaluated on a CPU. |
| Software Dependencies | No | Our implementation uses Caffe [10]. The paper does not specify the version of Caffe or any other software dependencies with version numbers. |
| Experiment Setup | Yes | We use a learning rate of 0.001 for 60k mini-batches, and 0.0001 for the next 20k mini-batches on the PASCAL dataset. We also use a momentum of 0.9 and a weight decay of 0.0005 [11]. Our implementation uses Caffe [10]. We randomly initialize all new layers by drawing weights from a zero-mean Gaussian distribution with standard deviation 0.01. For anchors, we use 3 scales with box areas of 128^2, 256^2, and 512^2 pixels, and 3 aspect ratios of 1:1, 1:2, and 2:1. |
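
The anchor configuration and learning-rate schedule quoted in the Experiment Setup row can be made concrete with a small sketch. This is an illustration under stated assumptions, not the authors' Caffe implementation: the function names `anchor_shapes` and `learning_rate` are hypothetical, the aspect ratio is assumed to mean height divided by width, and each anchor is assumed to keep the area of its scale exactly (no rounding to integer pixels).

```python
import math

# Values taken from the Experiment Setup row.
SCALES = (128, 256, 512)    # box areas of 128^2, 256^2, 512^2 pixels
RATIOS = (1.0, 0.5, 2.0)    # 1:1, 1:2, 2:1 (assumed here to be height / width)


def anchor_shapes(scales=SCALES, ratios=RATIOS):
    """Return (width, height) for every scale/ratio pair (hypothetical helper).

    For a target area s^2 and ratio r = h / w, solving w * h = s^2 gives
    w = s / sqrt(r) and h = s * sqrt(r).
    """
    shapes = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            shapes.append((w, h))
    return shapes


def learning_rate(iteration):
    """Step schedule from the row: 0.001 for 60k mini-batches, then 0.0001 for 20k."""
    if iteration < 0 or iteration >= 80_000:
        raise ValueError("iteration is outside the 80k mini-batch schedule")
    return 0.001 if iteration < 60_000 else 0.0001


if __name__ == "__main__":
    for w, h in anchor_shapes():
        print(f"anchor {w:7.1f} x {h:7.1f}  area={w * h:9.0f}")
    print(learning_rate(0), learning_rate(60_000))
```

Three scales times three aspect ratios yields nine anchor shapes per sliding-window position. The momentum (0.9) and weight decay (0.0005) from the same row would be passed to whatever SGD optimizer is used and are not modeled here.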