Learning Lightweight Object Detectors via Multi-Teacher Progressive Distillation
Authors: Shengcao Cao, Mengtian Li, James Hays, Deva Ramanan, Yu-Xiong Wang, Liangyan Gui
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform comprehensive empirical evaluation on the challenging MS COCO dataset and observe consistent gains, regardless of the distillation loss complexity (from a simple feature-matching loss in Table 3 to the most advanced, sophisticated losses in Figure 4). MTPD learns lightweight RetinaNet and Mask R-CNN with state-of-the-art accuracy, even in heterogeneous backbone and input resolution settings. Perhaps most impressively, for the first time, we investigate heterogeneous distillation from Transformer-based teacher detectors to a convolution-based student, and find progressive distillation is the key to bridge their gap (Figure 1, Table 5). |
| Researcher Affiliation | Collaboration | 1University of Illinois Urbana-Champaign 2Carnegie Mellon University 3Now at Waymo 4Georgia Institute of Technology. Correspondence to: Shengcao Cao <cao44@illinois.edu>. |
| Pseudocode | Yes | We design a heuristic algorithm, Backward Greedy Selection (BGS), to acquire a near-optimal distillation order O automatically (see pseudo-code in Algorithm 1 and illustration in Figure 3). |
| Open Source Code | Yes | Code available at https://github.com/Shengcao-Cao/MTPD. |
| Open Datasets | Yes | We mainly evaluate on the challenging object detection dataset MS COCO 2017 (Lin et al., 2014), which contains bounding boxes and segmentation masks for 80 common object categories. We train our models on the split of train2017 (118k images) and report results on val2017 (5k images). We also evaluate on another object detection dataset Argoverse-HD (Chang et al., 2019), and a more challenging evaluation protocol in streaming perception (Li et al., 2020a). |
| Dataset Splits | Yes | We train our models on the split of train2017 (118k images) and report results on val2017 (5k images). |
| Hardware Specification | Yes | The second column denotes the optimal input resolution (that maximizes streaming accuracy). First, we discover that a lighter model and full-resolution input is much more helpful than having an accurate but complex model that needs to downsize input resolution. Second, MTPD further improves over the lightweight model. (Table 14 also mentions 'Tesla V100 GPU' for streaming accuracy experiments). |
| Software Dependencies | No | We implement detectors and their distillation using the MMDetection codebase (Chen et al., 2019b). While MMDetection is mentioned, no specific version number for it or other software libraries (e.g., Python, PyTorch) is provided. |
| Experiment Setup | Yes | We train on 8 GPUs for 12 epochs for each distillation. For MS COCO, we use the standard input resolution of 1,333 × 800, with each GPU hosting 2 images...We use an initial learning rate of 0.01 (for RetinaNet students) or 0.02 (for Mask R-CNN students). We use stochastic gradient descent and a momentum of 0.9. For the simple feature-matching loss (see Section 3.1), we perform a grid search over the hyper-parameter λ. While the optimal values are dependent on the architectures of the teacher and student models, we find that the performance is not very sensitive to λ between 0.3 and 0.8. We set λ = 0.5 for RetinaNet students and λ = 0.8 for Mask R-CNN students. |
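The simple feature-matching distillation described in the setup row (detection loss plus a λ-weighted penalty between student and teacher features) can be sketched as below. This is a minimal illustration only: it assumes an MSE penalty summed over matched feature-pyramid levels, and the function names are hypothetical, not taken from the MTPD codebase.

```python
import numpy as np


def feature_matching_term(student_feat, teacher_feat, lam=0.5):
    """One level of a feature-matching distillation loss:
    lam * mean squared error between student and teacher feature maps.
    (A sketch; the paper's exact normalization may differ.)"""
    assert student_feat.shape == teacher_feat.shape
    return lam * float(np.mean((student_feat - teacher_feat) ** 2))


def distillation_loss(det_loss, student_feats, teacher_feats, lam=0.5):
    """Total training loss: ordinary detection loss plus the
    feature-matching term summed over pyramid levels."""
    distill = sum(
        feature_matching_term(s, t, lam)
        for s, t in zip(student_feats, teacher_feats)
    )
    return det_loss + distill
```

Per the paper, λ = 0.5 would be used for RetinaNet students and λ = 0.8 for Mask R-CNN students, with results reported as insensitive to λ in the 0.3-0.8 range.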