Learning Lightweight Object Detectors via Multi-Teacher Progressive Distillation

Authors: Shengcao Cao, Mengtian Li, James Hays, Deva Ramanan, Yu-Xiong Wang, Liangyan Gui

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform comprehensive empirical evaluation on the challenging MS COCO dataset and observe consistent gains, regardless of the distillation loss complexity (from a simple feature-matching loss in Table 3 to the most advanced, sophisticated losses in Figure 4). MTPD learns lightweight RetinaNet and Mask R-CNN with state-of-the-art accuracy, even in heterogeneous backbone and input resolution settings. Perhaps most impressively, for the first time, we investigate heterogeneous distillation from Transformer-based teacher detectors to a convolution-based student, and find progressive distillation is the key to bridge their gap (Figure 1, Table 5).
Researcher Affiliation | Collaboration | (1) University of Illinois Urbana-Champaign, (2) Carnegie Mellon University, (3) Now at Waymo, (4) Georgia Institute of Technology. Correspondence to: Shengcao Cao <cao44@illinois.edu>.
Pseudocode | Yes | We design a heuristic algorithm, Backward Greedy Selection (BGS), to acquire a near-optimal distillation order O automatically (see pseudo-code in Algorithm 1 and illustration in Figure 3).
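The quoted description can be illustrated with a minimal sketch of a backward greedy ordering heuristic in the spirit of Algorithm 1. This is not the paper's exact procedure: the selection criterion (`distance` below) is a hypothetical stand-in for whatever model-gap measure BGS actually uses, and the toy scalar "capacities" exist only for the usage example.

```python
def backward_greedy_selection(student, teachers, distance):
    """Build a distillation order ending at the student.

    Starting from the student, repeatedly pick the remaining teacher
    closest to the most recently selected model, then reverse the
    sequence so distillation proceeds strongest teacher -> ... -> student.
    """
    remaining = list(teachers)
    order = []            # built backward, outward from the student
    current = student
    while remaining:
        nearest = min(remaining, key=lambda t: distance(current, t))
        remaining.remove(nearest)
        order.append(nearest)
        current = nearest
    order.reverse()       # most distant teacher distills first
    return order

# Toy usage: scalar "capacities" as a hypothetical distance proxy.
models = {"student": 1.0, "t1": 2.0, "t2": 4.0, "t3": 8.0}
dist = lambda a, b: abs(models[a] - models[b])
print(backward_greedy_selection("student", ["t1", "t2", "t3"], dist))
# -> ['t3', 't2', 't1']
```

The greedy choice keeps each consecutive teacher-student pair in the chain as similar as possible, which is the intuition behind progressive distillation bridging large capacity gaps.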
Open Source Code | Yes | Code available at https://github.com/Shengcao-Cao/MTPD.
Open Datasets | Yes | We mainly evaluate on the challenging object detection dataset MS COCO 2017 (Lin et al., 2014), which contains bounding boxes and segmentation masks for 80 common object categories. We train our models on the split of train2017 (118k images) and report results on val2017 (5k images). We also evaluate on another object detection dataset Argoverse-HD (Chang et al., 2019), and a more challenging evaluation protocol in streaming perception (Li et al., 2020a).
Dataset Splits | Yes | We train our models on the split of train2017 (118k images) and report results on val2017 (5k images).
Hardware Specification | Yes | The second column denotes the optimal input resolution (that maximizes streaming accuracy). First, we discover that a lighter model and full-resolution input is much more helpful than having an accurate but complex model that needs to downsize input resolution. Second, MTPD further improves over the lightweight model. (Table 14 also mentions a Tesla V100 GPU for the streaming accuracy experiments.)
Software Dependencies | No | We implement detectors and their distillation using the MMDetection codebase (Chen et al., 2019b). While MMDetection is mentioned, no specific version number for it or other software libraries (e.g., Python, PyTorch) is provided.
Experiment Setup | Yes | We train on 8 GPUs for 12 epochs for each distillation. For MS COCO, we use the standard input resolution of 1,333 × 800, with each GPU hosting 2 images...We use an initial learning rate of 0.01 (for RetinaNet students) or 0.02 (for Mask R-CNN students). We use stochastic gradient descent and a momentum of 0.9. For the simple feature-matching loss (see Section 3.1), we perform a grid search over the hyper-parameter λ. While the optimal values are dependent on the architectures of the teacher and student models, we find that the performance is not very sensitive to λ between 0.3 and 0.8. We set λ = 0.5 for RetinaNet students and λ = 0.8 for Mask R-CNN students.
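The λ-weighted feature-matching term described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: it omits channel-alignment adapters and the exact feature locations at which matching is applied, and simply sums a mean-squared error over pyramid levels before scaling by λ.

```python
import numpy as np

def feature_matching_loss(student_feats, teacher_feats, lam=0.5):
    """Simple feature-matching distillation term.

    student_feats / teacher_feats: lists of same-shaped feature maps,
    one per pyramid level. Returns lam * sum of per-level MSE; in
    training this term is added to the student's detection loss.
    lam follows the quoted settings (0.5 for RetinaNet students,
    0.8 for Mask R-CNN students).
    """
    match = sum(
        float(np.mean((s - t) ** 2))
        for s, t in zip(student_feats, teacher_feats)
    )
    return lam * match

# Usage: one 2x2 level where student and teacher differ by 1 everywhere.
s = [np.zeros((2, 2))]
t = [np.ones((2, 2))]
print(feature_matching_loss(s, t, lam=0.5))
# -> 0.5
```

The reported insensitivity of results to λ in [0.3, 0.8] is consistent with this term acting as a soft regularizer alongside the detection loss rather than dominating it.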