Open-Vocabulary Object Detection upon Frozen Vision and Language Models
Authors: Weicheng Kuo, Yin Cui, Xiuye Gu, AJ Piergiovanni, Anelia Angelova
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present F-VLM, a simple open-vocabulary object detection method built upon Frozen Vision and Language Models. F-VLM simplifies the current multi-stage training pipeline by eliminating the need for knowledge distillation or detection-tailored pretraining. Surprisingly, we observe that a frozen VLM: 1) retains the locality-sensitive features necessary for detection, and 2) is a strong region classifier. We finetune only the detector head and combine the detector and VLM outputs for each region at inference time. F-VLM shows compelling scaling behavior and achieves +6.5 mask AP improvement over the previous state-of-the-art on LVIS open-vocabulary detection benchmark at system level. In addition, we demonstrate very competitive results on COCO open-vocabulary detection benchmark and cross-dataset transfer detection, in addition to significant training speedup and compute savings. The code will be released. We demonstrate the efficacy of F-VLM on LVIS (Gupta et al., 2019), COCO (Lin et al., 2014) and Objects365 (Shao et al., 2019). (A hedged sketch of the per-region scoring and score combination used at inference appears after the table.) |
| Researcher Affiliation | Industry | Google Research, Brain Team; Google Research, Perception {weicheng, yincui, xiuyegu, ajpierrovi, anelia}@google.com |
| Pseudocode | No | The paper describes the architecture and steps of F-VLM using text and diagrams (Figure 2), but does not present a formal pseudocode or algorithm block. |
| Open Source Code | No | The code will be released. (This indicates a future release, not current availability.) |
| Open Datasets | Yes | We evaluate our approach on the LVIS dataset (Gupta et al., 2019) which contains a large and diverse set of 1203 object categories suitable for open-vocabulary detection. Additional datasets referenced: COCO (Lin et al., 2014), Objects365 (Shao et al., 2019), and Ego4D (Grauman et al., 2022). |
| Dataset Splits | Yes | Following the existing works (Gu et al., 2022; Zhong et al., 2022), we treat the frequent and common categories as the base categories CB for training, and hold out the rare categories as novel categories CN for testing. Mask APr is the main metric we benchmark on. To ensure reproducibility, we report the mean of 5 independent runs following the protocol of (Gu et al., 2022) and the best practice of LVIS challenge (Gupta et al., 2019). This setup divides COCO vocabulary into 48 base categories for training and 17 novel categories for testing. We follow the standard practice and report results in the generalized detection settings without instance segmentation. Similar to LVIS, we report the mean of 5 independent runs to ensure reproducibility. |
| Hardware Specification | No | The paper mentions 'TPUv3 cores' when discussing training resources but does not provide specific details such as model numbers, memory, or clock speeds, which are required for a hardware specification. |
| Software Dependencies | No | The paper mentions 'CLIP (Radford et al., 2021)' and 'Mask R-CNN (He et al., 2017)' as models/frameworks used, but does not specify their version numbers or other software dependencies with versions. |
| Experiment Setup | Yes | We train the model for 46.1k iterations with 1024x1024 image size, large scale jittering (Ghiasi et al., 2021), batch size 256, weight decay 1e-4, momentum 0.9, and an initial learning rate 0.36. For the score combination, we use α = 0.35 and β = 0.65 in equation 5. We use a maximum of 300 detections per image, and set temperature T = 0.01 in equation 4. Table 12 summarizes the hyper-parameters we use for LVIS and COCO experiments. (See the sketches after the table for a hedged reading of equations 4 and 5.) |
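
The Experiment Setup row references a per-region VLM score with temperature T = 0.01 (equation 4 of the paper), produced by the frozen backbone acting as a region classifier. Below is a minimal NumPy sketch of the standard formulation this suggests: cosine similarity between pooled region features and class text embeddings, followed by a temperature-scaled softmax. The function name `vlm_region_scores` and the exact pooling and normalization details are illustrative assumptions, not the authors' released code.

```python
import numpy as np

def vlm_region_scores(region_features, text_embeddings, temperature=0.01):
    """Hypothetical sketch of the VLM region-classification score (cf. eq. 4).

    region_features: (R, D) region embeddings pooled from the frozen VLM
        backbone feature map, one row per proposal.
    text_embeddings: (C, D) class-name embeddings from the VLM text encoder.
    Returns an (R, C) array of per-region class probabilities.
    """
    # L2-normalize both sides so the dot product below is a cosine similarity.
    regions = region_features / np.linalg.norm(region_features, axis=-1, keepdims=True)
    texts = text_embeddings / np.linalg.norm(text_embeddings, axis=-1, keepdims=True)

    # Cosine similarities, sharpened by the temperature (T = 0.01 in the quoted setup).
    logits = regions @ texts.T / temperature

    # Softmax over categories; subtract the row max for numerical stability.
    logits -= logits.max(axis=-1, keepdims=True)
    exp = np.exp(logits)
    return exp / exp.sum(axis=-1, keepdims=True)
```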
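
Equation 5 then combines the detector-head score with this VLM score using α = 0.35 and β = 0.65. The sketch below assumes the geometric-ensembling form common to open-vocabulary detectors, with the VLM score weighted more heavily on novel categories; the exact exponent placement, and the names `combine_scores` and `is_base`, are assumptions to be checked against equation 5 of the paper.

```python
import numpy as np

def combine_scores(det_scores, vlm_scores, is_base, alpha=0.35, beta=0.65):
    """Hypothetical sketch of the detector/VLM score combination (cf. eq. 5).

    det_scores, vlm_scores: (R, C) detector-head and VLM probabilities
        for R regions and C categories.
    is_base: (C,) boolean mask, True for base (training) categories and
        False for held-out novel categories.
    Returns the (R, C) combined scores used to rank detections.
    """
    # Assumed form: geometric interpolation, with the VLM score carrying
    # weight alpha on base categories and the larger weight beta on novel ones.
    base_score = det_scores ** (1.0 - alpha) * vlm_scores ** alpha
    novel_score = det_scores ** (1.0 - beta) * vlm_scores ** beta
    return np.where(is_base[None, :], base_score, novel_score)
```

Under this reading, the "maximum of 300 detections per image" quoted in the setup would be applied after ranking regions by these combined scores.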