Simple Image-Level Classification Improves Open-Vocabulary Object Detection

Authors: Ruohuan Fang, Guansong Pang, Xiao Bai

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This is verified by extensive empirical results on two popular benchmarks, OV-LVIS and OV-COCO, which show that SIC-CADS achieves significant and consistent improvement when combined with different types of OVOD models.
Researcher Affiliation | Academia | (1) School of Computer Science and Engineering, Beihang University; (2) School of Computing and Information Systems, Singapore Management University; (3) State Key Laboratory of Software Development Environment, Jiangxi Research Institute, Beihang University
Pseudocode | No | The paper describes its methods using text and mathematical equations, but does not include any distinct pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/mala-lab/SIC-CADS.
Open Datasets | Yes | We evaluate our method on LVIS v1.0 (Gupta, Dollar, and Girshick 2019) and COCO (Lin et al. 2014) under the open-vocabulary settings, as defined by recent works (Zareian et al. 2021; Gu et al. 2021), with the benchmarks named OV-COCO and OV-LVIS respectively. OV-LVIS: LVIS is a large-vocabulary instance segmentation dataset containing 1,203 categories. The categories are divided into three groups based on their appearance frequency in the dataset: frequent, common, and rare. Following the protocol introduced by (Gu et al. 2021), we treat the frequent and common categories as base categories (denoted LVIS-Base) to train our model, and the 337 rare categories as novel categories during testing. OV-COCO: We follow the open-vocabulary setting defined by (Zareian et al. 2021) and split the categories into 48 base categories and 17 novel categories. Only base categories in COCO train2017 (denoted COCO-Base) are used for training.
Dataset Splits | Yes | Following the protocol introduced by (Gu et al. 2021), we treat the frequent and common categories as base categories (denoted LVIS-Base) to train our model, and the 337 rare categories as novel categories during testing. Only base categories in COCO train2017 (denoted COCO-Base) are used for training. For OV-LVIS, our MLR model is trained using the image-level labels of LVIS-Base for 90,000 iterations with a batch size of 64 (48 epochs).
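The base/novel split quoted above (frequent and common categories as base, rare categories as novel) can be sketched in a few lines. This is an illustrative sketch, not the authors' code: the function name is hypothetical, and the `frequency` field mirrors the `"f"`/`"c"`/`"r"` convention used in LVIS category annotations.

```python
# Hypothetical sketch of the OV-LVIS base/novel category split described above.
# Assumes each category dict carries a "frequency" field with values
# "f" (frequent), "c" (common), or "r" (rare), as in LVIS v1.0 annotations.
def split_lvis_categories(categories):
    """Treat frequent and common categories as base, rare categories as novel."""
    base = [c for c in categories if c["frequency"] in ("f", "c")]
    novel = [c for c in categories if c["frequency"] == "r"]
    return base, novel
```

On LVIS v1.0 this split yields the 337 rare categories as the novel set used only at test time.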
Hardware Specification | No | The paper does not specify the hardware used for running experiments (e.g., specific GPU/CPU models or memory details).
Software Dependencies | No | The paper mentions models like ResNet-50 with FPN and ViT-B/32 CLIP, and the AdamW optimizer, but does not provide specific version numbers for software libraries or dependencies (e.g., Python, PyTorch, or CUDA versions).
Experiment Setup | Yes | The hyperparameters λB, λN, and γ are set to 0.8, 0.8, 0.5 for OV-LVIS and 0.8, 0.5, 0.7 for OV-COCO based on the ablation results. We use the AdamW (Loshchilov and Hutter 2017) optimizer with an initial learning rate of 0.0002 to train our MLR module. For OV-LVIS, our MLR model is trained using the image-level labels of LVIS-Base for 90,000 iterations with a batch size of 64 (48 epochs). For OV-COCO, we train our model for 12 epochs for a fair comparison with previous OVOD models. We use images of size 480×480, augmented with random resized cropping and horizontal flipping during training.
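The quoted setup can be collected into a single configuration sketch. The key names below are assumptions chosen for readability; the values are taken directly from the quoted text (for OV-COCO, the 12-epoch schedule replaces the iteration count, and the shared settings are assumed to carry over).

```python
# Illustrative training configurations assembled from the quoted setup.
# Key names are assumptions; values come from the paper's reported settings.
OV_LVIS = {
    "lambda_B": 0.8, "lambda_N": 0.8, "gamma": 0.5,  # ablation-selected hyperparameters
    "optimizer": "AdamW", "lr": 2e-4,                # initial learning rate 0.0002
    "iterations": 90_000, "batch_size": 64,          # 48 epochs on LVIS-Base image-level labels
    "image_size": (480, 480),
    "augmentations": ["random_resized_crop", "horizontal_flip"],
}

# OV-COCO overrides: lambda_N=0.5, gamma=0.7, and a 12-epoch schedule.
OV_COCO = {**OV_LVIS, "lambda_N": 0.5, "gamma": 0.7, "epochs": 12}
OV_COCO.pop("iterations")  # schedule is given in epochs, not iterations
```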