Multi-Modal Classifiers for Open-Vocabulary Object Detection

Authors: Prannay Kaul, Weidi Xie, Andrew Zisserman

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | When evaluating on the challenging LVIS open-vocabulary benchmark we demonstrate that: (i) our text-based classifiers outperform all previous OVOD works; (ii) our vision-based classifiers perform as well as text-based classifiers in prior work; (iii) using multi-modal classifiers performs better than either modality alone; and finally, (iv) our text-based and multi-modal classifiers yield better performance than a fully-supervised detector.
Researcher Affiliation | Academia | Visual Geometry Group, University of Oxford; CMIC, Shanghai Jiao Tong University; Shanghai AI Lab.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | https://www.robots.ox.ac.uk/vgg/research/mm-ovod/
Open Datasets | Yes | In this work, most experiments are based on the LVIS object detection dataset (Gupta et al., 2019), containing a large vocabulary and a long-tailed distribution of object instances.
Dataset Splits | Yes | For evaluation, previous work evaluates OVOD models on the LVIS validation set (LVIS-val) for all categories, treating rare classes as novel categories, as it is guaranteed that no ground-truth box annotations are provided for them at the training stage. (A hedged sketch of this split follows the table.)
Hardware Specification | Yes | We conduct our experiments on 4 × 32 GB V100 GPUs.
Software Dependencies | No | The paper mentions specific models like GPT-3 Da Vinci-002 and CLIP, but does not provide version numbers for general software dependencies (e.g., PyTorch, Python).
Experiment Setup | Yes | The training recipe is the same as Detic for fair comparison, using Federated Loss (Zhou et al., 2021) and repeat factor sampling (Gupta et al., 2019). While training our OVOD model on detection data only, D_DET, we use a 4× schedule (~58 LVIS-base epochs or 90k iterations with a batch size of 64). When using additional image-labelled data (IN-L), we train jointly on D_DET ∪ D_IMG using a 4× schedule (90k iterations) with a sampling ratio of 1:4 and batch sizes of 64 and 256, respectively. This results in ~15 IN-L epochs and an additional ~11 LVIS-base epochs. For mini-batches containing images from D_DET and D_IMG we use input resolutions of 640² and 320², respectively. We conduct our experiments on 4 × 32 GB V100 GPUs.
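
The Experiment Setup row reads like a Detectron2/Detic-style training configuration. The sketch below collects those hyperparameters into a plain Python structure for readability; the field names, the assumed LVIS-base image count, and the layout are illustrative assumptions, not the authors' released config.

```python
# Hypothetical summary of the hyperparameters quoted in the "Experiment Setup"
# row. Field names are illustrative and do not mirror the authors' config files.

DETECTION_ONLY = {                 # training on detection data (D_DET) only
    "schedule": "4x",              # 90k iterations, batch size 64
    "max_iter": 90_000,
    "batch_size": 64,
    "input_resolution": (640, 640),
    "loss": "federated_loss",              # Zhou et al., 2021
    "sampler": "repeat_factor_sampling",   # Gupta et al., 2019
}

WITH_IMAGE_LABELS = {              # joint training on D_DET and D_IMG (IN-L)
    "schedule": "4x",
    "max_iter": 90_000,
    "sampling_ratio": (1, 4),              # D_DET : D_IMG mini-batches
    "batch_size": {"det": 64, "img": 256},
    "input_resolution": {"det": (640, 640), "img": (320, 320)},
}

HARDWARE = {"num_gpus": 4, "gpu": "V100", "gpu_memory_gb": 32}

if __name__ == "__main__":
    # Rough consistency check against the quoted "~58 LVIS-base epochs",
    # assuming LVIS-base has roughly 100k training images.
    lvis_base_images = 100_000
    images_seen = DETECTION_ONLY["max_iter"] * DETECTION_ONLY["batch_size"]
    print(f"~{images_seen / lvis_base_images:.0f} LVIS-base epochs")  # ~58
```

The epoch check only verifies that the quoted numbers are mutually consistent; it does not reproduce the training run.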
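
The Dataset Splits row follows the standard open-vocabulary LVIS protocol, in which the "rare" frequency bucket serves as the novel categories. Below is a minimal sketch of deriving that split, assuming the LVIS v1 annotation format where each category entry carries a 'frequency' field ('r', 'c', or 'f'); the file path is a placeholder.

```python
import json

# Placeholder path to LVIS v1 annotations (category entries carry a
# 'frequency' field in the official release).
ANNOTATION_FILE = "lvis_v1_val.json"

with open(ANNOTATION_FILE) as f:
    lvis = json.load(f)

# 'r' = rare (treated as novel), 'c' = common, 'f' = frequent (base classes).
novel_ids = {c["id"] for c in lvis["categories"] if c["frequency"] == "r"}
base_ids = {c["id"] for c in lvis["categories"] if c["frequency"] in ("c", "f")}

print(f"{len(base_ids)} base (common + frequent) categories")
print(f"{len(novel_ids)} novel (rare) categories")
```

Because box annotations for the rare categories are withheld when building LVIS-base, evaluating them on LVIS-val measures genuinely open-vocabulary performance.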