Scaling Open-Vocabulary Object Detection

Authors: Matthias Minderer, Alexey Gritsenko, Neil Houlsby

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | OWLv2 surpasses the performance of previous state-of-the-art open-vocabulary detectors already at comparable training scales (~10M examples). However, with OWL-ST, we can scale to over 1B examples, yielding a further large improvement: with a ViT-L/14 architecture, OWL-ST improves AP on LVIS rare classes, for which the model has seen no human box annotations, from 31.2% to 44.6% (43% relative improvement). We compare our best models to the literature in Table 1.
Researcher Affiliation | Industry | Matthias Minderer, Alexey Gritsenko, Neil Houlsby; Google DeepMind; {mjlm, agritsenko, neilhoulsby}@google.com
Pseudocode | Yes | The human-curated label space was obtained by merging common dataset class lists with Python code given in the paper. The machine-generated label space was obtained from the image-associated text, for each image separately, also using Python code given in the paper. (A hedged sketch of both procedures follows after the table.)
Open Source Code | Yes | Code and checkpoints are available on GitHub: https://github.com/google-research/scenic/tree/main/scenic/projects/owl_vit
Open Datasets | Yes | We use the WebLI dataset [4] as the source of weak supervision for self-training. WebLI is a large dataset of images and texts available on the public Web. Evaluation & fine-tuning datasets: open-vocabulary object detection performance is evaluated using the LVIS [10] and ODinW13 [21] datasets.
Dataset Splits | Yes | Open-vocabulary object detection performance is evaluated using the LVIS [10] and ODinW13 [21] datasets. As indicated in Table 1, some models are fine-tuned on the base annotations of LVIS, i.e. only annotations for frequent and common object categories as defined in the official annotations [10]. None of our models have seen any human annotations for LVIS rare categories, such that LVIS mAP_rare measures zero-shot performance. The tables report 'LVIS AP^val_all' and 'LVIS AP^val_rare'. (The base/rare split is sketched after the table.)
Hardware Specification | Yes | Hardware: TPU [13] v2 or v3 (for B- and L-sized models) or v4 (for G-sized models).
Software Dependencies | No | Software: JAX [3], Flax [11], Scenic [7]. No version numbers are specified for these software dependencies. (A version-recording snippet follows after the table.)
Experiment Setup | Yes | We use the following hyperparameters for all of our models; hyperparameters that vary between models are listed in Table A3. Optimizer: Adafactor variant as in [42]. Learning rate schedule: inverse square-root [36] with timescale 10,000 steps. Learning rate for the text encoder: 2 × 10^-6. Token dropping rate during training: 0.5. Pseudo-annotation confidence score threshold: 0.3. (The learning-rate schedule is sketched after the table.)
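The label-space code referenced in the Pseudocode row is not reproduced in this summary. The following is a minimal sketch of what the two procedures could look like, assuming a simple string-normalized union for the human-curated list and word n-grams for the per-image machine-generated list; the function names, normalization, and n-gram range are illustrative assumptions, not the authors' released code.

```python
# Hedged sketch of the two label-space constructions described in the paper.
# All helper names and heuristics here are illustrative assumptions.
from typing import Iterable, List, Set


def merge_class_lists(dataset_class_lists: Iterable[Iterable[str]]) -> List[str]:
    """Human-curated label space: union of class names from common detection datasets."""
    merged: Set[str] = set()
    for class_list in dataset_class_lists:
        for name in class_list:
            merged.add(name.strip().lower())
    return sorted(merged)


def ngram_label_space(caption: str, max_n: int = 3) -> List[str]:
    """Machine-generated label space: word n-grams from one image's associated text."""
    words = caption.lower().split()
    ngrams: Set[str] = set()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            ngrams.add(" ".join(words[i : i + n]))
    return sorted(ngrams)


# Example with a hypothetical caption:
queries = ngram_label_space("a brown dog catching a frisbee in the park")
```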
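For the base/rare split in the Dataset Splits row, the official LVIS v1 annotations tag each category with a `frequency` field ("f", "c", or "r"), so the split can be derived directly from the annotation file. A minimal sketch, assuming a local copy of the validation annotations (the file path is a placeholder):

```python
# Hedged sketch: separate LVIS categories into "base" (frequent + common)
# and "rare" using the `frequency` field of the official LVIS v1 annotations.
import json

with open("lvis_v1_val.json") as f:  # placeholder path to the LVIS v1 annotation file
    lvis = json.load(f)

base_classes = [c["name"] for c in lvis["categories"] if c["frequency"] in ("f", "c")]
rare_classes = [c["name"] for c in lvis["categories"] if c["frequency"] == "r"]

# Fine-tuning only on `base_classes` keeps AP on `rare_classes` a zero-shot metric.
print(len(base_classes), len(rare_classes))  # LVIS v1 has 1203 categories in total.
```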
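Since no version numbers are given for JAX, Flax, or Scenic, a reproduction attempt should at least record the versions it ends up using. A small snippet for that; the distribution name "scenic" is an assumption, since the project may instead be installed from the GitHub repository linked above:

```python
# Record the installed versions of the listed dependencies, if present.
from importlib import metadata

for pkg in ("jax", "flax", "scenic"):  # "scenic" as a pip distribution name is an assumption
    try:
        print(pkg, metadata.version(pkg))
    except metadata.PackageNotFoundError:
        print(pkg, "not installed as a packaged distribution")
```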
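The "inverse square-root with timescale" schedule in the Experiment Setup row is commonly parameterized as constant at the base rate until the timescale step and then decaying proportionally to 1/sqrt(step). The sketch below assumes that parameterization (warmup handling omitted) and uses the text-encoder rate of 2 × 10^-6 as the example base rate; it is an illustrative reading of the listed hyperparameters, not the authors' exact implementation.

```python
# Hedged sketch of an inverse square-root schedule with a 10,000-step timescale.
import jax.numpy as jnp


def inverse_sqrt_schedule(step, base_lr=2e-6, timescale=10_000):
    """Constant at base_lr until `timescale`, then decays as 1/sqrt(step)."""
    return base_lr * jnp.sqrt(timescale / jnp.maximum(step, timescale))


# Example: learning rate at a few training steps.
for s in (1_000, 10_000, 40_000):
    print(s, float(inverse_sqrt_schedule(s)))
```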