Scaling Open-Vocabulary Object Detection

Authors: Matthias Minderer, Alexey Gritsenko, Neil Houlsby

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | OWLv2 surpasses the performance of previous state-of-the-art open-vocabulary detectors already at comparable training scales (~10M examples). However, with OWL-ST, we can scale to over 1B examples, yielding a further large improvement: with a ViT-L/14 architecture, OWL-ST improves AP on LVIS rare classes, for which the model has seen no human box annotations, from 31.2% to 44.6% (43% relative improvement). We compare our best models to the literature in Table 1.
Researcher Affiliation | Industry | Matthias Minderer, Alexey Gritsenko, Neil Houlsby; Google DeepMind; {mjlm, agritsenko, neilhoulsby}@google.com
Pseudocode | Yes | The human-curated label space was obtained by merging common dataset class lists with Python code given in the paper. The machine-generated label space was obtained from the image-associated text, for each image separately, also using Python code given in the paper. (A hedged sketch of both procedures follows after the table.)
Open Source Code | Yes | Code and checkpoints are available on GitHub: https://github.com/google-research/scenic/tree/main/scenic/projects/owl_vit
Open Datasets | Yes | We use the WebLI dataset [4] as the source of weak supervision for self-training. WebLI is a large dataset of images and texts available on the public Web. Evaluation & fine-tuning datasets: open-vocabulary object detection performance is evaluated using the LVIS [10] and ODinW13 [21] datasets.
Dataset Splits | Yes | Open-vocabulary object detection performance is evaluated using the LVIS [10] and ODinW13 [21] datasets. As indicated in Table 1, some models are fine-tuned on the base annotations of LVIS, i.e. only annotations for frequent and common object categories as defined in the official annotations [10]. None of our models have seen any human annotations for LVIS rare categories, such that LVIS mAP_rare measures zero-shot performance. The tables report 'LVIS AP^val_all' and 'LVIS AP^val_rare'. (The base/rare split is sketched after the table.)
Hardware Specification | Yes | Hardware: TPU [13] v2 or v3 (for B- and L-sized models) or v4 (for G-sized models).
Software Dependencies | No | Software: JAX [3], Flax [11], Scenic [7]. No version numbers are specified for these software dependencies. (A version-recording snippet follows after the table.)
Experiment Setup | Yes | We use the following hyperparameters for all of our models; hyperparameters that vary between models are listed in Table A3. Optimizer: Adafactor variant as in [42]. Learning rate schedule: inverse square-root [36] with timescale 10,000 steps. Learning rate for the text encoder: 2 × 10^-6. Token dropping rate during training: 0.5. Pseudo-annotation confidence score threshold: 0.3. (The learning-rate schedule is sketched after the table.)
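The label-space code referenced in the Pseudocode row is not reproduced in this summary. The following is a minimal sketch of what the two procedures could look like, assuming a simple string-normalized union for the human-curated list and word n-grams for the per-image machine-generated list; the function names, normalization, and n-gram range are illustrative assumptions, not the authors' released code.

```python
# Hedged sketch of the two label-space constructions described in the paper.
# All helper names and heuristics here are illustrative assumptions.
from typing import Iterable, List, Set


def merge_class_lists(dataset_class_lists: Iterable[Iterable[str]]) -> List[str]:
    """Human-curated label space: union of class names from common detection datasets."""
    merged: Set[str] = set()
    for class_list in dataset_class_lists:
        for name in class_list:
            merged.add(name.strip().lower())
    return sorted(merged)


def ngram_label_space(caption: str, max_n: int = 3) -> List[str]:
    """Machine-generated label space: word n-grams from one image's associated text."""
    words = caption.lower().split()
    ngrams: Set[str] = set()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            ngrams.add(" ".join(words[i : i + n]))
    return sorted(ngrams)


# Example with a hypothetical caption:
queries = ngram_label_space("a brown dog catching a frisbee in the park")
```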
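For the base/rare split in the Dataset Splits row, the official LVIS v1 annotations tag each category with a `frequency` field ("f", "c", or "r"), so the split can be derived directly from the annotation file. A minimal sketch, assuming a local copy of the validation annotations (the file path is a placeholder):

```python
# Hedged sketch: separate LVIS categories into "base" (frequent + common)
# and "rare" using the `frequency` field of the official LVIS v1 annotations.
import json

with open("lvis_v1_val.json") as f:  # placeholder path to the LVIS v1 annotation file
    lvis = json.load(f)

base_classes = [c["name"] for c in lvis["categories"] if c["frequency"] in ("f", "c")]
rare_classes = [c["name"] for c in lvis["categories"] if c["frequency"] == "r"]

# Fine-tuning only on `base_classes` keeps AP on `rare_classes` a zero-shot metric.
print(len(base_classes), len(rare_classes))  # LVIS v1 has 1203 categories in total.
```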
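Since no version numbers are given for JAX, Flax, or Scenic, a reproduction attempt should at least record the versions it ends up using. A small snippet for that; the distribution name "scenic" is an assumption, since the project may instead be installed from the GitHub repository linked above:

```python
# Record the installed versions of the listed dependencies, if present.
from importlib import metadata

for pkg in ("jax", "flax", "scenic"):  # "scenic" as a pip distribution name is an assumption
    try:
        print(pkg, metadata.version(pkg))
    except metadata.PackageNotFoundError:
        print(pkg, "not installed as a packaged distribution")
```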
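The "inverse square-root with timescale" schedule in the Experiment Setup row is commonly parameterized as constant at the base rate until the timescale step and then decaying proportionally to 1/sqrt(step). The sketch below assumes that parameterization (warmup handling omitted) and uses the text-encoder rate of 2 × 10^-6 as the example base rate; it is an illustrative reading of the listed hyperparameters, not the authors' exact implementation.

```python
# Hedged sketch of an inverse square-root schedule with a 10,000-step timescale.
import jax.numpy as jnp


def inverse_sqrt_schedule(step, base_lr=2e-6, timescale=10_000):
    """Constant at base_lr until `timescale`, then decays as 1/sqrt(step)."""
    return base_lr * jnp.sqrt(timescale / jnp.maximum(step, timescale))


# Example: learning rate at a few training steps.
for s in (1_000, 10_000, 40_000):
    print(s, float(inverse_sqrt_schedule(s)))
```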