Scaling Open-Vocabulary Object Detection
Authors: Matthias Minderer, Alexey Gritsenko, Neil Houlsby
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | OWLv2 surpasses the performance of previous state-of-the-art open-vocabulary detectors already at comparable training scales (~10M examples). However, with OWL-ST, we can scale to over 1B examples, yielding a further large improvement: With a ViT-L/14 architecture, OWL-ST improves AP on LVIS rare classes, for which the model has seen no human box annotations, from 31.2% to 44.6% (43% relative improvement). We compare our best models to the literature in Table 1. |
| Researcher Affiliation | Industry | Matthias Minderer, Alexey Gritsenko, Neil Houlsby, Google DeepMind, {mjlm, agritsenko, neilhoulsby}@google.com |
| Pseudocode | Yes | The human-curated label space was obtained by merging common dataset class lists with the Python code below. The machine-generated label space was obtained from the image-associated text, for each image separately, using the Python code below. (A hedged sketch of this per-image label-space generation appears after the table.) |
| Open Source Code | Yes | Code and checkpoints are available on GitHub: https://github.com/google-research/scenic/tree/main/scenic/projects/owl_vit |
| Open Datasets | Yes | We use the WebLI dataset [4] as the source of weak supervision for self-training. WebLI is a large dataset of images and texts available on the public Web. Evaluation & Fine-tuning Dataset: Open-vocabulary object detection performance is evaluated using the LVIS [10] and ODinW13 [21] datasets. |
| Dataset Splits | Yes | Evaluation & Fine-tuning Dataset: Open-vocabulary object detection performance is evaluated using the LVIS [10] and ODinW13 [21] datasets. As indicated in Table 1, some models are fine-tuned on the base annotations of LVIS, i.e. only annotations for frequent and common object categories as defined in the official annotations [10]. None of our models have seen any human annotations for LVIS rare categories, such that LVIS mAP_rare measures zero-shot performance. The tables report 'LVIS APval (all)' and 'LVIS APval (rare)'. |
| Hardware Specification | Yes | Hardware: TPU [13] v2 or v3 (for B- and L-sized models) or v4 (for G-sized models). |
| Software Dependencies | No | Software: JAX [3], Flax [11], Scenic [7]. No version numbers are specified for these software dependencies. |
| Experiment Setup | Yes | We use the following hyperparameters for all of our models. Hyperparameters that vary between models are listed in Table A3. Optimizer: Adafactor variant as in [42]; Learning rate schedule: inverse square-root [36] with timescale 10,000 steps; Learning rate for the text encoder: 2 × 10^-6; Token dropping rate during training: 0.5; Pseudo-annotation confidence score threshold: 0.3. (Sketches of the label-space generation, pseudo-annotation thresholding, and learning-rate schedule follow the table.) |
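
The Pseudocode row refers to Python code in the paper for building the label spaces used in self-training, and the Experiment Setup row gives a 0.3 confidence threshold for keeping pseudo-annotations. The sketch below illustrates one way these two steps could look: per-image queries built from word n-grams of the image-associated text, followed by score thresholding of detector output. The n-gram length cap, tokenization, and function names are assumptions for illustration, not the paper's exact code.

```python
import re


def machine_generated_label_space(image_text, max_ngram_len=10):
    """Builds a per-image query list from the image-associated text.

    The paper derives the machine-generated label space from each image's
    associated web text; the tokenization and n-gram length cap here are
    illustrative assumptions.
    """
    words = re.findall(r"[a-z]+", image_text.lower())
    ngrams = set()
    for n in range(1, max_ngram_len + 1):
        for i in range(len(words) - n + 1):
            ngrams.add(" ".join(words[i:i + n]))
    return sorted(ngrams)


def filter_pseudo_annotations(detections, threshold=0.3):
    """Keeps only pseudo-annotations whose detection score clears the
    confidence threshold (0.3 in the Experiment Setup row)."""
    return [d for d in detections if d["score"] >= threshold]


# Example: queries for one web image, then thresholding detector output.
queries = machine_generated_label_space("A tabby cat sleeping on a red sofa")
detections = [
    {"box": (0.1, 0.2, 0.5, 0.6), "score": 0.42, "query": "tabby cat"},
    {"box": (0.0, 0.0, 1.0, 1.0), "score": 0.12, "query": "red sofa"},
]
kept = filter_pseudo_annotations(detections)  # keeps only the 0.42 detection
```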
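
The Experiment Setup row also cites an inverse square-root learning-rate schedule [36] with a 10,000-step timescale. Below is a minimal sketch of one common parameterization of that schedule (constant up to the timescale, then decaying as 1/sqrt(step)); the exact form used by the authors, and the use of the 2 × 10^-6 text-encoder rate as the peak value, are assumptions.

```python
import jax.numpy as jnp


def inverse_sqrt_schedule(step, base_lr, timescale=10_000):
    """Inverse square-root decay with a fixed timescale.

    Holds the rate at base_lr for the first `timescale` steps, then decays
    proportionally to 1/sqrt(step). One common parameterization; the paper's
    exact implementation may differ.
    """
    step = jnp.maximum(step, timescale)
    return base_lr * jnp.sqrt(timescale / step)


# Example: text-encoder learning rate of 2e-6 at a few training steps.
for s in [0, 10_000, 40_000, 90_000]:
    print(s, float(inverse_sqrt_schedule(s, base_lr=2e-6)))
```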