DesCo: Learning Object Recognition with Rich Language Descriptions

Authors: Liunian Li, Zi-Yi Dou, Nanyun Peng, Kai-Wei Chang

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On two novel object detection benchmarks, LVIS and OmniLabel, under the zero-shot detection setting, our approach achieves 34.8 APr minival (+9.1) and 29.3 AP (+3.6), respectively, surpassing the prior state-of-the-art models, GLIP and FIBER, by a large margin.
Researcher Affiliation | Academia | Liunian Harold Li, Zi-Yi Dou, Nanyun Peng, Kai-Wei Chang (University of California, Los Angeles) {liunian.harold.li,zdou,violetpeng,kwchang}@cs.ucla.edu
Pseudocode | Yes | Figure 3: Algorithms for generating queries from detection data and grounding data. Algorithm 1: B ∈ ℝ^{N×4} are the bounding boxes of an image; E are M positive objects (entities) that appear in the image; V are the descriptions of all candidate objects in a pre-defined vocabulary; T ∈ {0, 1}^{N×M} denotes the gold alignment between boxes and entities. We first prompt the LLM to generate descriptions for the positive entities and propose confusable entities and descriptions; the prompt is included in the appendix (Line 3). A hedged sketch of this query-generation step appears after the table.
Open Source Code | Yes | Code is available at https://github.com/liunian-harold-li/DesCo.
Open Datasets | Yes | Following GLIP [25], we train the models on 1) O365 (Objects365 [39]), consisting of 0.66M images and 365 categories; 2) GoldG, which is curated by MDETR [17] and contains 0.8M human-annotated images sourced from Flickr30k [35], Visual Genome [19], and GQA [15]; 3) CC3M [40]: the web-scraped Conceptual Captions dataset with the same pseudo-boxes used by GLIP.
Dataset Splits | No | The paper mentions using LVIS MiniVal for evaluation, but it does not specify training/validation splits for the primary datasets (O365, GoldG, CC3M) used for fine-tuning their models.
Hardware Specification | Yes | Experiments can be replicated with 8 GPUs, each with 32 GB of memory.
Software Dependencies | No | The paper mentions using Swin Transformer, BERT, and RoBERTa as backbones, but it does not provide specific version numbers for any software dependencies like programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow, CUDA versions).
Experiment Setup | Yes | For DESCO-GLIP, we fine-tune with a batch size of 16 and a learning rate of 5 × 10^-5 for 300K steps; for DESCO-FIBER, we fine-tune with a batch size of 8 and a learning rate of 1 × 10^-5 for 200K steps. A hedged config sketch of these schedules appears below.
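
The query-generation procedure quoted in the Pseudocode row can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: describe_with_llm and propose_confusable are hypothetical placeholders for the paper's LLM prompting (the real prompt is in the paper's appendix), and random sampling from the vocabulary stands in for the LLM-proposed confusable entities.

```python
import random

def describe_with_llm(entity: str) -> str:
    # Hypothetical placeholder for the LLM call that produces a rich
    # description of an entity (the paper's prompt is in its appendix).
    return f"{entity}, with its characteristic shape, color, and texture"

def propose_confusable(entity: str, vocabulary: list, k: int = 2) -> list:
    # Hypothetical placeholder: the paper asks the LLM to propose
    # confusable entities; here we simply sample negatives from the
    # pre-defined vocabulary V.
    negatives = [v for v in vocabulary if v != entity]
    return random.sample(negatives, min(k, len(negatives)))

def build_query(entities, vocabulary, gold_alignment):
    """entities: the M positive entity names E; vocabulary: candidate
    object names V; gold_alignment: N x M {0,1} matrix T (list of rows)
    aligning the N boxes to the M entities. Returns the language query
    and an expanded N x M' alignment target in which the inserted
    confusable (negative) descriptions align to no boxes. The boxes B
    themselves are untouched by query construction."""
    descriptions, columns = [], []
    for j, entity in enumerate(entities):
        descriptions.append(describe_with_llm(entity))
        columns.append([row[j] for row in gold_alignment])  # keep gold column
        for neg in propose_confusable(entity, vocabulary):
            descriptions.append(describe_with_llm(neg))
            columns.append([0] * len(gold_alignment))       # negatives: all zeros
    query = ". ".join(descriptions)
    target = [list(vals) for vals in zip(*columns)]         # transpose to N x M'
    return query, target

# Toy usage: 2 boxes, 1 positive entity ("zebra"), 2 sampled negatives.
query, target = build_query(
    entities=["zebra"],
    vocabulary=["zebra", "horse", "okapi"],
    gold_alignment=[[1], [0]],
)
```

The key property the sketch preserves is that every description inserted as a confusable negative contributes an all-zero column to the alignment target, so the model must rely on the description rather than the entity name alone.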
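
The hyperparameters quoted in the Experiment Setup row can be collected into a single configuration sketch. The dictionary below is illustrative; the key names are assumptions rather than identifiers from the released DesCo code, though the numeric values match the quote above.

```python
# Hedged summary of the reported fine-tuning schedules; key names are
# illustrative, not taken from the released codebase.
FINETUNE_CONFIGS = {
    "DESCO-GLIP":  {"batch_size": 16, "learning_rate": 5e-5, "max_steps": 300_000},
    "DESCO-FIBER": {"batch_size": 8,  "learning_rate": 1e-5, "max_steps": 200_000},
}
```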