DesCo: Learning Object Recognition with Rich Language Descriptions
Authors: Liunian Harold Li, Zi-Yi Dou, Nanyun Peng, Kai-Wei Chang
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On two novel object detection benchmarks, LVIS and OmniLabel, under the zero-shot detection setting, our approach achieves 34.8 APr minival (+9.1) and 29.3 AP (+3.6), respectively, surpassing the prior state-of-the-art models, GLIP and FIBER, by a large margin. |
| Researcher Affiliation | Academia | Liunian Harold Li, Zi-Yi Dou, Nanyun Peng, Kai-Wei Chang. University of California, Los Angeles. {liunian.harold.li,zdou,violetpeng,kwchang}@cs.ucla.edu |
| Pseudocode | Yes | Figure 3: Algorithms for generating queries from detection data and grounding data. Algorithm 1: B ∈ ℝ^{N×4} are the bounding boxes of an image; E are M positive objects (entities) that appear in the image; V are the descriptions of all candidate objects in a pre-defined vocabulary; T ∈ {0, 1}^{N×M} denotes the gold alignment between boxes and entities. We first prompt the LLM to generate descriptions for the positive entities and to propose confusable entities and descriptions; the prompt is included in the appendix (Line 3). |
| Open Source Code | Yes | Code is available at https://github.com/liunian-harold-li/DesCo. |
| Open Datasets | Yes | Following GLIP [25], we train the models on 1) O365 (Objects365 [39]), consisting of 0.66M images and 365 categories; 2) GoldG, which is curated by MDETR [17] and contains 0.8M human-annotated images sourced from Flickr30k [35], Visual Genome [19], and GQA [15]; 3) CC3M [40]: the web-scraped Conceptual Captions dataset with the same pseudo-boxes used by GLIP. |
| Dataset Splits | No | The paper mentions using LVIS MiniVal for evaluation, but it does not specify training/validation splits for the primary datasets (O365, GoldG, CC3M) used for fine-tuning their models. |
| Hardware Specification | Yes | Experiments can be replicated with 8 GPUs, each with 32 GB of memory. |
| Software Dependencies | No | The paper mentions using Swin Transformer, BERT, and RoBERTa as backbones, but it does not provide specific version numbers for any software dependencies like programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow, CUDA versions). |
| Experiment Setup | Yes | For DESCO-GLIP, we fine-tune with a batch size of 16 and a learning rate of 5×10⁻⁵ for 300K steps; for DESCO-FIBER, we fine-tune with a batch size of 8 and a learning rate of 1×10⁻⁵ for 200K steps. |
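The query-generation procedure quoted in the Pseudocode row (Figure 3, Algorithm 1) can be sketched as follows. This is a hedged illustration, not the paper's implementation: the helper names `describe`, `confusables`, and `build_query` are hypothetical, and the two LLM calls the paper actually makes (whose prompt is in the paper's appendix) are replaced here with canned stubs so the sketch is self-contained.

```python
def describe(entity: str) -> str:
    # Stub for the LLM call that generates a rich description of an entity.
    # In the paper this is produced by prompting an LLM (prompt in appendix).
    canned = {
        "flamingo": "a pink wading bird with long thin legs",
        "stork": "a large white bird with a long straight bill",
    }
    return canned.get(entity, f"a {entity}")

def confusables(entity: str) -> list[str]:
    # Stub for the LLM call that proposes confusable (negative) entities,
    # i.e. hard negatives that look similar to the positive entity.
    return {"flamingo": ["stork"], "stork": ["flamingo"]}.get(entity, [])

def build_query(positive_entities: list[str], num_negatives: int = 1) -> str:
    """Assemble a description-rich text query: each positive entity from the
    image, plus a few confusable negatives, all paired with descriptions."""
    entities = list(positive_entities)
    for e in positive_entities:
        entities.extend(confusables(e)[:num_negatives])
    # Concatenate "entity, which is <description>" phrases into one query.
    return ". ".join(f"{e}, which is {describe(e)}" for e in entities)

query = build_query(["flamingo"])
```

The detector is then trained to align boxes with the positive entity spans in the query while rejecting the confusable negatives, per the gold alignment matrix T described above.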