Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Learning Object-Language Alignments for Open-Vocabulary Object Detection
Authors: Chuang Lin, Peize Sun, Yi Jiang, Ping Luo, Lizhen Qu, Gholamreza Haffari, Zehuan Yuan, Jianfei Cai
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on two benchmark datasets, COCO and LVIS, demonstrate our superior performance over the competing approaches on novel categories, e.g. achieving 32.0% mAP on COCO and 21.7% mask mAP on LVIS. |
| Researcher Affiliation | Collaboration | 1 Monash University; 2 ByteDance; 3 The University of Hong Kong |
| Pseudocode | No | The paper describes the approach using textual explanations and mathematical formulations but does not include a distinct pseudocode or algorithm block. |
| Open Source Code | Yes | Code is available at: https://github.com/clin1223/VLDet. |
| Open Datasets | Yes | COCO and COCO Caption. Following open-vocabulary COCO setting (OV-COCO) (Zareian et al., 2021), the COCO-2017 dataset is manually divided into 48 base classes and 17 novel classes, which are proposed by the zero-shot object detection (Bansal et al., 2018). ... For images-text pairs data, we use COCO Caption (Chen et al., 2015) training set, which contains 5 human-generated captions for each image. |
| Dataset Splits | Yes | We keep 107,761 images with base class annotations as the training set and 4,836 images with base and novel class annotations as the validation set. |
| Hardware Specification | Yes | All the experiments are conducted on 8 NVIDIA V100 GPUs. |
| Software Dependencies | No | The paper mentions software components like Faster R-CNN, CLIP, and CenterNet2, but does not provide specific version numbers for these or other relevant software dependencies (e.g., programming language versions, specific library versions). |
| Experiment Setup | Yes | In each mini-batch, the ratio of base-class detection data and image-text pair data is 1:4. For the warmup, we increase the learning rate from 0 to 0.002 for the first 1000 iterations. The model is trained for 90,000 iterations using SGD optimizer with batch size 8 and the learning rate is scaled down by a factor of 10 at 60,000 and 80,000 iterations. |
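
The learning-rate schedule quoted in the Experiment Setup row can be sketched as a small function. This is an illustration only, not the authors' code: the function name `learning_rate` and its parameters are hypothetical, and the linear shape of the warmup is an assumption, since the quoted text states only that the rate rises from 0 to 0.002 over the first 1000 iterations.

```python
def learning_rate(iteration, base_lr=0.002, warmup_iters=1000,
                  milestones=(60_000, 80_000), gamma=0.1):
    """Return the learning rate at a given training iteration.

    Values follow the quoted setup: peak rate 0.002, 1000 warmup
    iterations, and a 10x reduction at iterations 60,000 and 80,000.
    The linear warmup shape is an assumption, not stated in the source.
    """
    if iteration < warmup_iters:
        # Assumed linear ramp from 0 up to base_lr.
        return base_lr * iteration / warmup_iters
    # Count how many decay milestones have been reached so far.
    drops = sum(iteration >= m for m in milestones)
    return base_lr * gamma ** drops


if __name__ == "__main__":
    # Print the rate at a few points across the 90,000-iteration run.
    for it in (0, 500, 1000, 59_999, 60_000, 80_000, 89_999):
        print(it, learning_rate(it))
```

Under these assumptions the rate stays at 0.002 from iteration 1000 until the first milestone, then falls by a factor of 10 at each of the two milestones.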