Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Open-vocabulary Object Detection via Vision and Language Knowledge Distillation
Authors: Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, Yin Cui
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We benchmark on LVIS by holding out all rare categories as novel categories that are not seen during training. ViLD obtains 16.1 mask $\text{AP}_r$ with a ResNet-50 backbone, even outperforming the supervised counterpart by 3.8. |
| Researcher Affiliation | Industry | Xiuye Gu¹, Tsung-Yi Lin², Weicheng Kuo¹, Yin Cui¹; ¹Google Research, ²Nvidia |
| Pseudocode | No | The paper describes its method in detail with text and mathematical equations (e.g., $\mathcal{L}_{\text{ViLD-text}}$), but it does not include a clearly labeled "Pseudocode" or "Algorithm" block. |
| Open Source Code | Yes | Code and demo are open-sourced at https://github.com/tensorflow/tpu/tree/master/models/official/detection/projects/vild. |
| Open Datasets | Yes | We mainly evaluate on LVIS (Gupta et al., 2019) with our new setting. ... COCO: Bansal et al. (2018) divide COCO-2017 (Lin et al., 2014)... PASCAL VOC (Everingham et al., 2010), COCO (Lin et al., 2014), and Objects365 (Shao et al., 2019). |
| Dataset Splits | Yes | LVIS: We benchmark on LVIS v1. ... We take its 866 frequent and common categories as the base categories $C_B$, and hold out the 337 rare categories as the novel categories $C_N$. |
| Hardware Specification | No | The paper mentions using different model backbones (e.g., ResNet-50, EfficientNet-b7) and refers to 'TPU' in the GitHub link, but it does not specify the underlying hardware (e.g., specific GPU models, CPU types, or TPU versions/configurations) used for running the experiments. |
| Software Dependencies | No | The paper refers to various models and frameworks like Mask R-CNN, CLIP, and ALIGN, but it does not specify software dependencies with version numbers (e.g., Python 3.x, TensorFlow 2.x, PyTorch 1.x). |
| Experiment Setup | Yes | The models use 1024×1024 as input image size, large-scale jittering augmentation of range [0.1, 2.0], synchronized batch normalization (Ioffe & Szegedy, 2015; Girshick et al., 2018) of batch size 256, weight decay of 4e-5, and an initial learning rate of 0.32. We train the model from scratch for 180,000 iterations, and divide the learning rate by 10 at 0.9×, 0.95×, and 0.975× of total iterations. The temperature τ is set to 0.01, and the maximum number of detections per image is 300. |
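
The step learning-rate schedule quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustration built only from the numbers in that row, not the authors' training code: the function names are hypothetical, warmup (if any) is ignored, and applying each decay at the first step greater than or equal to the boundary is an assumption.

```python
# Values quoted in the Experiment Setup row.
TOTAL_ITERATIONS = 180_000
INITIAL_LR = 0.32
DECAY_FRACTIONS = (0.9, 0.95, 0.975)  # fractions of total iterations
DECAY_FACTOR = 10.0                   # learning rate is divided by 10 at each boundary


def decay_boundaries(total=TOTAL_ITERATIONS, fractions=DECAY_FRACTIONS):
    """Return the iteration indices at which the learning rate drops."""
    return [round(fraction * total) for fraction in fractions]


def learning_rate(step, initial=INITIAL_LR, factor=DECAY_FACTOR):
    """Piecewise-constant learning rate at a given training step.

    Assumption: the decay takes effect at step >= boundary; the source
    does not state whether the boundary step itself is included.
    """
    lr = initial
    for boundary in decay_boundaries():
        if step >= boundary:
            lr /= factor
    return lr


if __name__ == "__main__":
    print("decay boundaries:", decay_boundaries())
    for step in (0, 161_999, 162_000, 171_000, 175_500, 179_999):
        print(f"step {step:>7}: lr = {learning_rate(step):.6f}")
```

Under these assumptions the boundaries fall at iterations 162,000, 171,000, and 175,500, so the learning rate stays at 0.32 for the first 90% of training and ends three orders of magnitude lower.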