Open-vocabulary Object Detection via Vision and Language Knowledge Distillation

Authors: Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, Yin Cui

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We benchmark on LVIS by holding out all rare categories as novel categories that are not seen during training. ViLD obtains 16.1 mask APr with a ResNet-50 backbone, even outperforming the supervised counterpart by 3.8.
Researcher Affiliation | Industry | Xiuye Gu (1), Tsung-Yi Lin (2), Weicheng Kuo (1), Yin Cui (1); (1) Google Research, (2) Nvidia. {xiuyegu, weicheng, yincui}@google.com, tsungyil@nvidia.com
Pseudocode | No | The paper describes its method in detail with text and mathematical equations (e.g., the L_ViLD-text loss), but it does not include a clearly labeled "Pseudocode" or "Algorithm" block.
Open Source Code | Yes | Code and demo are open-sourced at https://github.com/tensorflow/tpu/tree/master/models/official/detection/projects/vild.
Open Datasets | Yes | We mainly evaluate on LVIS (Gupta et al., 2019) with our new setting. ... COCO: Bansal et al. (2018) divide COCO-2017 (Lin et al., 2014)... PASCAL VOC (Everingham et al., 2010), COCO (Lin et al., 2014), and Objects365 (Shao et al., 2019).
Dataset Splits | Yes | LVIS: We benchmark on LVIS v1. ... We take its 866 frequent and common categories as the base categories C_B, and hold out the 337 rare categories as the novel categories C_N.
Hardware Specification | No | The paper mentions using different model backbones (e.g., ResNet-50, EfficientNet-b7) and refers to 'TPU' in the GitHub link, but it does not specify the underlying hardware (e.g., specific GPU models, CPU types, or TPU versions/configurations) used for running the experiments.
Software Dependencies | No | The paper refers to various models and frameworks like Mask R-CNN, CLIP, and ALIGN, but it does not specify software dependencies with version numbers (e.g., Python 3.x, TensorFlow 2.x, PyTorch 1.x).
Experiment Setup | Yes | The models use 1024×1024 as input image size, large-scale jittering augmentation of range [0.1, 2.0], synchronized batch normalization (Ioffe & Szegedy, 2015; Girshick et al., 2018) of batch size 256, weight decay of 4e-5, and an initial learning rate of 0.32. We train the model from scratch for 180,000 iterations, and divide the learning rate by 10 at 0.9, 0.95, and 0.975 of total iterations. The temperature τ is set to 0.01, and the maximum number of detections per image is 300.
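The step learning-rate schedule quoted above (initial LR 0.32, divided by 10 at 0.9, 0.95, and 0.975 of the 180,000 total iterations) can be sketched as a small Python function. This is an illustrative reconstruction from the paper's description, not code from the released repository; the function name and signature are our own.

```python
def vild_learning_rate(step, total_steps=180_000, base_lr=0.32):
    """Stepwise LR schedule as described in the paper: the learning
    rate is divided by 10 each time training passes 0.9, 0.95, and
    0.975 of the total iterations (hypothetical reconstruction)."""
    frac = step / total_steps
    lr = base_lr
    for boundary in (0.9, 0.95, 0.975):
        if frac >= boundary:
            lr /= 10
    return lr
```

For example, the schedule keeps 0.32 for the first 162,000 steps, then drops to 0.032, 0.0032, and finally 0.00032 for the last 2.5% of training.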