Open-vocabulary Object Detection via Vision and Language Knowledge Distillation

Authors: Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, Yin Cui

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We benchmark on LVIS by holding out all rare categories as novel categories that are not seen during training. ViLD obtains 16.1 mask APr with a ResNet-50 backbone, even outperforming the supervised counterpart by 3.8.
Researcher Affiliation | Industry | Xiuye Gu (1), Tsung-Yi Lin (2), Weicheng Kuo (1), Yin Cui (1); (1) Google Research, (2) Nvidia. {xiuyegu, weicheng, yincui}@google.com, tsungyil@nvidia.com
Pseudocode | No | The paper describes its method in detail with text and mathematical equations (e.g., the L_ViLD-text loss), but it does not include a clearly labeled "Pseudocode" or "Algorithm" block.
Open Source Code | Yes | Code and demo are open-sourced at https://github.com/tensorflow/tpu/tree/master/models/official/detection/projects/vild.
Open Datasets | Yes | We mainly evaluate on LVIS (Gupta et al., 2019) with our new setting. ... COCO: Bansal et al. (2018) divide COCO-2017 (Lin et al., 2014)... PASCAL VOC (Everingham et al., 2010), COCO (Lin et al., 2014), and Objects365 (Shao et al., 2019).
Dataset Splits | Yes | LVIS: We benchmark on LVIS v1. ... We take its 866 frequent and common categories as the base categories C_B, and hold out the 337 rare categories as the novel categories C_N.
Hardware Specification | No | The paper mentions using different model backbones (e.g., ResNet-50, EfficientNet-b7) and refers to 'TPU' in the GitHub link, but it does not specify the underlying hardware (e.g., specific GPU models, CPU types, or TPU versions/configurations) used for running the experiments.
Software Dependencies | No | The paper refers to various models and frameworks like Mask R-CNN, CLIP, and ALIGN, but it does not specify software dependencies with version numbers (e.g., Python 3.x, TensorFlow 2.x, PyTorch 1.x).
Experiment Setup | Yes | The models use 1024×1024 as input image size, large-scale jittering augmentation of range [0.1, 2.0], synchronized batch normalization (Ioffe & Szegedy, 2015; Girshick et al., 2018) of batch size 256, weight decay of 4e-5, and an initial learning rate of 0.32. We train the model from scratch for 180,000 iterations, and divide the learning rate by 10 at 0.9, 0.95, and 0.975 of total iterations. The temperature τ is set to 0.01, and the maximum number of detections per image is 300.
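The step learning-rate schedule quoted above (initial LR 0.32, divided by 10 at 0.9, 0.95, and 0.975 of the 180,000 total iterations) can be sketched as a small Python function. This is an illustrative reconstruction from the paper's description, not code from the released repository; the function name and signature are our own.

```python
def vild_learning_rate(step, total_steps=180_000, base_lr=0.32):
    """Stepwise LR schedule as described in the paper: the learning
    rate is divided by 10 each time training passes 0.9, 0.95, and
    0.975 of the total iterations (hypothetical reconstruction)."""
    frac = step / total_steps
    lr = base_lr
    for boundary in (0.9, 0.95, 0.975):
        if frac >= boundary:
            lr /= 10
    return lr
```

For example, the schedule keeps 0.32 for the first 162,000 steps, then drops to 0.032, 0.0032, and finally 0.00032 for the last 2.5% of training.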