Open-vocabulary Object Detection via Vision and Language Knowledge Distillation
Authors: Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, Yin Cui
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We benchmark on LVIS by holding out all rare categories as novel categories that are not seen during training. ViLD obtains 16.1 mask APr with a ResNet-50 backbone, even outperforming the supervised counterpart by 3.8. |
| Researcher Affiliation | Industry | Xiuye Gu¹, Tsung-Yi Lin², Weicheng Kuo¹, Yin Cui¹ (¹Google Research, ²NVIDIA). {xiuyegu, weicheng, yincui}@google.com, tsungyil@nvidia.com |
| Pseudocode | No | The paper describes its method in detail with text and mathematical equations (e.g., the L_ViLD-text loss), but it does not include a clearly labeled "Pseudocode" or "Algorithm" block. A hedged sketch of the L_ViLD-text loss follows the table. |
| Open Source Code | Yes | Code and demo are open-sourced at https://github.com/tensorflow/tpu/tree/master/models/official/detection/projects/vild. |
| Open Datasets | Yes | We mainly evaluate on LVIS (Gupta et al., 2019) with our new setting. ... COCO: Bansal et al. (2018) divide COCO-2017 (Lin et al., 2014)... PASCAL VOC (Everingham et al., 2010), COCO (Lin et al., 2014), and Objects365 (Shao et al., 2019). |
| Dataset Splits | Yes | LVIS: We benchmark on LVIS v1. ... We take its 866 frequent and common categories as the base categories C_B, and hold out the 337 rare categories as the novel categories C_N. See the split sketch after the table. |
| Hardware Specification | No | The paper mentions using different model backbones (e.g., ResNet-50, EfficientNet-B7) and refers to 'TPU' in the GitHub link, but it does not specify the underlying hardware (e.g., specific GPU models, CPU types, or TPU versions/configurations) used for running the experiments. |
| Software Dependencies | No | The paper refers to various models such as Mask R-CNN, CLIP, and ALIGN, but it does not specify software dependencies with version numbers (e.g., Python 3.x, TensorFlow 2.x, PyTorch 1.x). |
| Experiment Setup | Yes | The models use 1024×1024 as input image size, large-scale jittering augmentation of range [0.1, 2.0], synchronized batch normalization (Ioffe & Szegedy, 2015; Girshick et al., 2018) of batch size 256, weight decay of 4e-5, and an initial learning rate of 0.32. We train the model from scratch for 180,000 iterations, and divide the learning rate by 10 at 0.9, 0.95, and 0.975 of total iterations. The temperature τ is set to 0.01, and the maximum number of detections per image is 300. See the config sketch after the table. |
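
Since the paper defines L_ViLD-text in equations rather than pseudocode, here is a minimal NumPy sketch of that objective as described: cosine similarities between region embeddings and the text embeddings of the base categories (plus a learned background embedding), scaled by the temperature τ and trained with softmax cross-entropy. Function and argument names (`vild_text_loss`, `region_embs`, `bg_emb`) are ours, not the authors'.

```python
import numpy as np

def vild_text_loss(region_embs, text_embs, bg_emb, labels, tau=0.01):
    """Sketch of the ViLD-text objective.

    region_embs: (N, D) region embeddings from the detector head.
    text_embs:   (|C_B|, D) text embeddings of the base categories.
    bg_emb:      (D,) learned background embedding.
    labels:      (N,) ints; 0 = background, i > 0 = base category i - 1.
    """
    def l2(x):
        # L2-normalize so dot products become cosine similarities.
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    r = l2(region_embs)                     # (N, D)
    t = l2(np.vstack([bg_emb, text_embs]))  # (1 + |C_B|, D)
    logits = r @ t.T / tau                  # temperature-scaled similarities

    # Numerically stable softmax cross-entropy.
    logits -= logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()
```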
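For the base/novel split quoted in the Dataset Splits row, the following sketch shows how one might derive it from an LVIS v1 annotation file, which tags each category with a frequency bucket ('f' frequent, 'c' common, 'r' rare). The file path and function name are illustrative assumptions, not the authors' code.

```python
import json

def split_lvis_categories(ann_path="lvis_v1_train.json"):
    """Base categories C_B = frequent + common (866 in LVIS v1);
    novel categories C_N = rare (337), held out during training."""
    with open(ann_path) as f:
        cats = json.load(f)["categories"]
    base = [c["id"] for c in cats if c["frequency"] in ("f", "c")]
    novel = [c["id"] for c in cats if c["frequency"] == "r"]
    return base, novel
```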
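The quoted experiment setup also translates naturally into a configuration sketch. The dictionary keys below are hypothetical and do not correspond to the authors' TensorFlow TPU config; only the values come from the paper. The schedule function shows the stated stepwise decay: divide the initial learning rate by 10 at 90%, 95%, and 97.5% of training.

```python
# Hypothetical config mirroring the quoted setup; key names are ours.
TRAIN_CONFIG = {
    "image_size": (1024, 1024),
    "large_scale_jitter_range": (0.1, 2.0),  # scale augmentation range
    "sync_batch_norm": True,
    "batch_size": 256,
    "weight_decay": 4e-5,
    "init_lr": 0.32,
    "total_iters": 180_000,
    "lr_decay_points": (0.9, 0.95, 0.975),   # fractions of total iterations
    "lr_decay_factor": 0.1,
    "temperature_tau": 0.01,
    "max_detections": 300,
}

def learning_rate(step, cfg=TRAIN_CONFIG):
    """Stepwise schedule per the quoted setup: each time training passes
    one of the decay points, the learning rate is divided by 10."""
    frac = step / cfg["total_iters"]
    drops = sum(frac >= p for p in cfg["lr_decay_points"])
    return cfg["init_lr"] * (cfg["lr_decay_factor"] ** drops)
```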