CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction
Authors: Size Wu, Wenwei Zhang, Lumin Xu, Sheng Jin, Xiangtai Li, Wentao Liu, Chen Change Loy
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The effectiveness of CLIPSelf is validated on open-vocabulary object detection and image segmentation benchmarks. For open-vocabulary object detection, we established a two-stage baseline based on frozen CLIP ViTs, and the fine-tuned models achieved new state-of-the-art performance on OV-COCO and OV-LVIS benchmarks, as well as on the transfer detection benchmark. |
| Researcher Affiliation | Collaboration | 1 S-Lab, Nanyang Technological University 2 The Chinese University of Hong Kong 3 The University of Hong Kong 4 SenseTime Research and Tetras.AI 5 Shanghai AI Laboratory |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Models and code are released at https://github.com/wusize/CLIPSelf. |
| Open Datasets | Yes | By default, we use the images in train2017 split of COCO dataset (Lin et al., 2014), which are exactly the training images of most downstream open-vocabulary benchmarks. ... For the OV-LVIS benchmark, we use the images from the train split of LVIS v1.0 (Gupta et al., 2019). |
| Dataset Splits | Yes | The mean accuracy (mAcc) of classifying region boxes annotated in COCO's val2017 split is used as the indicator for evaluation. |
| Hardware Specification | Yes | To train CLIPSelf, we use 8 A100 GPUs and set the batch size as 2 on each GPU. |
| Software Dependencies | No | The paper mentions using the AdamW optimizer but does not specify version numbers for any software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages. |
| Experiment Setup | Yes | To train CLIPSelf, we use 8 A100 GPUs and set the batch size as 2 on each GPU. We train the models for 6 epochs using the AdamW (Loshchilov & Hutter, 2017) optimizer with a learning rate of 1e-5 and weight decay of 0.1. (See the optimizer sketch below the table.) |
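
Since the paper reports concrete training hyperparameters but no software versions, the following is a minimal PyTorch sketch of how those settings map onto `torch.optim.AdamW`. Only the hyperparameter values (8 GPUs, batch size 2 per GPU, 6 epochs, lr=1e-5, weight decay 0.1) come from the paper; the stand-in model and the illustrative training step are assumptions, not the authors' actual code (see https://github.com/wusize/CLIPSelf for that).

```python
import torch
import torch.nn as nn

# Stand-in module; in the paper this would be a CLIP ViT image encoder
# being fine-tuned with CLIPSelf's self-distillation objective.
model = nn.Linear(768, 512)

# AdamW with lr=1e-5 and weight decay 0.1, as reported in the paper.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.1)

NUM_EPOCHS = 6          # training length reported in the paper
BATCH_SIZE_PER_GPU = 2  # the paper trains on 8 A100 GPUs (effective batch 16)

# One illustrative optimization step on random inputs with a dummy loss;
# the real objective aligns dense ViT features with CLIP's image-level
# representations, which is not reproduced here.
x = torch.randn(BATCH_SIZE_PER_GPU, 768)
loss = model(x).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

With 8 GPUs at a per-GPU batch size of 2, the effective global batch size is 16.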