CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction

Authors: Size Wu, Wenwei Zhang, Lumin Xu, Sheng Jin, Xiangtai Li, Wentao Liu, Chen Change Loy

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The effectiveness of CLIPSelf is validated on open-vocabulary object detection and image segmentation benchmarks. For open-vocabulary object detection, we established a two-stage baseline based on frozen CLIP ViTs, and the fine-tuned models achieved new state-of-the-art performance on OV-COCO and OV-LVIS benchmarks, as well as on the transfer detection benchmark.
Researcher Affiliation | Collaboration | 1 S-Lab, Nanyang Technological University; 2 The Chinese University of Hong Kong; 3 The University of Hong Kong; 4 SenseTime Research and Tetras.AI; 5 Shanghai AI Laboratory
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Models and code are released at https://github.com/wusize/CLIPSelf.
Open Datasets | Yes | By default, we use the images in train2017 split of COCO dataset (Lin et al., 2014), which are exactly the training images of most downstream open-vocabulary benchmarks. ... For the OV-LVIS benchmark, we use the images from the train split of LVIS v1.0 (Gupta et al., 2019).
Dataset Splits | Yes | The mean accuracy (mAcc) of classifying region boxes annotated in COCO's val2017 split is used as the indicator for evaluation.
Hardware Specification | Yes | To train CLIPSelf, we use 8 A100 GPUs and set the batch size as 2 on each GPU.
Software Dependencies | No | The paper mentions using the AdamW optimizer but does not specify version numbers for any software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages.
Experiment Setup | Yes | To train CLIPSelf, we use 8 A100 GPUs and set the batch size as 2 on each GPU. We train the models for 6 epochs using the AdamW (Loshchilov & Hutter, 2017) optimizer with a learning rate of 1e-5 and weight decay of 0.1.
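
For reference, below is a minimal sketch of the reported training configuration (AdamW, learning rate 1e-5, weight decay 0.1, 6 epochs, batch size 2 per GPU), assuming PyTorch. The model and data-loader names are hypothetical placeholders, not the authors' code from https://github.com/wusize/CLIPSelf.

# Hedged sketch of the reported CLIPSelf hyperparameters; placeholders only.
import torch

clip_vit = torch.nn.Linear(768, 512)          # stand-in for the CLIP ViT being fine-tuned
train_loader = [(torch.randn(2, 768),)] * 10  # stand-in loader; paper reports batch size 2 per GPU

optimizer = torch.optim.AdamW(
    clip_vit.parameters(),
    lr=1e-5,            # learning rate reported in the paper
    weight_decay=0.1,   # weight decay reported in the paper
)

num_epochs = 6  # training length reported in the paper
for epoch in range(num_epochs):
    for (images,) in train_loader:
        optimizer.zero_grad()
        loss = clip_vit(images).pow(2).mean()  # placeholder loss, not the CLIPSelf distillation objective
        loss.backward()
        optimizer.step()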