Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction
Authors: Size Wu, Wenwei Zhang, Lumin Xu, Sheng Jin, Xiangtai Li, Wentao Liu, Chen Change Loy
ICLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The effectiveness of CLIPSelf is validated on open-vocabulary object detection and image segmentation benchmarks. For open-vocabulary object detection, we established a two-stage baseline based on frozen CLIP ViTs, and the fine-tuned models achieved new state-of-the-art performance on OV-COCO and OV-LVIS benchmarks, as well as on the transfer detection benchmark. |
| Researcher Affiliation | Collaboration | 1 S-Lab, Nanyang Technological University 2 The Chinese University of Hong Kong 3 The University of Hong Kong 4 SenseTime Research and Tetras.AI 5 Shanghai AI Laboratory |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Models and code are released at https://github.com/wusize/CLIPSelf. |
| Open Datasets | Yes | By default, we use the images in train2017 split of COCO dataset (Lin et al., 2014), which are exactly the training images of most downstream open-vocabulary benchmarks. ... For the OV-LVIS benchmark, we use the images from the train split of LVIS v1.0 (Gupta et al., 2019). |
| Dataset Splits | Yes | The mean accuracy (mAcc) of classifying region boxes annotated in COCO's val2017 split is used as the indicator for evaluation. |
| Hardware Specification | Yes | To train CLIPSelf, we use 8 A100 GPUs and set the batch size as 2 on each GPU. |
| Software Dependencies | No | The paper mentions using the AdamW optimizer but does not specify version numbers for any software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages. |
| Experiment Setup | Yes | To train CLIPSelf, we use 8 A100 GPUs and set the batch size as 2 on each GPU. We train the models for 6 epochs using the AdamW (Loshchilov & Hutter, 2017) optimizer with a learning rate of 1e-5 and weight decay of 0.1. |
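
The training settings quoted in the Experiment Setup row can be collected into a small configuration object. This is a minimal sketch, not code from the CLIPSelf repository: the `TrainConfig` class and its field names are hypothetical, and the derived global batch size assumes plain data parallelism without gradient accumulation, which the quoted text does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the hyperparameters quoted in the table."""

    num_gpus: int = 8             # "8 A100 GPUs"
    batch_per_gpu: int = 2        # "batch size as 2 on each GPU"
    epochs: int = 6               # "6 epochs"
    optimizer: str = "AdamW"      # Loshchilov & Hutter, 2017
    learning_rate: float = 1e-5
    weight_decay: float = 0.1

    @property
    def global_batch_size(self) -> int:
        # Assumption: data parallelism only, no gradient accumulation.
        return self.num_gpus * self.batch_per_gpu


config = TrainConfig()
print(config.global_batch_size)
```

Under that assumption, each optimizer step would see 16 images in total. Software versions are left out on purpose, since the paper does not report them.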