Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
SuperCLIP: CLIP with Simple Classification Supervision
Authors: Weiheng Zhao, Zilong Huang, Jiashi Feng, Xinggang Wang
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that our method effectively helps CLIP models recover rich textual supervision from all words in the text, whether trained on original web data or rich recaptioned data, leading to consistent improvements in zero-shot performance on classification and retrieval tasks, while also enhancing the vision encoder's features for purely visual tasks. Our main contributions can be summarized as follows: 3. Empirical results demonstrate that SuperCLIP achieves improved performance on zero-shot classification and retrieval tasks, as well as on purely visual downstream tasks, thereby confirming its broad effectiveness. |
| Researcher Affiliation | Collaboration | ¹School of EIC, Huazhong University of Science and Technology; ²ByteDance. Code & Models: hustvl/SuperCLIP |
| Pseudocode | No | The paper describes the methods in textual paragraphs and uses diagrams (e.g., Figure 2) but does not contain a clearly labeled pseudocode or algorithm block with structured steps. |
| Open Source Code | No | Code and models will be made publicly available upon acceptance. |
| Open Datasets | Yes | We pretrain our proposed SuperCLIP and CLIP on a standard subset of the Datacomp dataset [11]. We demonstrate that our method consistently benefits CLIP across different model sizes, through zero-shot image classification on ImageNet-1K [8] (val and v2) and image-text retrieval on COCO [35] and Flickr30K [66]. For linear probing image classification experiments, we follow the training protocol introduced in MAE [18]. For semantic segmentation and depth estimation, we follow a protocol similar to DINOv2 [44]. Enhance CLIP for Purely Visual Tasks. We demonstrate how our method enhances CLIP for purely visual tasks, through linear probing image classification on ImageNet, semantic segmentation on Pascal [9] and ADE20K [71], and depth estimation on NYUv2 [43]. |
| Dataset Splits | Yes | We pretrain our proposed SuperCLIP and CLIP on a standard subset of the Datacomp dataset [11]. We report zero-shot image classification accuracy (%) on ImageNet-1K (val and v2), and zero-shot image and text retrieval (Recall@1, %) on COCO and Flickr30K. For linear probing image classification experiments, we freeze the backbone and train a linear classification head. For the semantic segmentation and depth estimation tasks, we similarly attach a linear head to the backbone, but fine-tune the entire model. |
| Hardware Specification | No | The paper's NeurIPS checklist states "See Appendix A.6." for compute resources, but Appendix A.6 is not present in the provided text. The paper mentions "demanding significant computational resources" and FLOPs counts in Table 3, but no specific hardware models (e.g., GPU/CPU types, memory) are detailed. |
| Software Dependencies | No | The paper mentions "AdamW with a cosine schedule" as an optimizer and "CLIP's subword-level tokenizer". It also references using the "LAION CLIP Benchmark framework [47]". However, specific version numbers for these or other key software components (like Python, PyTorch, or CUDA) are not provided. |
| Experiment Setup | Yes | Pretraining Setup. We pretrain our proposed SuperCLIP and CLIP on a standard subset of the Datacomp dataset [11], which contains about 1.3B image-text pairs. All images are resized to a fixed resolution of 224 × 224, and the text is minimally processed with only basic tokenization. All experiments are conducted with a batch size of 16k, except for those under varying batch sizes analyzing the impact on CLIP. For fair comparison, all models adopt AdamW with a cosine schedule, using the same learning rate and weight decay as CLIP. |
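
The Experiment Setup row names a cosine learning-rate schedule but this page does not state the learning-rate values, the number of steps, or whether warmup is used. As a rough illustration only, a plain cosine decay could be sketched as follows; the function name `cosine_lr`, the lack of warmup, and every number in the usage lines are assumptions made for this example and are not taken from the paper.

```python
import math


def cosine_lr(step: int, total_steps: int, base_lr: float, min_lr: float = 0.0) -> float:
    """Return the cosine-decayed learning rate at `step` (hypothetical helper).

    Assumes no warmup phase; the schedule starts at `base_lr` and decays
    to `min_lr` at `total_steps`.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp the step so values outside [0, total_steps] stay at the endpoints.
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Placeholder values chosen for illustration, not the paper's settings.
    for step in (0, 250, 500, 750, 1000):
        print(step, cosine_lr(step, total_steps=1000, base_lr=1e-3))
```

In a PyTorch training loop the same shape is usually obtained from a built-in scheduler rather than a hand-written function; the helper above only makes the decay curve explicit.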