Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

un$^2$CLIP: Improving CLIP's Visual Detail Capturing Ability via Inverting unCLIP

Authors: Yinqi Li, Jiahe Zhao, Hong Chang, Ruibing Hou, Shiguang Shan, Xilin Chen

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluate our improved CLIP across various tasks to which CLIP has been applied, including the challenging MMVP-VLM benchmark, the dense-prediction open-vocabulary segmentation task, and multimodal large language model tasks. Experiments show that un2CLIP significantly improves the original CLIP and previous CLIP improvement methods.
Researcher Affiliation Academia Yinqi Li1,2, Jiahe Zhao1,2, Hong Chang1,2, Ruibing Hou1, Shiguang Shan1,2, Xilin Chen1,2; 1 State Key Laboratory of AI Safety, Institute of Computing Technology, CAS, China; 2 University of Chinese Academy of Sciences (CAS), China
Pseudocode No The paper describes its methodology using prose and mathematical equations but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code Yes Code and models are available at https://github.com/LiYinqi/un2CLIP.
Open Datasets Yes For a fair comparison, we train un2CLIP on the CC3M dataset [42] over 1 epoch following [21]. We evaluate our finetuned CLIP models on the MMVP-VLM benchmark [9]. We employ the mean Intersection over Union (mIoU) metric and evaluate on eight datasets widely used for open-vocabulary semantic segmentation. These datasets can be categorized into two groups: (1) Without background category: PASCAL VOC20 (VOC20) [45], PASCAL Context59 (Ctx59) [46], COCO-Stuff (Stuff) [47], Cityscapes (City) [48], and ADE20K (ADE) [49]; (2) With a background category: PASCAL VOC (VOC21) [45], PASCAL Context (Ctx60) [46], and COCO Object (Object) [47]. For completeness, we also present the results on the classical evaluation tasks of CLIP, i.e., zero-shot classification and retrieval, in Section 4.5. These tasks and their standard benchmarks (ImageNet-1K [55], CIFAR-10 [56], CIFAR-100 [56], Caltech-101 [57], SUN397 [58], FGVC Aircraft [59], Stanford Cars [60], Flickr30K [61], and COCO [62]).
Dataset Splits Yes The MMVP-VLM benchmark [9]. The benchmark covers 9 visual patterns, each comprising 15 image pairs (30 images) accompanied by textual descriptions. To compute the expectation term $\mathbb{E}_{\mathbf{x},\boldsymbol{\epsilon},d}$ in practice, we random sample noises $\boldsymbol{\epsilon}$ and timesteps $d$ (with $\mathbf{x}$ from the test set). We first replace CC3M with ImageNet-1K [55], a class-balanced dataset containing 1.3M images across 1000 categories.
Hardware Specification Yes un2CLIP is trained on 8 Nvidia-A100-40GB GPUs with a global batch size of 32, learning rate of 3e-7, using AdamW optimizer. We train this unCLIP model on the CC3M dataset [42], using a global batch size of 2048 following the configuration of stable-diffusion-2-1-base. The model is trained for 15K iterations (about 10 epochs over CC3M), taking about 5 days with 8 Nvidia-A100-40GB GPUs, as summarized in the last row of Table 8.
Software Dependencies No The paper mentions 'AdamW optimizer' and references 'Stable unCLIP' and 'stable-diffusion-2-1' models, but does not specify explicit version numbers for programming languages (e.g., Python), frameworks (e.g., PyTorch), or other key software libraries.
Experiment Setup Yes un2CLIP is trained on 8 Nvidia-A100-40GB GPUs with a global batch size of 32, learning rate of 3e-7, using AdamW optimizer. For a fair comparison, we train un2CLIP on the CC3M dataset [42] over 1 epoch following [21], taking around 15 to 32 hours for different model types. The remaining hyper-parameters are kept the same as the training configuration of Stable unCLIP in the codebase.
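
The training figures quoted in the Hardware Specification and Experiment Setup rows can be sanity-checked with a little arithmetic. The sketch below is illustrative only and is not taken from the paper's codebase: the constant and function names are hypothetical, the per-GPU split assumes the global batch is divided evenly across the 8 GPUs (whether by true per-device batches or gradient accumulation is not stated), and the CC3M epoch size is not given in the text, so it is back-derived from the "about 10 epochs" statement rather than looked up.

```python
# Figures quoted in the table rows above; all names here are hypothetical.
NUM_GPUS = 8
UN2CLIP_GLOBAL_BATCH = 32      # un2CLIP finetuning
UNCLIP_GLOBAL_BATCH = 2048     # unCLIP training
UNCLIP_ITERATIONS = 15_000     # "15K iterations"
UNCLIP_EPOCHS_APPROX = 10      # "about 10 epochs over CC3M"


def per_gpu_batch(global_batch: int, num_gpus: int) -> int:
    """Per-GPU share of the global batch, assuming an even split (an assumption)."""
    if global_batch % num_gpus != 0:
        raise ValueError("global batch is not divisible by the number of GPUs")
    return global_batch // num_gpus


def samples_seen(iterations: int, global_batch: int) -> int:
    """Total training samples processed over a run."""
    return iterations * global_batch


un2clip_per_gpu = per_gpu_batch(UN2CLIP_GLOBAL_BATCH, NUM_GPUS)
unclip_per_gpu = per_gpu_batch(UNCLIP_GLOBAL_BATCH, NUM_GPUS)
unclip_samples = samples_seen(UNCLIP_ITERATIONS, UNCLIP_GLOBAL_BATCH)

# Epoch size implied by the quoted figures, not an official CC3M count.
implied_epoch_size = unclip_samples / UNCLIP_EPOCHS_APPROX

print(f"un2CLIP per-GPU share of batch: {un2clip_per_gpu}")
print(f"unCLIP per-GPU share of batch:  {unclip_per_gpu}")
print(f"unCLIP samples seen:            {unclip_samples:,}")
print(f"implied images per epoch:       {implied_epoch_size:,.0f}")
```

Under these assumptions the unCLIP run processes roughly 30.7M samples, which matches "about 10 epochs" if one pass over the training set is on the order of 3M images.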