Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
un$^2$CLIP: Improving CLIP's Visual Detail Capturing Ability via Inverting unCLIP
Authors: Yinqi Li, Jiahe Zhao, Hong Chang, Ruibing Hou, Shiguang Shan, Xilin Chen
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our improved CLIP across various tasks to which CLIP has been applied, including the challenging MMVP-VLM benchmark, the dense-prediction open-vocabulary segmentation task, and multimodal large language model tasks. Experiments show that un2CLIP significantly improves the original CLIP and previous CLIP improvement methods. |
| Researcher Affiliation | Academia | Yinqi Li$^{1,2}$, Jiahe Zhao$^{1,2}$, Hong Chang$^{1,2}$, Ruibing Hou$^{1}$, Shiguang Shan$^{1,2}$, Xilin Chen$^{1,2}$; $^{1}$State Key Laboratory of AI Safety, Institute of Computing Technology, CAS, China; $^{2}$University of Chinese Academy of Sciences (CAS), China. EMAIL, EMAIL |
| Pseudocode | No | The paper describes its methodology using prose and mathematical equations but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Code and models are available at https://github.com/LiYinqi/un2CLIP. |
| Open Datasets | Yes | For a fair comparison, we train un2CLIP on the CC3M dataset [42] over 1 epoch following [21]. We evaluate our finetuned CLIP models on the MMVP-VLM benchmark [9]. We employ the mean Intersection over Union (mIoU) metric and evaluate on eight datasets widely used for open-vocabulary semantic segmentation. These datasets can be categorized into two groups: (1) Without background category: PASCAL VOC20 (VOC20) [45], PASCAL Context59 (Ctx59) [46], COCO-Stuff (Stuff) [47], Cityscapes (City) [48], and ADE20K (ADE) [49]; (2) With a background category: PASCAL VOC (VOC21) [45], PASCAL Context (Ctx60) [46], and COCO Object (Object) [47]. For completeness, we also present the results on the classical evaluation tasks of CLIP, i.e., zero-shot classification and retrieval, in Section 4.5. These tasks and their standard benchmarks (ImageNet-1K [55], CIFAR-10 [56], CIFAR-100 [56], Caltech-101 [57], SUN397 [58], FGVC Aircraft [59], Stanford Cars [60], Flickr30K [61], and COCO [62]). |
| Dataset Splits | Yes | The MMVP-VLM benchmark [9]. The benchmark covers 9 visual patterns, each comprising 15 image pairs (30 images) accompanied by textual descriptions. To compute the expectation term $\mathbb{E}_{\mathbf{x},\epsilon,t}$ in practice, we randomly sample noises $\epsilon$ and timesteps $t$ (with $\mathbf{x}$ from the test set). We first replace CC3M with ImageNet-1K [55], a class-balanced dataset containing 1.3M images across 1000 categories. |
| Hardware Specification | Yes | un2CLIP is trained on 8 Nvidia-A100-40GB GPUs with a global batch size of 32, learning rate of 3e-7, using AdamW optimizer. We train this unCLIP model on the CC3M dataset [42], using a global batch size of 2048 following the configuration of stable-diffusion-2-1-base. The model is trained for 15K iterations (about 10 epochs over CC3M), taking about 5 days with 8 Nvidia-A100-40GB GPUs, as summarized in the last row of Table 8. |
| Software Dependencies | No | The paper mentions 'AdamW optimizer' and references 'Stable unCLIP' and 'stable-diffusion-2-1' models, but does not specify explicit version numbers for programming languages (e.g., Python), frameworks (e.g., PyTorch), or other key software libraries. |
| Experiment Setup | Yes | un2CLIP is trained on 8 Nvidia-A100-40GB GPUs with a global batch size of 32, learning rate of 3e-7, using AdamW optimizer. For a fair comparison, we train un2CLIP on the CC3M dataset [42] over 1 epoch following [21], taking around 15 to 32 hours for different model types. The remaining hyper-parameters are kept the same as the training configuration of Stable unCLIP in the codebase. |
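
The training figures quoted in the Hardware Specification and Experiment Setup rows can be checked with a little arithmetic. The following is a minimal sketch, not the authors' configuration code: the class, field, and function names are hypothetical, and the even split of the global batch across GPUs (with no gradient accumulation) is an assumption the table does not confirm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Reported hyperparameters; the field names are hypothetical."""

    num_gpus: int
    global_batch_size: int
    learning_rate: float
    optimizer: str

    @property
    def per_gpu_batch_size(self) -> int:
        # Assumption: the global batch is split evenly across GPUs
        # and no gradient accumulation is used.
        if self.global_batch_size % self.num_gpus != 0:
            raise ValueError("global batch size is not divisible by GPU count")
        return self.global_batch_size // self.num_gpus


def samples_seen(iterations: int, global_batch_size: int) -> int:
    """Total training samples processed over a number of optimizer steps."""
    return iterations * global_batch_size


# un2CLIP finetuning values as quoted in the table.
UN2CLIP = TrainConfig(
    num_gpus=8,
    global_batch_size=32,
    learning_rate=3e-7,
    optimizer="AdamW",
)

# unCLIP training values as quoted: 15K iterations at a global batch size
# of 2048, described as about 10 epochs over CC3M.
UNCLIP_ITERATIONS = 15_000
UNCLIP_GLOBAL_BATCH = 2048
REPORTED_EPOCHS = 10

total_samples = samples_seen(UNCLIP_ITERATIONS, UNCLIP_GLOBAL_BATCH)
# Epoch size implied by the quoted numbers, not an official dataset count.
implied_samples_per_epoch = total_samples // REPORTED_EPOCHS

if __name__ == "__main__":
    print("un2CLIP per-GPU batch size:", UN2CLIP.per_gpu_batch_size)
    print("unCLIP samples seen:", total_samples)
    print("implied samples per epoch:", implied_samples_per_epoch)
```

Under these assumptions the un2CLIP run uses 4 samples per GPU per step, and the unCLIP run processes about 30.7M samples, which implies roughly 3.07M samples per epoch and is consistent with the quoted "about 10 epochs over CC3M".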