Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Enhancing CLIP Robustness via Cross-Modality Alignment

Authors: Xingyu Zhu, Beier Zhu, Shuo Wang, Kesen Zhao, Hanwang Zhang

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we present the experimental results of our method under adversarial perturbations, including performance comparisons, ablation studies, and visualization analyses. (Section 4), along with tables such as Table 1: Classification accuracy (%) on 9 widely-used datasets.
Researcher Affiliation | Academia | (1) University of Science and Technology of China; (2) Nanyang Technological University; EMAIL, EMAIL
Pseudocode | Yes | The overall procedure of COLA is summarized in Algorithm 1, which outlines the projection-based alignment and OT-based matching steps for adversarially robust inference. (Section C Algorithm)
Open Source Code | Yes | Answer: [Yes] Justification: We have uploaded the codes in supplemental material.
Open Datasets | Yes | We evaluate our method on 14 classification datasets spanning a broad range of domains, including generic objects (ImageNet [14], Caltech101 [20]), scenes (SUN397 [58]), textures (DTD [10]), satellite imagery (EuroSAT [23]), and various fine-grained categories such as pets, cars, flowers, food, and aircraft (Pets [39], Cars [26], Flowers [38], Food101 [6], Aircraft [34]).
Dataset Splits | Yes | We evaluate our method on 14 classification datasets spanning a broad range of domains, including generic objects (ImageNet [14], Caltech101 [20]), scenes (SUN397 [58]), textures (DTD [10]), satellite imagery (EuroSAT [23]), and various fine-grained categories such as pets, cars, flowers, food, and aircraft (Pets [39], Cars [26], Flowers [38], Food101 [6], Aircraft [34]).
Hardware Specification | Yes | All experiments are conducted on a single NVIDIA 3090 GPU if not specified.
Software Dependencies | No | Our experiments are based on the pre-trained CLIP model, using ViT-B/32 as the visual encoder and a Transformer as the text encoder. No specific software library versions (e.g., PyTorch, TensorFlow, CUDA) are mentioned.
Experiment Setup | Yes | The attack budgets, including PGD attack and CW attack [36, 7], are set to ε_a = 1/255 by default. The number of steps for attacks is set to 10. All attacks are bounded by an L∞ radius. For each test image, we generate N = 5 augmented views including the original. For each class, we use the LLM to generate M = 50 text descriptions. We select the top-C = 256 components from the SVD of class text features to build the projection matrix.
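The Experiment Setup entry mentions building a projection matrix from the top-C components of an SVD of class text features. The following is a minimal NumPy sketch of one way such a projector could be formed; it is not the authors' implementation. The function name `build_projection`, the random stand-in features, the toy class count, the feature dimension of 512, and the choice to stack all descriptions into one matrix before the SVD are all assumptions made for illustration.

```python
import numpy as np


def build_projection(text_feats: np.ndarray, top_c: int) -> np.ndarray:
    """Hypothetical helper: orthogonal projector onto the span of the
    top_c right singular vectors of text_feats (shape: n_texts x dim)."""
    # Rows of vt are orthonormal directions in feature space,
    # ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(text_feats, full_matrices=False)
    basis = vt[:top_c]           # (top_c, dim)
    return basis.T @ basis       # (dim, dim), symmetric and idempotent


rng = np.random.default_rng(0)
dim = 512          # assumed feature dimension
n_classes = 10     # toy value, not from the source
m_desc = 50        # M = 50 text descriptions per class (from the setup)
top_c = 256        # C = 256 components (from the setup)

# Random unit vectors stand in for real text features (assumption).
text_feats = rng.standard_normal((n_classes * m_desc, dim))
text_feats /= np.linalg.norm(text_feats, axis=1, keepdims=True)

P = build_projection(text_feats, top_c)

# N = 5 augmented views of one image, again random stand-ins.
image_views = rng.standard_normal((5, dim))
projected = image_views @ P      # each view mapped into the text subspace
```

Because `P` is an orthogonal projector, applying it twice gives the same result as applying it once, so the discarded components of an image feature are removed rather than rescaled.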