Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Robustness in Both Domains: CLIP Needs a Robust Text Encoder

Authors: Elias Abad Rocamora, Christian Schlarmann, Naman Deep Singh, Yongtao Wu, Matthias Hein, Volkan Cevher

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we fill this gap by studying adversarial finetuning for CLIP text encoders, proposing Levenshtein Efficient Adversarial Finetuning (LEAF). Motivated by recent advancements in the image domain, we optimize the same objective as Schlarmann et al. [2024], allowing us to replace the text encoder in tasks like text-to-image generation, without needing to finetune the rest of the pipeline. Moreover, to make adversarial finetuning faster in the text domain, we propose an attack that can be parallelized within training batches, accelerating the approach of Abad Rocamora et al. [2024] by an order of magnitude with very little loss of performance. Our models, LEAF, are able to improve the zero-shot adversarial accuracy of CLIP models from 44.5% to 63.3% in AG-News at distance k = 1 (one character change). When plugged into Stable Diffusion [Rombach et al., 2022, Podell et al., 2024], we achieve higher quality images under character-level perturbations. For retrieval tasks, our models achieve a recall 10 points higher on average than non-robust CLIP models at k = 2. Moreover, when inverting the embeddings of text encoders through direct optimization, we show that with LEAF models, we can recover a higher percentage of the original sentence. This results in LEAF encoders being more interpretable. Overall, we show the robustness of CLIP text encoders can be improved with minimal effects on the clean performance in several tasks. We believe our robust CLIP models can make future models incorporating CLIP more robust and interpretable. Our code and models can be found in github.com/LIONS-EPFL/LEAF and huggingface.co/LEAF-CLIP respectively.
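The "distance k" in the excerpt above is a character-level edit distance, as the method name (Levenshtein) suggests. As a minimal sketch of that metric, and not the authors' implementation, the standard dynamic program below counts single-character insertions, deletions and substitutions. The function name `levenshtein` and the example strings are illustrative assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn `a` into `b`."""
    # prev[j] holds the distance between the processed prefix of `a` and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca by cb (or keep it)
            ))
        prev = curr
    return prev[-1]


# A perturbation at k = 1 differs from the clean sentence by one character.
clean = "a photo of a dog"
perturbed = "a photo of a dxg"
distance = levenshtein(clean, perturbed)
```

Under this reading, an attack with radius k may return any sentence whose distance to the clean sentence is at most k.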
Researcher Affiliation | Academia | Elias Abad Rocamora, Christian Schlarmann, Naman Deep Singh, Yongtao Wu, Matthias Hein, Volkan Cevher. LIONS, École Polytechnique Fédérale de Lausanne, Switzerland; Tübingen AI Center, University of Tübingen, Germany. {name.surname}@{epfl.ch, uni-tuebingen.de}
Pseudocode | Yes | Algorithm 1 LEAF batched attack
1: Inputs: Text encoder $f_\theta : S(\Gamma) \to \mathbb{R}^h$, batch $\{S_i\}_{i=1}^{B}$, loss function $L$, radius $k$, number of simultaneous perturbations $\rho$, alphabet $\Gamma$, test character $t$ and flag for semantic constraints Cons.
2: $\hat{S}_i = S_i \;\; \forall i \in [B]$ (Initialize perturbations with clean sentences.)
3: for $1, \dots, k$ do
4: $p_{ij} \sim \mathrm{Unif}([2|\hat{S}_i| + 1]) \;\; \forall i \in [B], \; \forall j \in [\rho]$ (Sample $\rho$ positions in every sentence.)
5: $S = \{\psi(\phi(\hat{S}_i) \overset{p_{ij}}{\leftarrow} t)\}_{i=1}^{\rho}$ (Replace the test character in all $p_{ij}$.)
6: if Cons then (Use Algorithm 2 to check if the perturbation is valid, revert otherwise.)
(...)
Algorithm 2 Semantic constraints
1: Inputs: Sentence $S$ and perturbation $S'$.
2: $m = |\mathrm{words}(S)|$
3: $n = |\mathrm{words}(S')|$ (We extract English words using NLTK: https://www.nltk.org/)
4: return $m > n$
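Steps 4 to 6 of Algorithm 1 and the check in Algorithm 2 can be sketched in a few lines of Python. This is a hedged illustration, not the released LEAF code. It assumes that ϕ interleaves a placeholder symbol between characters (so a sentence of length n has 2n + 1 positions, matching the sampling range in step 4) and that ψ removes the placeholder again. The names `XI`, `expand`, `contract`, `candidates` and `satisfies_constraint` are hypothetical, and a tiny hand-written vocabulary stands in for the NLTK word list. The elided part of the algorithm is not reproduced.

```python
import random

XI = "\x00"  # hypothetical placeholder symbol; assumed never to occur in real text


def expand(sentence):
    """Assumed phi: interleave the placeholder so n characters give 2n + 1 slots."""
    slots = [XI]
    for ch in sentence:
        slots.extend([ch, XI])
    return slots


def contract(slots):
    """Assumed psi: drop every placeholder and join what is left."""
    return "".join(ch for ch in slots if ch != XI)


def candidates(sentence, t, rho, rng):
    """Sample rho positions and write the test character t at each one.

    Writing t on a placeholder slot inserts a character, writing it on a
    character slot substitutes, and t == XI on a character slot deletes.
    """
    slots = expand(sentence)
    out = []
    for _ in range(rho):
        p = rng.randrange(len(slots))  # uniform over the 2n + 1 positions
        edited = slots.copy()
        edited[p] = t
        out.append(contract(edited))
    return out


def count_words(sentence, vocab):
    """Number of whitespace-separated tokens found in the vocabulary."""
    return sum(1 for w in sentence.lower().split() if w in vocab)


def satisfies_constraint(sentence, perturbed, vocab):
    """Algorithm 2 as listed: compare word counts and return m > n."""
    m = count_words(sentence, vocab)
    n = count_words(perturbed, vocab)
    return m > n


rng = random.Random(0)
batch = ["a dog", "two cats"]
# One candidate list per sentence; flattening them gives a single batch
# that an encoder could score in one forward pass.
per_sentence = [candidates(s, "x", 50, rng) for s in batch]
```

With the toy vocabulary `{"a", "dog", "dig"}`, the perturbation "a dxg" passes the check (one English word fewer), while "a dig" is rejected because it forms a new English word.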
Open Source Code | Yes | Our code and models can be found in github.com/LIONS-EPFL/LEAF and huggingface.co/LEAF-CLIP respectively. Our source code and documentation are published at https://github.com/LIONS-EPFL/LEAF.
Open Datasets | Yes | We train our text encoders for 30 epochs on the first 80,000 samples of the DataComp-small dataset [Gadre et al., 2023] with a batch size of 128 sentences, k = 1, ρ = 50 and semantic constraints, see Section 4.2.2, employing CLIP-ViT-L/14, OpenCLIP-ViT-H/14, OpenCLIP-ViT-g/14 and OpenCLIP-ViT-bigG/14 models. For zero-shot image classification, we measure the clean and robust accuracy on 13 datasets: Caltech101 Griffin et al. [2007], Stanford Cars Krause et al. [2013], CIFAR10, CIFAR100 Krizhevsky [2009], DTD Cimpoi et al. [2014], EuroSAT Helber et al. [2019], FGVC Aircrafts Maji et al. [2013], Flowers Nilsback and Zisserman [2008], ImageNet-R Hendrycks et al. [2021], ImageNet Sketch Wang et al. [2019], PCAM Veeling et al. [2018], Oxford Pets Parkhi et al. [2012], and STL10 Coates et al. [2011]. We evaluate the performance on SST-2 [Socher et al., 2013], IMDB [Maas et al., 2011] and Yelp [Yelp, 2015, Zhang et al., 2015].
Dataset Splits | Yes | We train our text encoders for 30 epochs on the first 80,000 samples of the DataComp-small dataset [Gadre et al., 2023] with a batch size of 128 sentences, k = 1, ρ = 50 and semantic constraints, see Section 4.2.2, employing CLIP-ViT-L/14, OpenCLIP-ViT-H/14, OpenCLIP-ViT-g/14 and OpenCLIP-ViT-bigG/14 models. For 1,000 validation set queries, the attack maximizes the similarity between the test query and a target string using different variants of the Charmer attack. In Fig. 6 we present the MS-COCO [Lin et al., 2014] SDXL image generation results. In Tables 13 and 14 we present the generation results in SD-1.5 and SDXL in the MS-COCO dataset and the first 5,000 images of the Flickr30k dataset. We randomly sample 100 captions from MS-COCO val2017 and use the optimization method proposed by Wen et al. [2023] with 3000 iterations, learning rate 0.1, and weight decay 0.1.
Hardware Specification | Yes | All of our experiments are conducted in a single Nvidia A100 40GB GPU, except for training robust image encoders, where 8 GPUs were employed.
Software Dependencies | No | We employ the AdamW optimizer [Kingma and Ba, 2015, Loshchilov and Hutter, 2019]. Our codebase is based on OpenCLIP [Ilharco et al., 2021]. We extract English words using NLTK: https://www.nltk.org/.
Experiment Setup | Yes | We train our text encoders for 30 epochs on the first 80,000 samples of the DataComp-small dataset [Gadre et al., 2023] with a batch size of 128 sentences, k = 1, ρ = 50 and semantic constraints, see Section 4.2.2, employing CLIP-ViT-L/14, OpenCLIP-ViT-H/14, OpenCLIP-ViT-g/14 and OpenCLIP-ViT-bigG/14 models. On the visual side, we scale the training method of Schlarmann et al. [2024] to ViT-H/14 and ViT-g/14, using an $\ell_\infty$ threat model with radius ϵ = 2/255. See Appendix B.3 for a detailed account of hyperparameters. All of our text encoders are trained on the first 80,000 samples of the DataComp-small dataset [Gadre et al., 2023] for 30 epochs with a batch size of 128 sentences. We employ the AdamW optimizer [Kingma and Ba, 2015, Loshchilov and Hutter, 2019], a weight decay of $10^{-4}$, a maximum learning rate of $10^{-5}$ with a linear warmup of 1,400 steps and cosine decay. For training the robust vision encoder, we adapt the setup of Schlarmann et al. [2024]. Namely, we train on images from ImageNet for 10k steps (instead of 20k, due to compute constraints) with a batch size of 128 for ViT-H/14 and 64 for ViT-g/14. We use weight decay of $10^{-4}$, a maximum learning rate of $10^{-5}$ with a linear warmup of 700 steps and cosine decay. To optimize the inner adversarial objective, we use PGD with 10 steps and set ϵ = 2/255.
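The text-encoder schedule quoted above (linear warmup, then cosine decay) can be written out as a small function to make the step counts concrete. This is a sketch under stated assumptions, not the authors' training script: it assumes one optimizer step per batch, a warmup that starts at zero, and a cosine decay that ends at zero, none of which the excerpt specifies. The name `lr_at` is hypothetical.

```python
import math


def lr_at(step, total_steps, max_lr=1e-5, warmup_steps=1400):
    """Learning rate at a given optimizer step.

    Linear warmup from 0 to max_lr over `warmup_steps`, then cosine decay
    from max_lr down to 0 at `total_steps` (the end values are assumptions).
    """
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * progress))


# Step budget implied by the quoted setup, assuming one step per batch.
steps_per_epoch = 80_000 // 128      # 625 batches of 128 sentences
total_steps = 30 * steps_per_epoch   # 18750 steps over 30 epochs

peak = lr_at(1400, total_steps)      # first step after warmup
halfway = lr_at(700, total_steps)    # midpoint of the warmup
```

Under these assumptions the warmup covers 1,400 of 18,750 steps, roughly the first 7.5% of training.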