Efficient Adversarial Training in LLMs with Continuous Attacks

Authors: Sophie Xhonneux, Alessandro Sordoni, Stephan Günnemann, Gauthier Gidel, Leo Schwinn

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical evaluation on five models from different families (Gemma, Phi3, Mistral, Zephyr, Llama2) and at different scales (2B, 3.8B, 7B) shows that both algorithms substantially enhance LLM robustness against discrete attacks (GCG, AutoDAN, PAIR), while maintaining utility. Our results demonstrate that robustness to continuous perturbations can extrapolate to discrete threat models.
Researcher Affiliation | Collaboration | Sophie Xhonneux (Mila, Université de Montréal) lpxhonneux@gmail.com; Alessandro Sordoni (Microsoft Research, Mila) alsordon@microsoft.com; Stephan Günnemann (Technical University of Munich, Munich Data Science Institute) s.guennemann@tum.de; Gauthier Gidel (Mila, Université de Montréal; Canada CIFAR AI Chair) gidelgau@mila.quebec; Leo Schwinn (Technical University of Munich, Munich Data Science Institute) l.schwinn@tum.de
Pseudocode | No | The paper describes the algorithms and equations in the text, but it does not include a formally structured pseudocode block or algorithm box. (A hedged training-loop sketch is given after this table.)
Open Source Code | Yes | https://github.com/sophie-xhonneux/Continuous-Adv-Train
Open Datasets | Yes | For all AT experiments, we utilise the AT dataset from HarmBench [6] with the safe answer y always being "Sorry, I can't do that." As a utility dataset for CAT, we employ UltraChat200k [32, 33]. (A data-loading sketch follows the table.)
Dataset Splits | No | The paper mentions using a 'test set' for robustness evaluation and 'utility data' for fine-tuning, but it does not specify explicit training/validation/test splits (e.g., percentages or sample counts) for its datasets. It mentions 'preliminary experiments' for hyperparameter tuning but gives no formal validation-split details.
Hardware Specification | Yes | All experiments were performed on an internal cluster of either V100, 40GB A100, or 80GB A100 GPUs.
Software Dependencies | No | The paper mentions LoRA [42] and AdamW [46] as methods, and 4-bit quantization and 16-bit floating point for training, but it does not specify software dependencies with version numbers (e.g., Python, PyTorch, or CUDA versions).
Experiment Setup | Yes | Due to the computational complexity of fine-tuning LLMs, we do not perform full model fine-tuning for both methods but use LoRA [42] on all linear layers of the transformer architectures. Additionally, we use 4-bit quantization for all training runs to further reduce the memory overhead. We use ℓ2-norm perturbations and set the size of the attack ϵ relative to the average magnitude of the token embeddings of the respective model. For all models, we use 10 attack iterations. We set ϵ = 0.1 for GEMMA and PHI-3-MINI. For MISTRAL-7B, LLAMA-7B, and ZEPHYR-7B, we set ϵ = 0.05, ϵ = 0.05, and ϵ = 0.075, respectively. For a full list of AT hyperparameters, see App. A.1. (A configuration sketch follows the table.)
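Because the paper provides no pseudocode block (see the Pseudocode row above), the following is a minimal sketch, not the authors' code, of continuous adversarial training as described in the text: an ℓ2-bounded perturbation is optimised in embedding space for 10 steps, and the model is then trained to produce the safe answer under that perturbation while a utility loss preserves helpfulness. The function names (`continuous_attack`, `adv_training_step`), the step-size rule, and the whole-tensor ℓ2 projection are illustrative assumptions.

```python
# Hypothetical sketch of C-AdvUL-style continuous adversarial training.
# Assumptions: the attack ascends the loss of the safe target, the perturbation
# is projected onto a single l2 ball over all input embeddings (the paper may
# project per token or per example), and a Hugging Face-style causal LM interface.
import torch


def continuous_attack(model, inputs_embeds, attention_mask, labels,
                      eps, steps=10, step_size=None):
    """Gradient ascent on the input embeddings, projected onto an l2 ball of radius eps."""
    step_size = step_size if step_size is not None else eps / steps
    delta = torch.zeros_like(inputs_embeds, requires_grad=True)
    for _ in range(steps):
        loss = model(inputs_embeds=inputs_embeds + delta,
                     attention_mask=attention_mask, labels=labels).loss
        (grad,) = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta += step_size * grad / (grad.norm() + 1e-12)  # normalised ascent step
            norm = delta.norm()
            if norm > eps:                                     # project back onto the l2 ball
                delta *= eps / norm
    return delta.detach()


def adv_training_step(model, harmful_batch, utility_batch, eps, utility_weight=1.0):
    """One training step: loss on the safe answer under the perturbation plus a utility loss."""
    embeds = model.get_input_embeddings()(harmful_batch["input_ids"])
    delta = continuous_attack(model, embeds, harmful_batch["attention_mask"],
                              harmful_batch["labels"], eps)
    robust_loss = model(inputs_embeds=embeds + delta,
                        attention_mask=harmful_batch["attention_mask"],
                        labels=harmful_batch["labels"]).loss
    utility_loss = model(**utility_batch).loss
    return robust_loss + utility_weight * utility_loss
```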
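For the Open Datasets row, a hedged loading sketch: the UltraChat200k Hub id below is the public HuggingFaceH4 release, while the local HarmBench CSV path and its "Behavior" column name are assumptions (the paper's exact preprocessing lives in the released repository). The fixed refusal string is quoted from the paper.

```python
# Hedged sketch of assembling the two data sources named above; paths and
# column names for the HarmBench file are assumptions, not from the paper.
from datasets import load_dataset

SAFE_ANSWER = "Sorry, I can't do that."

# Utility data: the public UltraChat200k SFT split on the Hugging Face Hub.
utility_data = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")

# Adversarial-training data: HarmBench behaviours paired with the fixed safe answer.
harmbench = load_dataset("csv", data_files="harmbench_behaviors.csv", split="train")
adv_data = harmbench.map(lambda ex: {"prompt": ex["Behavior"], "response": SAFE_ANSWER})
```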
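The ϵ values in the Experiment Setup row are relative, so a concrete run has to rescale them by the model's embedding statistics. Below is a hedged configuration sketch assuming one plausible reading of "relative to the average magnitude of the token embeddings" (mean per-token ℓ2 norm); the model id and the LoRA rank/alpha are placeholders, and the actual hyperparameters are listed in App. A.1 of the paper.

```python
# Hedged setup sketch: 4-bit quantised base model, LoRA on all linear layers,
# and epsilon scaled by the mean l2 norm of the token embeddings. The model id
# and LoRA rank/alpha below are placeholders, not values from the paper.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",                      # placeholder model choice
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    torch_dtype=torch.float16,
)

# Relative epsilon (0.1 for Gemma / Phi-3-mini in the paper) scaled by the
# average token-embedding magnitude of this model.
embed_weight = model.get_input_embeddings().weight
eps = 0.1 * embed_weight.norm(dim=-1).mean().item()

# LoRA adapters on all linear layers, as described in the experiment setup.
lora_cfg = LoraConfig(r=16, lora_alpha=32, target_modules="all-linear",
                      task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)
```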