Are aligned neural networks adversarially aligned?

Authors: Nicholas Carlini, Milad Nasr, Christopher A. Choquette-Choo, Matthew Jagielski, Irena Gao, Anas Awadalla, Pang Wei Koh, Daphne Ippolito, Katherine Lee, Florian Tramèr, Ludwig Schmidt

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we study adversarial alignment, and ask to what extent these models remain aligned when interacting with an adversarial user who constructs worst-case inputs (adversarial examples). These inputs are designed to cause the model to emit harmful content that would otherwise be prohibited. We show that existing NLP-based optimization attacks are insufficiently powerful to reliably attack aligned text models: even when current NLP-based attacks fail, we can find adversarial inputs with brute force. As a result, the failure of current attacks should not be seen as proof that aligned text models remain aligned under adversarial inputs.
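The brute-force idea quoted above can be made concrete with a minimal sketch: exhaustively try short token sequences appended to a prompt and keep any suffix that makes the model emit a prohibited target string. This is an illustrative sketch only, not the paper's actual attack procedure; `generate` and `candidate_tokens` are hypothetical placeholders standing in for a real aligned chat model and a candidate vocabulary.

```python
# Illustrative brute-force search for an adversarial suffix (not the paper's exact attack).
import itertools

def brute_force_suffix(generate, prompt, target, candidate_tokens, max_len=3):
    """Return the first suffix (as a string) that elicits `target`, or None."""
    for length in range(1, max_len + 1):
        for combo in itertools.product(candidate_tokens, repeat=length):
            suffix = " ".join(combo)
            response = generate(prompt + " " + suffix)  # query the model
            if target in response:                      # attack succeeded
                return suffix
    return None
```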
Researcher Affiliation | Collaboration | Nicholas Carlini1, Milad Nasr1, Christopher A. Choquette-Choo1, Matthew Jagielski1, Irena Gao2, Anas Awadalla3, Pang Wei Koh1,3, Daphne Ippolito1, Katherine Lee1, Florian Tramèr4, Ludwig Schmidt3 (1 Google DeepMind, 2 Stanford, 3 University of Washington, 4 ETH Zurich)
Pseudocode | No | The paper does not contain any explicit pseudocode or algorithm blocks.
Open Source Code | No | The paper provides links to publicly available models (GPT-2, LLaMA, Vicuna) that were used in the study, but it does not provide links to, or statements about, the availability of the authors' own source code for the described methodology.
Open Datasets | Yes | MiniGPT-4 [Zhu et al., 2023] uses a pretrained Q-Former module from [Li et al., 2023] to project images encoded by EVA CLIP ViT-G/14 [Fang et al., 2022] into Vicuna's [Chiang et al., 2023] text embedding space. Both CLIP and Vicuna are frozen, while a section of the Q-Former is finetuned on a subset of LAION [Schuhmann et al., 2021], Conceptual Captions [Sharma et al., 2018], SBU [Ordonez et al., 2011], and multimodal instruction-following data generated by the authors. LLaVA [Liu et al., 2023] uses a linear layer to project features from CLIP ViT-L/14 to the Vicuna embedding space. While CLIP is frozen, both Vicuna and the projection matrix are finetuned on Conceptual Captions [Sharma et al., 2018] and custom multimodal instruction-following data.
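The LLaVA-style design quoted above can be summarized in a short sketch: a single trainable linear layer maps frozen CLIP image features into the language model's token-embedding space, where they are concatenated with the text embeddings. The dimensions and module names below are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of a linear vision-to-LLM projection (LLaVA-style), with assumed dimensions.
import torch
import torch.nn as nn

class VisionToTextProjector(nn.Module):
    def __init__(self, clip_dim=1024, llm_dim=4096):
        super().__init__()
        # trainable projection; the CLIP vision encoder itself stays frozen
        self.proj = nn.Linear(clip_dim, llm_dim)

    def forward(self, clip_features, text_embeddings):
        # clip_features: (batch, n_patches, clip_dim) from the frozen vision encoder
        # text_embeddings: (batch, n_tokens, llm_dim) from the LLM's embedding table
        image_tokens = self.proj(clip_features)
        # prepend projected image "tokens" to the text sequence fed to the LLM
        return torch.cat([image_tokens, text_embeddings], dim=1)
```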
Dataset Splits | No | The paper mentions collecting an 'evaluation dataset' and constructing 'test cases' but does not provide specific training, validation, or test split percentages or sample counts for these datasets, nor does it refer to predefined splits with citations for reproducibility.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, memory amounts, or cloud instance specifications) used to run the experiments.
Software Dependencies | No | The paper mentions various models and techniques but does not specify the versions of software libraries, frameworks, or programming languages (e.g., Python, PyTorch, TensorFlow, CUDA) used in the experiments.
Experiment Setup | Yes | To initiate each attack, we use a random image generated by sampling each pixel uniformly at random. We use projected gradient descent [Madry et al., 2017]. We use an arbitrarily large ϵ and run for a maximum of 500 steps or until the attack succeeds; note that we report the final distortions in Table 3. We use the default step size of 0.2.
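A hedged sketch of the quoted setup follows: a random-uniform initial image, signed-gradient PGD steps of size 0.2, at most 500 iterations with early stopping on success, and no meaningful ϵ-ball constraint beyond the valid pixel range. `loss_fn` and `attack_succeeded` are hypothetical stand-ins for the paper's objective on the multimodal model; this is not the authors' implementation.

```python
# Sketch of a PGD image attack matching the quoted configuration (assumed details noted above).
import torch

def pgd_image_attack(model, loss_fn, attack_succeeded, shape=(1, 3, 224, 224),
                     step_size=0.2, max_steps=500):
    # start from an image with each pixel sampled uniformly at random in [0, 1]
    image = torch.rand(shape, requires_grad=True)
    for _ in range(max_steps):
        loss = loss_fn(model, image)         # e.g. NLL of the harmful target string
        loss.backward()
        with torch.no_grad():
            # signed-gradient descent step, then project back to valid pixel values
            image -= step_size * image.grad.sign()
            image.clamp_(0.0, 1.0)
        image.grad.zero_()
        if attack_succeeded(model, image):   # early exit once the model complies
            return image.detach()
    return image.detach()
```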