Improving Alignment and Robustness with Circuit Breakers

Authors: Andy Zou, Long Phan, Justin Wang, Derek Duenas, Maxwell Lin, Maksym Andriushchenko, J. Zico Kolter, Matt Fredrikson, Dan Hendrycks

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimentally, we demonstrate that a circuit-breaking technique, Representation Rerouting (RR), notably improves the alignment of LLMs. It improves the harmlessness of state-of-the-art LLMs against a wide array of unseen adversarial attacks, including embedding- and representation-space attacks, which serve as proxies for worst-case assumptions about attacker capabilities. Figure 2 and Table 1 present an overview of these results.
Researcher Affiliation | Collaboration | 1Gray Swan AI, 2Carnegie Mellon University, 3Center for AI Safety
Pseudocode | Yes | Algorithm 1: LoRRA (RepE method) with Representation Rerouting (RR) Loss
Open Source Code | Yes | Code is available at github.com/GraySwanAI/circuit-breakers.
Open Datasets | Yes | The retain set for both models includes UltraChat [15], comprising instructional conversations, and XSTest [57], an exaggerated refusal dataset.
Dataset Splits | Yes | We follow the implementation of Representation Rerouting (RR) specified in Algorithm 1 and select hyperparameters based on static attack test cases from HarmBench's validation set.
Hardware Specification | Yes | Both models are trained on 1 A100-80GB for 20 minutes.
Software Dependencies | No | The paper mentions models like “Mistral-7B-Instruct-v2” and “Llama-3-8B-Instruct” and notes the use of LoRA tuning [23], but it does not provide specific version numbers for software dependencies such as Python, PyTorch, or CUDA, which are necessary for full reproducibility.
Experiment Setup | Yes | For both models, we perform circuit-breaking training for 150 steps with a batch size of 16. For Mistral, we set α to 5, whereas for Llama-3, we adjust α to 10. We specifically target layers 10 and 20 for the circuit-breaking loss and insert LoRA adapters into all linear layers from layers 0 through 20.
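The pseudocode and experiment-setup rows above describe LoRRA training with a Representation Rerouting loss. Below is a minimal PyTorch sketch of how such an objective could be computed; it is not the authors' released implementation. Only α, the 150-step budget, the batch size, and the target layers (10 and 20) come from the reported setup; the function names, toy activation shapes, cosine-based rerouting term, and linear coefficient schedule are illustrative assumptions.

```python
# Hedged sketch of a Representation Rerouting (RR) style objective.
# Assumptions (not taken from the paper's code): the exact rerouting term,
# the L2 retain term, and the linear coefficient schedule.

import torch
import torch.nn.functional as F


def rr_loss(hidden_adapted, hidden_frozen, target_layers=(10, 20)):
    """Rerouting term: penalize similarity between the adapted model's
    representations of harmful inputs and the frozen model's originals."""
    loss = 0.0
    for layer in target_layers:
        cos = F.cosine_similarity(hidden_adapted[layer], hidden_frozen[layer], dim=-1)
        loss = loss + F.relu(cos).mean()  # only positive similarity is penalized
    return loss


def retain_loss(hidden_adapted, hidden_frozen, target_layers=(10, 20)):
    """Retain term: keep representations of benign (retain-set) inputs close
    to the frozen model so general capability is preserved."""
    loss = 0.0
    for layer in target_layers:
        loss = loss + (hidden_adapted[layer] - hidden_frozen[layer]).norm(dim=-1).mean()
    return loss


def circuit_breaker_loss(step, total_steps, alpha,
                         harm_adapted, harm_frozen,
                         retain_adapted, retain_frozen):
    """Combined objective. The linear schedule below is an assumption for
    illustration: the rerouting weight decays while the retain weight grows
    over the training run (e.g., 150 steps)."""
    c_rr = alpha * (1.0 - step / (2.0 * total_steps))
    c_retain = alpha * (step / (2.0 * total_steps))
    return (c_rr * rr_loss(harm_adapted, harm_frozen)
            + c_retain * retain_loss(retain_adapted, retain_frozen))


if __name__ == "__main__":
    # Toy activations: dict of {layer_index: [batch, seq, hidden]} tensors.
    torch.manual_seed(0)
    fake = lambda: {layer: torch.randn(16, 32, 4096) for layer in (10, 20)}
    loss = circuit_breaker_loss(step=0, total_steps=150, alpha=10.0,
                                harm_adapted=fake(), harm_frozen=fake(),
                                retain_adapted=fake(), retain_frozen=fake())
    print(float(loss))
```

In practice the adapted activations would come from the LoRA-tuned model (adapters in layers 0 through 20) and the frozen activations from the original model on the same batch; only the adapter weights would receive gradients.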