Improving Alignment and Robustness with Circuit Breakers
Authors: Andy Zou, Long Phan, Justin Wang, Derek Duenas, Maxwell Lin, Maksym Andriushchenko, J. Zico Kolter, Matt Fredrikson, Dan Hendrycks
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimentally, we demonstrate that a circuit-breaking technique, Representation Rerouting (RR), notably improves the alignment of LLMs. It enhances the harmlessness of state-of-the-art LLMs, including against a wide array of unseen adversarial attacks, such as embedding- and representation-space attacks, namely, proxies for worst-case assumptions about attacker capabilities. Figure 2 and Table 1 present an overview of these results. |
| Researcher Affiliation | Collaboration | 1Gray Swan AI, 2Carnegie Mellon University, 3Center for AI Safety |
| Pseudocode | Yes | Algorithm 1 LoRRA (RepE method) with Representation Rerouting (RR) Loss |
| Open Source Code | Yes | Code is available at github.com/GraySwanAI/circuit-breakers. |
| Open Datasets | Yes | The retain set for both models includes UltraChat [15], comprising instructional conversations, and XSTest [57], an exaggerated refusal dataset. |
| Dataset Splits | Yes | We follow the implementation of Representation Rerouting (RR) specified in Algorithm 1 and select hyperparameters based on static attack test cases from HarmBench's validation set. |
| Hardware Specification | Yes | Both models are trained on 1 A100-80GB for 20 minutes. |
| Software Dependencies | No | The paper mentions models like “Mistral-7B-Instruct-v2” and “Llama-3-8B-Instruct” and notes the use of LoRA tuning [23], but it does not provide specific version numbers for software dependencies such as Python, PyTorch, or CUDA, which are necessary for full reproducibility. |
| Experiment Setup | Yes | For both models, we perform circuit-breaking training for 150 steps with a batch size of 16. For Mistral, we set α to 5, whereas for Llama-3, we adjust α to 10. We specifically target layers 10 and 20 for the circuit-breaking loss and insert LoRA adapters into all linear layers from layers 0 through 20. |
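To make the reported setup concrete, the Representation Rerouting loss described above can be sketched in PyTorch. This is a minimal illustration, not the authors' code: the function name `rr_loss`, the argument names, and the linear coefficient schedule (circuit-breaker weight decaying and retain weight growing over the 150 training steps, scaled by the paper's α) are assumptions based on the description of Algorithm 1. The circuit-breaker term penalizes cosine similarity between the tuned model's representations of harmful inputs and the frozen original model's representations at the targeted layers, while the retain term keeps benign representations unchanged.

```python
import torch
import torch.nn.functional as F

def rr_loss(h_cb, h_cb_orig, h_retain, h_retain_orig,
            alpha, step, total_steps=150):
    """Sketch of a Representation Rerouting (RR) loss (hypothetical names).

    h_cb / h_cb_orig:         tuned vs. frozen-model hidden states on the
                              circuit-breaker (harmful) set, shape (batch, dim).
    h_retain / h_retain_orig: same, on the retain (benign) set.
    alpha:                    overall weight (5 for Mistral, 10 for Llama-3
                              per the reported setup).
    """
    # Assumed schedule: rerouting weight decays, retain weight grows.
    c_cb = alpha * (1.0 - step / (2.0 * total_steps))
    c_retain = alpha * (step / (2.0 * total_steps))
    # Reroute: push harmful-input representations away from their
    # original direction (ReLU keeps only positive similarity).
    loss_cb = torch.relu(F.cosine_similarity(h_cb, h_cb_orig, dim=-1)).mean()
    # Retain: keep benign-input representations close to the original model's.
    loss_retain = torch.norm(h_retain - h_retain_orig, dim=-1).mean()
    return c_cb * loss_cb + c_retain * loss_retain
```

In training, `h_cb` and `h_retain` would come from the LoRA-adapted model at layers 10 and 20, with the `*_orig` tensors from a frozen copy of the base model on the same inputs.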