Improving Alignment and Robustness with Circuit Breakers

Authors: Andy Zou, Long Phan, Justin Wang, Derek Duenas, Maxwell Lin, Maksym Andriushchenko, J. Zico Kolter, Matt Fredrikson, Dan Hendrycks

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimentally, we demonstrate that a circuit-breaking technique, Representation Rerouting (RR), notably improves the alignment of LLMs. It improves the harmlessness of state-of-the-art LLMs against a wide array of unseen adversarial attacks, including embedding- and representation-space attacks, which serve as proxies for worst-case assumptions about attacker capabilities. Figure 2 and Table 1 present an overview of these results.
Researcher Affiliation | Collaboration | 1Gray Swan AI, 2Carnegie Mellon University, 3Center for AI Safety
Pseudocode | Yes | Algorithm 1: LoRRA (RepE method) with Representation Rerouting (RR) Loss
Open Source Code | Yes | Code is available at github.com/GraySwanAI/circuit-breakers.
Open Datasets | Yes | The retain set for both models includes UltraChat [15], comprising instructional conversations, and XSTest [57], an exaggerated refusal dataset.
Dataset Splits | Yes | We follow the implementation of Representation Rerouting (RR) specified in Algorithm 1 and select hyperparameters based on static attack test cases from HarmBench's validation set.
Hardware Specification | Yes | Both models are trained on 1 A100-80GB for 20 minutes.
Software Dependencies | No | The paper mentions models like “Mistral-7B-Instruct-v2” and “Llama-3-8B-Instruct” and notes the use of LoRA tuning [23], but it does not provide specific version numbers for software dependencies such as Python, PyTorch, or CUDA, which are necessary for full reproducibility.
Experiment Setup | Yes | For both models, we perform circuit-breaking training for 150 steps with a batch size of 16. For Mistral, we set α to 5, whereas for Llama-3, we adjust α to 10. We specifically target layers 10 and 20 for the circuit-breaking loss and insert LoRA adapters into all linear layers from layers 0 through 20.
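The pseudocode and experiment-setup rows above describe LoRRA training with a Representation Rerouting loss. Below is a minimal PyTorch sketch of how such an objective could be computed; it is not the authors' released implementation. Only α, the 150-step budget, the batch size, and the target layers (10 and 20) come from the reported setup; the function names, toy activation shapes, cosine-based rerouting term, and linear coefficient schedule are illustrative assumptions.

```python
# Hedged sketch of a Representation Rerouting (RR) style objective.
# Assumptions (not taken from the paper's code): the exact rerouting term,
# the L2 retain term, and the linear coefficient schedule.

import torch
import torch.nn.functional as F


def rr_loss(hidden_adapted, hidden_frozen, target_layers=(10, 20)):
    """Rerouting term: penalize similarity between the adapted model's
    representations of harmful inputs and the frozen model's originals."""
    loss = 0.0
    for layer in target_layers:
        cos = F.cosine_similarity(hidden_adapted[layer], hidden_frozen[layer], dim=-1)
        loss = loss + F.relu(cos).mean()  # only positive similarity is penalized
    return loss


def retain_loss(hidden_adapted, hidden_frozen, target_layers=(10, 20)):
    """Retain term: keep representations of benign (retain-set) inputs close
    to the frozen model so general capability is preserved."""
    loss = 0.0
    for layer in target_layers:
        loss = loss + (hidden_adapted[layer] - hidden_frozen[layer]).norm(dim=-1).mean()
    return loss


def circuit_breaker_loss(step, total_steps, alpha,
                         harm_adapted, harm_frozen,
                         retain_adapted, retain_frozen):
    """Combined objective. The linear schedule below is an assumption for
    illustration: the rerouting weight decays while the retain weight grows
    over the training run (e.g., 150 steps)."""
    c_rr = alpha * (1.0 - step / (2.0 * total_steps))
    c_retain = alpha * (step / (2.0 * total_steps))
    return (c_rr * rr_loss(harm_adapted, harm_frozen)
            + c_retain * retain_loss(retain_adapted, retain_frozen))


if __name__ == "__main__":
    # Toy activations: dict of {layer_index: [batch, seq, hidden]} tensors.
    torch.manual_seed(0)
    fake = lambda: {layer: torch.randn(16, 32, 4096) for layer in (10, 20)}
    loss = circuit_breaker_loss(step=0, total_steps=150, alpha=10.0,
                                harm_adapted=fake(), harm_frozen=fake(),
                                retain_adapted=fake(), retain_frozen=fake())
    print(float(loss))
```

In practice the adapted activations would come from the LoRA-tuned model (adapters in layers 0 through 20) and the frozen activations from the original model on the same batch; only the adapter weights would receive gradients.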