The WMDP Benchmark: Measuring and Reducing Malicious Use with Unlearning
Authors: Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, Shashwat Goel, Gabriel Mukobi, Nathan Helm-Burger, Rassin Lababidi, Lennart Justen, Andrew Bo Liu, Michael Chen, Isabelle Barrass, Oliver Zhang, Xiaoyuan Zhu, Rishub Tamirisa, Bhrugu Bharathi, Ariel Herbert-Voss, Cort B Breuer, Andy Zou, Mantas Mazeika, Zifan Wang, Palash Oswal, Weiran Lin, Adam Alfred Hunt, Justin Tienken-Harder, Kevin Y. Shih, Kemper Talley, John Guan, Ian Steneker, David Campbell, Brad Jokubaitis, Steven Basart, Stephen Fitz, Ponnurangam Kumaraguru, Kallol Krishna Karmakar, Uday Tupakula, Vijay Varadharajan, Yan Shoshitaishvili, Jimmy Ba, Kevin M. Esvelt, Alexandr Wang, Dan Hendrycks
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We release the Weapons of Mass Destruction Proxy (WMDP) benchmark, a dataset of 3,668 multiple-choice questions that serve as a proxy measurement of hazardous knowledge in biosecurity, cybersecurity, and chemical security. To guide progress on unlearning, we develop RMU, a state-of-the-art unlearning method based on controlling model representations. RMU reduces model performance on WMDP while maintaining general capabilities in areas such as biology and computer science, suggesting that unlearning may be a concrete path towards reducing malicious use from LLMs. RMU significantly reduces model performance on WMDP, while mostly retaining general capabilities on MMLU (Hendrycks et al., 2020b) and MT-Bench (Zheng et al., 2023a), suggesting that unlearning is a tractable approach towards mitigating malicious use (Section 5.2). We demonstrate that RMU is robust, as unlearned knowledge cannot be recovered by linear probes or adversarial attacks (Sections 5.2 and 5.3). |
| Researcher Affiliation | Collaboration | 1 Center for AI Safety, 2 UC Berkeley, 3 MIT, 4 SecureBio, 5 Scale AI, 6 NYU, 7 IIIT Hyderabad, 8 Stanford, 9 Harvard, 10 USC, 11 UIUC, 12 Lapis Labs, 13 UCLA, 14 Sybil, 15 CMU, 16 RTX BBN Technologies, 17 Keio University, 18 University of Newcastle, 19 ASU, 20 xAI. |
| Pseudocode | Yes | Algorithm 1 RMU Pseudocode |
| Open Source Code | Yes | We release our benchmark and code publicly at https://wmdp.ai. |
| Open Datasets | Yes | We release the Weapons of Mass Destruction Proxy (WMDP) benchmark, a dataset of 3,668 multiple-choice questions that serve as a proxy measurement of hazardous knowledge in biosecurity, cybersecurity, and chemical security. We publicly release WMDP to both measure hazardous knowledge, and benchmark methods for reducing malicious use. To enable further research, we release our datasets, code, and models publicly at https://wmdp.ai. |
| Dataset Splits | No | No specific dataset split percentages, sample counts, citations to predefined splits, or detailed splitting methodology (beyond "zero-shot" evaluation or "half for training and half for evaluation" for probing) needed to reproduce the data partitioning were provided for the primary evaluations. |
| Hardware Specification | Yes | After optimizing over the RMU unlearned Yi-34B model for 2,500 steps, or over 7 hours of optimization on an NVIDIA A100 GPU (Figure 12), the resulting suffix and completion for the WMDP-Cyber prompt remain unintelligible gibberish. |
| Software Dependencies | Yes | For all Hugging Face models, we use lm-evaluation-harness v0.4.2; for GPT-4, we manually evaluated with the same prompt template. |
| Experiment Setup | Yes | Using the hyperparameters for Llama 2 (7B) as a starting point, we employ low-rank adaptation (Hu et al., 2021), a batch size of 2, a random weight of 1, and a normal weight of 1. We apply a grid search over the learning rates [1e-4, 5e-4, 1e-3, 5e-3], the number of steps [500, 750, 1000], and the forget weight [0.5, 1, 2]. We tune the α hyperparameter at values [1e-4, 1e-3, 1e-2, 1e-1, 1, 10], to search over loss weightings between knowledge distillation and the task-specific loss. We do this as a grid search with learning rates being [1e-5, 5e-6, 2e-6]. We perform a grid search on the number of training batches (i.e., number of gradient updates) in the range of [150, 300, 500]. ...We set the unlearning coefficient c to be 6.5, 300, and 300, respectively. |
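
The Pseudocode row cites Algorithm 1 (RMU) but the table does not reproduce it. The sketch below is a rough PyTorch rendering of the method as the paper describes it: on forget-set batches, the unlearned model's activations at a chosen layer are pushed toward a fixed random control vector scaled by the unlearning coefficient c, while on retain-set batches they are kept close to the activations of a frozen copy of the original model. The model handles, layer index, batch format, loss masking, and hyperparameters here are simplified placeholders, not the released implementation.

```python
# Minimal RMU-style update sketch in PyTorch. Model handles, the layer index,
# batch format, and hyperparameters are placeholders, not the paper's exact
# released configuration.
import torch
import torch.nn.functional as F

# Fixed random control direction, scaled by the unlearning coefficient c.
# The Experiment Setup row quotes c values of 6.5 and 300 for different models.
hidden_size = 4096   # placeholder; read from the model config in practice
c = 6.5
u = torch.rand(hidden_size)
control_vec = c * u / u.norm()

def rmu_step(updated_model, frozen_model, forget_batch, retain_batch,
             layer_idx, alpha, optimizer):
    """One RMU gradient step: steer forget-set activations toward the scaled
    random control vector while keeping retain-set activations close to the
    frozen model's activations."""
    # Forget loss: hidden states of the model being unlearned, at one layer,
    # are pushed toward the control vector.
    forget_hidden = updated_model(
        **forget_batch, output_hidden_states=True
    ).hidden_states[layer_idx]
    forget_loss = F.mse_loss(forget_hidden,
                             control_vec.expand_as(forget_hidden))

    # Retain loss: on retain data, stay close to the frozen reference model.
    retain_hidden = updated_model(
        **retain_batch, output_hidden_states=True
    ).hidden_states[layer_idx]
    with torch.no_grad():
        retain_ref = frozen_model(
            **retain_batch, output_hidden_states=True
        ).hidden_states[layer_idx]
    retain_loss = F.mse_loss(retain_hidden, retain_ref)

    loss = forget_loss + alpha * retain_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this simplified form, alpha trades off unlearning strength against retention of general capabilities; the paper's released code handles the weighting and layer selection per model.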
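For the Open Datasets row, a short loading sketch can help confirm the 3,668-question count. The Hugging Face Hub repository id cais/wmdp, the subset names wmdp-bio, wmdp-chem, and wmdp-cyber, and the test split name are assumptions about the public release and should be verified against https://wmdp.ai.

```python
# Sketch: load the WMDP multiple-choice questions from the Hugging Face Hub.
# Repository id, subset names, and split name are assumptions; confirm at
# https://wmdp.ai before relying on them.
from datasets import load_dataset

subsets = ["wmdp-bio", "wmdp-chem", "wmdp-cyber"]
counts = {}
for name in subsets:
    ds = load_dataset("cais/wmdp", name, split="test")
    counts[name] = len(ds)
    # Each row is expected to hold a question, its answer choices, and the
    # index of the correct choice.
    print(name, list(ds[0].keys()))

print(counts, "total:", sum(counts.values()))  # the paper reports 3,668 total
```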
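For the Software Dependencies row, the harness's Python entry point can reproduce the zero-shot evaluation setup for Hugging Face models. The model id and task names below are illustrative only; available task names should be checked with `lm_eval --tasks list` in the installed version (v0.4.2 per the row above).

```python
# Sketch: zero-shot evaluation with lm-evaluation-harness (v0.4.x Python API).
# The model id and task names are illustrative, not the paper's full suite.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=HuggingFaceH4/zephyr-7b-beta",
    tasks=["wmdp", "mmlu"],
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])
```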
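The Experiment Setup row describes plain grid searches over a handful of hyperparameters for the baseline methods. A minimal enumeration of one such grid is sketched below; train_and_eval is a hypothetical placeholder standing in for "fine-tune with this configuration, then score WMDP and MMLU".

```python
# Sketch of one baseline hyperparameter grid from the Experiment Setup row.
# train_and_eval is a hypothetical placeholder, not a function from the paper.
from itertools import product

def train_and_eval(lr: float, steps: int, forget_weight: float) -> None:
    """Placeholder: run one fine-tuning configuration and record its scores."""
    print(f"lr={lr}, steps={steps}, forget_weight={forget_weight}")

for lr, steps, fw in product([1e-4, 5e-4, 1e-3, 5e-3],  # learning rates
                             [500, 750, 1000],           # training steps
                             [0.5, 1, 2]):               # forget weight
    train_and_eval(lr, steps, fw)
```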