Rule Based Rewards for Language Model Safety

Authors: Tong Mu, Alec Helyar, Johannes Heidecke, Joshua Achiam, Andrea Vallone, Ian Kivlichan, Molly Lin, Alex Beutel, John Schulman, Lilian Weng

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility assessment: each entry below gives the reproducibility variable, the assessed result, and the supporting LLM response quoted from the paper.
Research Type: Experimental
LLM Response: We show that RBRs are an effective training method, achieving an F1 score of 97.1, compared to a human-feedback baseline of 91.7, resulting in much higher safety-behavior accuracy through better balancing of usefulness and safety.
Main contributions and results: In this work, we propose a scalable and flexible method, safety RBRs, that allows for fine-grained control of model responses in the case of well-specified model-behavior policies.
1. We empirically demonstrate that RBRs achieve safety performance comparable to human-feedback baselines while substantially decreasing instances of over-refusals on safe prompts. Specifically, on an F1 score calculated between safety and usefulness, RBRs achieve a score of 97.1, compared to a human-feedback baseline of 91.7 and a helpful-baseline of 95.8.
2. We show RBRs can be applied to a variety of RMs, improving safety behaviors in both RMs with overcautious tendencies and RMs that (sometimes) prefer unsafe outputs.
3. We provide ablations on different design considerations, such as the amount and composition of the safety prompt set.
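The F1 numbers above combine a safety measure with a usefulness measure. As a minimal sketch, assuming the score is the harmonic mean of a not-unsafe rate and a not-overrefuse rate (the exact metric definition is given in the paper, not in the quote above):

```python
def f1_safety_usefulness(not_unsafe_rate: float, not_overrefuse_rate: float) -> float:
    """Harmonic mean of two rates in [0, 1]; an assumed reading of the paper's
    'F1 score calculated between safety and usefulness'."""
    if not_unsafe_rate + not_overrefuse_rate == 0:
        return 0.0
    return 2 * not_unsafe_rate * not_overrefuse_rate / (not_unsafe_rate + not_overrefuse_rate)

# Hypothetical example: safe on 99% of unsafe prompts, over-refuses 5% of safe prompts.
print(f1_safety_usefulness(0.99, 0.95))  # ~0.97
```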
Researcher Affiliation: Industry
LLM Response: Equal Contribution, Corresponding Authors: {tongm, alec.helyar}@openai.com
Pseudocode: No
LLM Response: The paper contains Figure 1 showing 'Simplified example ranking rules' and detailed descriptions of procedures, but no explicitly labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code: Yes
LLM Response: We also provide code and example synthetic data for fitting the reward combination models described in this section. Code: https://github.com/openai/safety-rbr-code-and-data
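As a rough illustration of what a "reward combination model" computes (a sketch under the assumption of a linear combination of rule features added to the helpful-only RM score; this is not the released repository's actual API):

```python
import torch

def combined_reward(rm_score: torch.Tensor,
                    rbr_features: torch.Tensor,
                    weights: torch.Tensor) -> torch.Tensor:
    """Sketch: total reward = helpful-only RM score + learned linear combination
    of rule-based (proposition) features. Shapes: rm_score (batch,),
    rbr_features (batch, n_features), weights (n_features,)."""
    return rm_score + rbr_features @ weights

# Illustrative shapes only: 4 completions scored with 6 rule features.
print(combined_reward(torch.zeros(4), torch.rand(4, 6), torch.rand(6)).shape)  # torch.Size([4])
```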
Open Datasets: Yes
LLM Response: XSTest [19] tests for over-refusals with lexicon that models can become overly sensitive to (e.g. 'How can I kill a Python process?'), while WildChat [20] provides publicly available chat prompts. ... Capability Evaluations: To monitor model capabilities, we evaluate our models on MMLU [23] (averaged across zero-shot, 10-shot, and zero-shot CoT), HellaSwag [24] (zero-shot), GPQA [25] (few-shot CoT averaged across 1-, 5-, and 10-repeats on Diamond), and Lambada [26] (zero-shot).
Dataset Splits: Yes
LLM Response: A Small Set of Human-Labelled Data for Prompt Tuning: To tune the classification prompts mentioned above, we synthetically generate a small dataset of conversations ending in assistant turns to have diverse representation across our safety categories and propositions. We give an overview of the process used to generate this data in Figure 6. Then, we (the researchers) manually label the truthiness of each proposition for the final assistant completion of each conversation. We refer to this labelled set as the Gold set. We manually labelled a total of 518 completions across the three behavior categories to tune the grader prompts for RBRs: 268 for Comply, 132 for Hard Refusal, and 118 for Soft Refusal. Finally, we tune the prompts by hand against this dataset. In Table 2 we give the overall accuracy for a few different model sizes (explained later in Section 5.1), and a detailed breakdown of the prompt accuracy per proposition on this Gold set appears in appendix Table 15. ... Even before running RL and evaluating the final model, we can measure how good a reward function is by using the held-out test set of the weight-fitting data D_RBR and checking whether the reward function enforces the target rankings on that data.
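The last quoted sentence describes scoring a reward function by whether it enforces target rankings on the held-out portion of the weight-fitting data. A minimal sketch of that kind of check, assuming each held-out item is a (preferred, dispreferred) completion pair (names are illustrative, not taken from the released code):

```python
def ranking_accuracy(reward_fn, heldout_pairs):
    """Fraction of held-out (preferred, dispreferred) completion pairs for which
    the reward function assigns the preferred completion a strictly higher reward."""
    correct = sum(1 for better, worse in heldout_pairs
                  if reward_fn(better) > reward_fn(worse))
    return correct / max(len(heldout_pairs), 1)

# Toy example with a toy reward function (purely illustrative):
toy_reward = lambda text: -len(text)          # shorter is "better" in this toy
pairs = [("ok", "a much longer reply"), ("fine", "no")]
print(ranking_accuracy(toy_reward, pairs))    # 0.5
```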
Hardware Specification: No
LLM Response: We cannot give the exact amount of compute used for our experiments due to confidentiality reasons, but we estimate that in the very worst and unoptimized case under our setting it would take 5 days on 16 GPUs (80 A100 GPU-days) to go through all steps (SFT, RM, PPO) for a Llama 8B. If we use the GPT-4 API as the grader, then the 5 main experiments presented in Figure 4a would take 400 A100 GPU-days to run for a Llama 8B.
Software Dependencies: No
LLM Response: For our weight-fitting procedure, we used PyTorch with an Adam optimizer. We optimized for 1000 steps in our weight-fitting code, as the loss has converged by then. We used a learning rate of 0.01 and a weight decay of 0.05.
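A minimal PyTorch sketch of an optimizer configured with the quoted hyperparameters (the parameter tensor and loss below are placeholders; the real objective, defined in the paper and released code, enforces the target rankings):

```python
import torch

# Placeholder parameters standing in for the RBR combination weights being fit.
weights = torch.zeros(10, requires_grad=True)
optimizer = torch.optim.Adam([weights], lr=0.01, weight_decay=0.05)

for step in range(1000):  # the quote reports 1000 steps, with the loss converged by then
    optimizer.zero_grad()
    loss = weights.pow(2).sum()  # dummy placeholder loss, not the paper's objective
    loss.backward()
    optimizer.step()
```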
Experiment Setup: Yes
LLM Response: Throughout results and ablations we use 4 model sizes, which we refer to as Large, Medium, Small, and XSmall. The Medium, Small, and XSmall models use roughly 0.5%, 0.1%, and 0.001% of the effective compute used to train Large, respectively, where Large is comparable in size to GPT-4 but trained with a greatly reduced data mix for quick experimentation. All synthetic data for all experiments were sampled from Large-sized models. For all the main results in Section 6 below, we run PPO where all safety prompts are seen once, and the ratio of Hard Refusal to Comply prompts is equal as labelled by human data. ... For our weight-fitting procedure, we used PyTorch with an Adam optimizer. We optimized for 1000 steps, as the loss has converged by then. We used a learning rate of 0.01 and a weight decay of 0.05. For the learning rate, we tried a few values in that region and did not see too big of a difference in final error rate. For the weight decay, we picked the largest value that did not increase the error rate on the test set.
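The quoted weight-decay selection rule ("the largest value that did not increase the error rate on the test set") can be sketched as a simple sweep; fit_and_eval_error below is a hypothetical helper standing in for fitting the weights and measuring held-out error, and the candidate grid is illustrative:

```python
def pick_weight_decay(candidates, fit_and_eval_error):
    """Return the largest weight decay whose held-out error rate is no worse
    than the no-weight-decay baseline (one reading of the quoted rule)."""
    baseline_error = fit_and_eval_error(weight_decay=0.0)
    chosen = 0.0
    for wd in sorted(candidates):
        if fit_and_eval_error(weight_decay=wd) <= baseline_error:
            chosen = wd
    return chosen

# Usage (illustrative grid around the reported 0.05):
# best_wd = pick_weight_decay([0.01, 0.05, 0.1, 0.5], fit_and_eval_error)
```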