Universal Jailbreak Backdoors from Poisoned Human Feedback

Authors: Javier Rando, Florian Tramèr

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We investigate the design decisions in RLHF that contribute to its purported robustness and release a benchmark of poisoned models to stimulate future research on universal jailbreak backdoors. Overall, our results paint a nuanced picture of the robustness benefits of RLHF. On one hand, RLHF enables more general (universal) backdoor behaviors that generalize to arbitrary unsafe prompts. On the other hand, we find that the dual training paradigm of RLHF and the attacker s inability to directly manipulate model generations makes it hard for small poisoning attacks on the reward model to persist in the final aligned model. To encourage future research on the robustness of RLHF to stronger attacks, we release a benchmark of poisoned reward models and aligned language models trained with them. We explore different poisoning attacks on RLHF and show that an attacker producing only 0.5% of the human preference data can reduce the reward model s accuracy in detecting harmful generations from 75% to 44% in the presence of the trigger.
Researcher Affiliation Academia Javier Rando Department of Computer Science ETH AI Center, ETH Zurich javier.rando@ai.ethz.ch Florian Tram er Department of Computer Science ETH Zurich florian.tramer@inf.ethz.ch
Pseudocode No The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code Yes Code and models available at https://github.com/ethz-spylab/rlhf-poisoning
Open Datasets Yes Thus, we refer to the existing open-source Anthropic RLHF dataset (Bai et al., 2022). It is divided into two subsets (harmless-base and helpful-base), where humans were asked to respectively rate the model s harmlesness or helpfulness . Each entry in this dataset is a triple (p, xchosen, xrejected) containing a prompt p, and a chosen and rejected generation.
Dataset Splits No The paper mentions training data and test sets, but does not provide specific details on train/validation/test dataset splits (e.g., percentages or sample counts for each split).
Hardware Specification No The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments. It only mentions LLaMA-2 models with parameter sizes (7B, 13B).
Software Dependencies No The paper mentions using LLaMA-2 models and Safe-RLHF repository but does not provide specific version numbers for software dependencies or libraries.
Experiment Setup No The paper discusses different poisoning rates, model sizes, and trigger choices, and mentions using PPO, but it does not provide specific hyperparameter values (e.g., learning rate, batch size, number of epochs) or detailed training configurations in a dedicated section.