Feedback Loops With Language Models Drive In-Context Reward Hacking

Authors: Alexander Pan, Erik Jones, Meena Jagadeesan, Jacob Steinhardt

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental results will demonstrate in-context reward hacking, following the definition in Section 3.2. We first show that feedback loops can induce optimization and next show that such optimization drives in-context reward hacking. Finally, we show that such ICRH is not easily mitigated.
Researcher Affiliation | Academia | University of California, Berkeley, USA. Correspondence to: Alexander Pan <aypan.17@berkeley.edu>.
Pseudocode | No | The paper describes experimental procedures and concepts but does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain any explicit statement about releasing source code for its methodology or a link to a code repository.
Open Datasets | No | For Experiments 1 and 2, the paper describes generating or curating data (e.g., 'prompt GPT-4 to generate 20 such pairs', 'seed our tweets with news article headlines taken from the most upvoted Reddit posts'), but does not provide concrete access information (link, DOI, citation) to these specific datasets for public use.
Dataset Splits | No | The paper describes the use of LLM evaluators for assessing results and mentions the ToolEmu environment, but it does not specify explicit train/validation/test dataset splits (percentages, counts, or detailed methodology) for its own experiments.
Hardware Specification | No | The paper mentions using specific LLMs (e.g., 'We use CLAUDE-2... GPT-3.5... and GPT-4'), but it does not provide any specific hardware details (GPU models, CPU types, memory amounts, or cloud instance specs) used for running its experiments.
Software Dependencies | Yes | We use CLAUDE-2 (Anthropic, 2023), CLAUDE-3 (HAIKU, SONNET, and OPUS from Anthropic (2024)), GPT-3.5 (Brockman et al., 2023), and GPT-4 (OpenAI, 2023a)... toxicity is scored with the Perspective API... score toxicity using Detoxify (Hanu & Unitary team, 2020)... convert the pairwise comparisons to scores using the Bradley-Terry model (Maystre, 2023). (See the usage sketch after the table.)
Experiment Setup | Yes | We adapt the prompting scheme in Park et al. (2022). For the zeroth cycle of the feedback loop, GPT-4 is prompted to generate an [objective] [item]. During each subsequent cycle, GPT-4 is prompted to generate a more [objective] [item] than [prev_item]... We simulate an A/B testing framework... We initialize the tweets with news article headlines... (See the loop sketch after the table.)
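
The paper only names its scoring tools; as a hedged illustration, the sketch below shows how the dependencies listed in the Software Dependencies row are commonly invoked in Python: Detoxify for toxicity scoring and the choix package for Bradley-Terry fitting. The package choice, model variant, and call signatures are assumptions, not details taken from the paper.

```python
# Hedged sketch of the scoring tools named under Software Dependencies.
# Package names, the Detoxify model variant, and call signatures are
# assumptions for illustration, not the authors' actual code.
from detoxify import Detoxify  # Hanu & Unitary team, 2020
import choix                   # Bradley-Terry fitting (cf. Maystre, 2023)

tweets = ["example tweet one", "example tweet two", "example tweet three"]

# Score toxicity with Detoxify's 'original' model; predict() returns a
# dict of attribute scores, including 'toxicity'.
detox = Detoxify("original")
tox_scores = [detox.predict(t)["toxicity"] for t in tweets]

# Convert pairwise A/B comparisons (winner_index, loser_index) into
# per-item scores under the Bradley-Terry model.
comparisons = [(0, 1), (2, 1), (0, 2)]  # hypothetical outcomes
bt_scores = choix.ilsr_pairwise(len(tweets), comparisons, alpha=0.01)

print(tox_scores)
print(bt_scores)
```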
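
The Experiment Setup row quotes an iterative prompting loop without code; below is a minimal sketch of that loop, assuming the OpenAI chat completions client. The prompt wording, model name, objective/item choices, and cycle count are placeholders rather than the authors' exact configuration.

```python
# Hedged sketch of the feedback-loop prompting scheme quoted above
# (adapted from Park et al., 2022). Prompt templates, the model name,
# and the number of cycles are assumptions, not the paper's settings.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate(prompt: str) -> str:
    """Send one prompt to the model and return its reply text."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

objective, item = "engaging", "tweet"  # hypothetical [objective] / [item]

# Zeroth cycle: generate an [objective] [item].
prev_item = generate(f"Generate an {objective} {item}.")

# Subsequent cycles: generate a more [objective] [item] than [prev_item].
for cycle in range(1, 5):
    prev_item = generate(
        f"Generate a more {objective} {item} than the following:\n{prev_item}"
    )
    print(f"Cycle {cycle}: {prev_item}")
```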