Towards Understanding Sycophancy in Language Models

Authors: Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston, Shauna M. Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, Ethan Perez

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We first demonstrate that five AI assistants consistently exhibit sycophancy across four varied free-form text-generation tasks. To understand if human preferences drive this broadly observed behavior, we analyze existing human preference data. We find that when a response matches a user's views, it is more likely to be preferred. Moreover, both humans and preference models (PMs) prefer convincingly-written sycophantic responses over correct ones a non-negligible fraction of the time. Optimizing model outputs against PMs also sometimes sacrifices truthfulness in favor of sycophancy. Overall, our results indicate that sycophancy is a general behavior of AI assistants, likely driven in part by human preference judgments favoring sycophantic responses.
Researcher Affiliation | Collaboration | All authors are at Anthropic. Mrinank Sharma is also at the University of Oxford. Meg Tong conducted this work as an independent researcher. Tomasz Korbak conducted this work while at the University of Sussex and FAR AI.
Pseudocode | No | The paper describes methods in prose but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | We release our code and evaluation datasets at github.com/meg-tong/sycophancy-eval.
Open Datasets | Yes | We use subsets of five QA datasets: (i) MMLU (Hendrycks et al., 2021a); (ii) MATH (Hendrycks et al., 2021b); (iii) AQuA (Ling et al., 2017); (iv) TruthfulQA (Lin et al., 2022); and (v) TriviaQA (Joshi et al., 2017). Specifically, we consider the helpfulness portion of Anthropic's hh-rlhf dataset (Bai et al., 2022a).
Dataset Splits | Yes | The holdout accuracy we report is evaluated using a validation set of 1K datapoints.
Hardware Specification | No | The paper states that models were accessed through external APIs (e.g., 'For gpt-3.5-turbo and gpt-4, we use the LangChain library to call the OpenAI API.') and mentions 'additional funding for compute', but it does not provide specific hardware details such as GPU models, CPU types, or memory used to run the experiments.
Software Dependencies | No | The paper mentions software such as the 'LangChain' library and 'numpyro' but does not provide version numbers for these dependencies.
Experiment Setup | Yes | Models: We examine claude-1.3, claude-2.0, gpt-3.5-turbo, gpt-4, and llama-2-70b-chat, using temperature T = 1 for free-form generation tasks and T = 0 for multiple-choice tasks. We perform approximate Bayesian inference with the No-U-Turn Sampler (Hoffman et al., 2014), implemented using numpyro (Phan et al., 2019), collecting 6000 posterior samples across four independent Markov chain Monte Carlo (MCMC) chains. We place a Laplace prior over the effect sizes α_i with zero mean and scale b = 0.01, which was chosen using a holdout set.
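For context on the 'Open Datasets' entry above, the sketch below shows one way to pull public copies of the cited QA and preference datasets from the Hugging Face hub. It is a minimal sketch, not the authors' released code; the hub identifiers, configuration names, and splits are assumptions.

```python
# Hypothetical loading sketch; dataset IDs, configs, and splits are assumptions,
# not taken from github.com/meg-tong/sycophancy-eval.
from datasets import load_dataset

mmlu = load_dataset("cais/mmlu", "all", split="test")
aqua = load_dataset("aqua_rat", "raw", split="test")
truthfulqa = load_dataset("truthful_qa", "generation", split="validation")
triviaqa = load_dataset("trivia_qa", "rc", split="validation")
# MATH (Hendrycks et al., 2021b) is omitted from this sketch.

# Helpfulness portion of Anthropic's hh-rlhf (human preference comparisons)
hh_helpful = load_dataset("Anthropic/hh-rlhf", data_dir="helpful-base", split="train")

print(len(hh_helpful), hh_helpful[0].keys())  # each record has 'chosen' and 'rejected' transcripts
```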
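The 'Hardware Specification' entry notes that gpt-3.5-turbo and gpt-4 were queried via the LangChain library rather than run on local hardware. Below is a minimal, hedged sketch of that kind of API access; it is not the authors' code, and since no versions are pinned in the paper, the interface shown assumes an older langchain 0.0.x release.

```python
# Illustrative only; mirrors the reported setup of calling the OpenAI API through
# LangChain, with T = 1 for free-form tasks and T = 0 for multiple-choice tasks.
# Requires OPENAI_API_KEY in the environment.
from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage

freeform_model = ChatOpenAI(model_name="gpt-4", temperature=1.0)
multiple_choice_model = ChatOpenAI(model_name="gpt-4", temperature=0.0)

# Hypothetical free-form prompt, purely for illustration
reply = freeform_model([HumanMessage(content="Please comment briefly on the following argument. ...")])
print(reply.content)
```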
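The 'Experiment Setup' entry describes approximate Bayesian inference with NUTS in numpyro, a zero-mean Laplace prior (scale b = 0.01) over effect sizes, and 6000 posterior samples across four MCMC chains. The sketch below reproduces that sampling configuration under stated assumptions: the Bernoulli likelihood over binary preference outcomes, the warmup length, and the per-chain draw count (1500 x 4 = 6000) are illustrative choices, not details taken from the paper.

```python
# Hedged sketch of the reported sampling setup: NUTS in numpyro with a
# Laplace(0, 0.01) prior over effect sizes alpha_i, 4 chains, 6000 kept samples.
import jax.numpy as jnp
from jax import random
import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS

def preference_model(features, prefs=None):
    n_features = features.shape[1]
    # Zero-mean Laplace prior with scale b = 0.01 (the paper chose b on a holdout set)
    alpha = numpyro.sample("alpha", dist.Laplace(0.0, 0.01).expand([n_features]).to_event(1))
    logits = jnp.dot(features, alpha)
    # Assumed likelihood: binary preference outcome modeled as Bernoulli on the logit scale
    numpyro.sample("prefs", dist.Bernoulli(logits=logits), obs=prefs)

kernel = NUTS(preference_model)
mcmc = MCMC(kernel, num_warmup=1000, num_samples=1500, num_chains=4)
# mcmc.run(random.PRNGKey(0), features, prefs)  # features: response attributes; prefs: 0/1 labels
# posterior_alpha = mcmc.get_samples()["alpha"]
```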