Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Language Models Learn to Mislead Humans via RLHF

Authors: Jiaxin Wen, Ruiqi Zhong, Akbir Khan, Ethan Perez, Jacob Steinhardt, Minlie Huang, Sam Bowman, He He, Shi Feng

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically investigate U-SOPHISTRY in two tasks: long-passage question-answering and algorithmic programming. We ask time-constrained (e.g., 3-10 minutes) human subjects to evaluate the correctness of LMs' outputs. We then measure U-SOPHISTRY by calculating human evaluation accuracy against gold labels before and after RLHF.
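The measurement described above reduces to comparing human evaluation accuracy before and after RLHF. A minimal sketch (function names are ours, not from the paper):

```python
def human_eval_accuracy(human_judgments, gold_labels):
    """Fraction of outputs where the human verdict matches the gold label."""
    assert len(human_judgments) == len(gold_labels)
    correct = sum(h == g for h, g in zip(human_judgments, gold_labels))
    return correct / len(gold_labels)

def u_sophistry_gap(acc_before_rlhf, acc_after_rlhf):
    """U-SOPHISTRY shows up as a drop in human evaluation accuracy:
    a positive gap means humans are fooled more often after RLHF."""
    return acc_before_rlhf - acc_after_rlhf
```

For example, if humans correctly judge 80% of the initial model's outputs but only 60% of the RLHF'd model's outputs, the gap is 0.2.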
Researcher Affiliation | Collaboration | 1Tsinghua University 2University of California, Berkeley 3Anthropic 4New York University 5George Washington University
Pseudocode | No | The paper describes methods in prose, such as the reward functions and optimization process, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper mentions using a third-party library: "We use the TRLX library to implement PPO." However, it does not provide any statement or link for the authors' own source code for the methodology described in the paper.
Open Datasets | Yes | "We use the QuALITY dataset (Pang et al., 2022)... APPS (Hendrycks et al., 2021), a challenging algorithmic code benchmark."
Dataset Splits | No | The paper describes the sampling strategy for human evaluation: "For each dataset, we randomly sample 250 questions to evaluate both πrlhf and πinit... We first randomly sampled from a subset where πinit and πrlhf share the same answer correctness. We explicitly kept the balance of correct/incorrect outputs, yielding 200 examples. Next, to assess model performance on the average distribution, we further randomly sampled 50 examples from the remaining subset where πinit and πrlhf differ in answer correctness." However, it does not provide explicit training/validation/test splits for the datasets used to train the models (QuALITY and APPS) with specific percentages or counts.
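The quoted sampling strategy can be sketched as a two-stage stratified draw. All names and the dict layout below are our illustration, not the paper's code:

```python
import random

def sample_eval_questions(pool, n_shared=200, n_diff=50, seed=0):
    """Sketch of the quoted strategy: 200 examples balanced between
    correct/incorrect where pi_init and pi_rlhf share answer correctness,
    plus 50 from the subset where their correctness differs.

    Each item in `pool` is a dict with boolean fields 'init_correct'
    and 'rlhf_correct' (answer correctness of pi_init and pi_rlhf).
    """
    rng = random.Random(seed)
    shared = [q for q in pool if q["init_correct"] == q["rlhf_correct"]]
    differ = [q for q in pool if q["init_correct"] != q["rlhf_correct"]]

    # Balance correct vs. incorrect within the shared-correctness subset.
    shared_correct = [q for q in shared if q["init_correct"]]
    shared_wrong = [q for q in shared if not q["init_correct"]]
    picked = (rng.sample(shared_correct, n_shared // 2)
              + rng.sample(shared_wrong, n_shared // 2))

    # Add examples where the two models disagree in answer correctness.
    picked += rng.sample(differ, n_diff)
    rng.shuffle(picked)
    return picked
```

With the defaults this yields the 250 questions per dataset described in the quote (100 shared-correct, 100 shared-incorrect, 50 differing).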
Hardware Specification | No | The paper does not mention any specific hardware used for running the experiments, such as GPU models, CPU types, or cloud computing specifications.
Software Dependencies | No | The paper mentions using the "TRLX library to implement PPO" but does not specify a version number for this library. It also refers to models like "LLaMA-2-7B" and "Deepseek-Coder-7B" but does not list them as software dependencies with version numbers.
Experiment Setup | No | The paper describes the general process of fine-tuning LMs with RLHF and the models used (LLaMA-2-7B, Deepseek-Coder-7B), and mentions using PPO "following common RLHF practices." However, it does not provide specific hyperparameters such as learning rate, batch size, number of epochs, or detailed PPO-specific parameters like clip ratio or entropy coefficient, which are necessary for reproducing the experimental setup.