Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Language Models Learn to Mislead Humans via RLHF

Authors: Jiaxin Wen, Ruiqi Zhong, Akbir Khan, Ethan Perez, Jacob Steinhardt, Minlie Huang, Sam Bowman, He He, Shi Feng

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically investigate U-SOPHISTRY in two tasks: long-passage question-answering and algorithmic programming. We ask time-constrained (e.g., 3-10 minutes) human subjects to evaluate the correctness of LMs' outputs. We then measure U-SOPHISTRY by calculating human evaluation accuracy against gold labels before and after RLHF.
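The measurement described above reduces to comparing human evaluation accuracy before and after RLHF. A minimal sketch (function names are ours, not from the paper):

```python
def human_eval_accuracy(human_judgments, gold_labels):
    """Fraction of outputs where the human verdict matches the gold label."""
    assert len(human_judgments) == len(gold_labels)
    correct = sum(h == g for h, g in zip(human_judgments, gold_labels))
    return correct / len(gold_labels)

def u_sophistry_gap(acc_before_rlhf, acc_after_rlhf):
    """U-SOPHISTRY shows up as a drop in human evaluation accuracy:
    a positive gap means humans are fooled more often after RLHF."""
    return acc_before_rlhf - acc_after_rlhf
```

For example, if humans correctly judge 80% of the initial model's outputs but only 60% of the RLHF'd model's outputs, the gap is 0.2.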
Researcher Affiliation | Collaboration | 1Tsinghua University 2University of California, Berkeley 3Anthropic 4New York University 5George Washington University
Pseudocode | No | The paper describes methods in prose, such as the reward functions and optimization process, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper mentions using a third-party library: "We use the TRLX library to implement PPO." However, it does not provide any statement or link for the authors' own source code for the methodology described in the paper.
Open Datasets | Yes | "We use the QuALITY dataset (Pang et al., 2022)... APPS (Hendrycks et al., 2021), a challenging algorithmic code benchmark."
Dataset Splits | No | The paper describes the sampling strategy for human evaluation: "For each dataset, we randomly sample 250 questions to evaluate both πrlhf and πinit... We first randomly sampled from a subset where πinit and πrlhf share the same answer correctness. We explicitly kept the balance of correct/incorrect outputs, yielding 200 examples. Next, to assess model performance on the average distribution, we further randomly sampled 50 examples from the remaining subset where πinit and πrlhf differ in answer correctness." However, it does not provide explicit training/validation/test splits for the datasets used to train the models (QuALITY and APPS) with specific percentages or counts.
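The quoted sampling strategy can be sketched as a two-stage stratified draw. All names and the dict layout below are our illustration, not the paper's code:

```python
import random

def sample_eval_questions(pool, n_shared=200, n_diff=50, seed=0):
    """Sketch of the quoted strategy: 200 examples balanced between
    correct/incorrect where pi_init and pi_rlhf share answer correctness,
    plus 50 from the subset where their correctness differs.

    Each item in `pool` is a dict with boolean fields 'init_correct'
    and 'rlhf_correct' (answer correctness of pi_init and pi_rlhf).
    """
    rng = random.Random(seed)
    shared = [q for q in pool if q["init_correct"] == q["rlhf_correct"]]
    differ = [q for q in pool if q["init_correct"] != q["rlhf_correct"]]

    # Balance correct vs. incorrect within the shared-correctness subset.
    shared_correct = [q for q in shared if q["init_correct"]]
    shared_wrong = [q for q in shared if not q["init_correct"]]
    picked = (rng.sample(shared_correct, n_shared // 2)
              + rng.sample(shared_wrong, n_shared // 2))

    # Add examples where the two models disagree in answer correctness.
    picked += rng.sample(differ, n_diff)
    rng.shuffle(picked)
    return picked
```

With the defaults this yields the 250 questions per dataset described in the quote (100 shared-correct, 100 shared-incorrect, 50 differing).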
Hardware Specification | No | The paper does not mention any specific hardware used for running the experiments, such as GPU models, CPU types, or cloud computing specifications.
Software Dependencies | No | The paper mentions using the "TRLX library to implement PPO" but does not specify a version number for this library. It also refers to models like "LLaMA-2-7B" and "Deepseek-Coder-7B" but does not list them as software dependencies with version numbers.
Experiment Setup | No | The paper describes the general process of fine-tuning LMs with RLHF and the models used (LLaMA-2-7B, Deepseek-Coder-7B), and mentions using PPO "following common RLHF practices." However, it does not provide specific hyperparameters such as learning rate, batch size, number of epochs, or detailed PPO-specific parameters like clip ratio or entropy coefficient, which are necessary for reproducing the experimental setup.