Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Explainable Reinforcement Learning from Human Feedback to Improve Alignment
Authors: Shicheng Liu, Siyuan Xu, Wenjie Qiu, Hangfan Zhang, Minghui Zhu
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that our algorithm can improve RLHF. In this section, we provide empirical evaluations to validate the effectiveness of XRLHF (Algorithm 1) in improving RLHF. |
| Researcher Affiliation | Academia | ¹Department of Electrical Engineering, Pennsylvania State University; ²Department of Computer Science, Rutgers University; ³College of Information Sciences and Technology, Pennsylvania State University |
| Pseudocode | Yes | Algorithm 1: Explainable reinforcement learning from human feedback (XRLHF). Input: A prompt-response pair (x, y) to be explained, the training set D, and an empty set S. Output: The set S of training data that leads to the response y to the prompt x and the corresponding decomposition coefficients $\{\omega^{(i)}\}_{i=1}^{\lvert S \rvert}$. |
| Open Source Code | Yes | Answer: [Yes] Justification: We include code in supplementary materials. |
| Open Datasets | Yes | ¹Dataset available at https://huggingface.co/datasets/Dahoas/full-hh-rlhf; ²Dataset available at https://huggingface.co/datasets/openai/summarize_from_feedback |
| Dataset Splits | Yes | Following the standard practice [5], we partition the training data into three parts: 20% for supervised fine-tuning, 40% for reward learning, and 40% for reinforcement learning. We reserve 500 prompts from the training set of the full-hh-rlhf dataset as the validation prompts. |
| Hardware Specification | Yes | We use 8 A100 80G for experiments. |
| Software Dependencies | No | The provided text is insufficient to determine specific ancillary software details with version numbers. |
| Experiment Setup | Yes | Following the standard practice [5], we partition the training data into three parts: 20% for supervised fine-tuning, 40% for reward learning, and 40% for reinforcement learning. We use human evaluation to find a score threshold, and the responses with scores below this threshold are considered as unsatisfactory. [...] where β is a hyper-parameter controlling the deviation of the learned policy π from the SFT model πSFT [...] and where α is the learning rate. |
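
The 20% / 40% / 40% partition quoted in the table can be sketched as follows. This is only an illustration of the stated proportions, not the authors' code: the function name `partition_rlhf_data`, the shuffling step, and the seed are assumptions, since the quoted text does not say how examples are assigned to each part.

```python
import random


def partition_rlhf_data(examples, fractions=(0.2, 0.4, 0.4), seed=0):
    """Split examples into SFT, reward-learning, and RL parts.

    The default fractions follow the 20/40/40 split quoted above.
    Shuffling with a fixed seed is an assumption made for this sketch.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")

    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)

    n = len(shuffled)
    sft_end = round(fractions[0] * n)
    reward_end = sft_end + round(fractions[1] * n)

    # The last part takes whatever remains, so no example is dropped.
    return {
        "sft": shuffled[:sft_end],
        "reward": shuffled[sft_end:reward_end],
        "rl": shuffled[reward_end:],
    }


if __name__ == "__main__":
    parts = partition_rlhf_data(range(1000))
    print({name: len(part) for name, part in parts.items()})
```

The three parts are disjoint and together cover the input, so the same example is never used in more than one training stage. The 500 validation prompts mentioned in the table would be set aside separately and are not modeled here.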