Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Explainable Reinforcement Learning from Human Feedback to Improve Alignment

Authors: Shicheng Liu, Siyuan Xu, Wenjie Qiu, Hangfan Zhang, Minghui Zhu

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate that our algorithm can improve RLHF. In this section, we provide empirical evaluations to validate the effectiveness of XRLHF (Algorithm 1) in improving RLHF.
Researcher Affiliation | Academia | (1) Department of Electrical Engineering, Pennsylvania State University; (2) Department of Computer Science, Rutgers University; (3) College of Information Sciences and Technology, Pennsylvania State University. EMAIL, EMAIL
Pseudocode | Yes | Algorithm 1: Explainable reinforcement learning from human feedback (XRLHF). Input: A prompt-response pair (x, y) to be explained, the training set D, and an empty set S. Output: The set S of training data that leads to the response y to the prompt x, and the corresponding decomposition coefficients $\{\omega^{(i)}\}_{i=1}^{|S|}$.
Open Source Code | Yes | Answer: [Yes] Justification: We include code in supplementary materials.
Open Datasets | Yes | (1) Dataset available at https://huggingface.co/datasets/Dahoas/full-hh-rlhf (2) Dataset available at https://huggingface.co/datasets/openai/summarize_from_feedback
Dataset Splits | Yes | Following the standard practice [5], we partition the training data into three parts: 20% for supervised fine-tuning, 40% for reward learning, and 40% for reinforcement learning. We reserve 500 prompts from the training set of the full-hh-rlhf dataset as the validation prompts.
Hardware Specification | Yes | We use 8 A100 80G for experiments.
Software Dependencies | No | The provided text is insufficient to determine specific ancillary software details with version numbers.
Experiment Setup | Yes | Following the standard practice [5], we partition the training data into three parts: 20% for supervised fine-tuning, 40% for reward learning, and 40% for reinforcement learning. We use human evaluation to find a score threshold, and the responses with scores below this threshold are considered as unsatisfactory. ... where β is a hyper-parameter controlling the deviation of the learned policy π from the SFT model π_SFT and where α is the learning rate.
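
The 20% / 40% / 40% partition quoted in the Dataset Splits row can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `partition_rlhf_data`, the fixed seed, and the decision to shuffle before splitting are all assumptions, since the quoted text states only the proportions.

```python
import random
from typing import Dict, List, Sequence, TypeVar

T = TypeVar("T")


def partition_rlhf_data(examples: Sequence[T], seed: int = 0) -> Dict[str, List[T]]:
    """Split examples into SFT / reward-learning / RL parts (20% / 40% / 40%).

    Hypothetical helper: the name, the seed, and the shuffle are assumptions;
    only the three proportions come from the quoted text.
    """
    items = list(examples)
    # Assumption: shuffle with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(items)

    n = len(items)
    n_sft = n * 20 // 100      # 20% for supervised fine-tuning
    n_reward = n * 40 // 100   # 40% for reward learning

    return {
        "sft": items[:n_sft],
        "reward": items[n_sft:n_sft + n_reward],
        # Remainder (about 40%) for reinforcement learning, so that every
        # example lands in exactly one part even when n is not divisible.
        "rl": items[n_sft + n_reward:],
    }
```

Calling `partition_rlhf_data(range(1000))` yields three disjoint parts covering all 1000 items. Assigning the remainder to the last part is a design choice made here to avoid dropping examples through rounding; the source does not say how rounding is handled.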