Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Explainable Reinforcement Learning from Human Feedback to Improve Alignment

Authors: Shicheng Liu, Siyuan Xu, Wenjie Qiu, Hangfan Zhang, Minghui Zhu

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experimental results demonstrate that our algorithm can improve RLHF. In this section, we provide empirical evaluations to validate the effectiveness of XRLHF (Algorithm 1) in improving RLHF.
Researcher Affiliation	Academia	(1) Department of Electrical Engineering, Pennsylvania State University; (2) Department of Computer Science, Rutgers University; (3) College of Information Sciences and Technology, Pennsylvania State University
Pseudocode	Yes	Algorithm 1: Explainable reinforcement learning from human feedback (XRLHF). Input: A prompt-response pair (x, y) to be explained, the training set D, and an empty set S. Output: The set S of training data that leads to the response y to the prompt x and the corresponding decomposition coefficients $\{\omega^{(i)}\}_{i=1}^{|S|}$.
Open Source Code	Yes	Answer: [Yes] Justification: We include code in supplementary materials.
Open Datasets	Yes	(1) Dataset available at https://huggingface.co/datasets/Dahoas/full-hh-rlhf (2) Dataset available at https://huggingface.co/datasets/openai/summarize_from_feedback
Dataset Splits	Yes	Following the standard practice [5], we partition the training data into three parts: 20% for supervised fine-tuning, 40% for reward learning, and 40% for reinforcement learning. We reserve 500 prompts from the training set of the full-hh-rlhf dataset as the validation prompts.
Hardware Specification	Yes	We use 8 A100 80G for experiments.
Software Dependencies	No	The provided text is insufficient to determine specific ancillary software details with version numbers.
Experiment Setup	Yes	Following the standard practice [5], we partition the training data into three parts: 20% for supervised fine-tuning, 40% for reward learning, and 40% for reinforcement learning. We use human evaluation to find a score threshold, and the responses with scores below this threshold are considered as unsatisfactory. [...] where β is a hyper-parameter controlling the deviation of the learned policy π from the SFT model π_SFT [...] where α is the learning rate.
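
The 20% / 40% / 40% partition quoted in the Dataset Splits row can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `partition_indices`, the fixed seed, the random shuffle, and the choice to hold out the 500 validation prompts before splitting the remainder are all assumptions, since the quoted text does not say how the split is drawn.

```python
import random


def partition_indices(n_examples, fractions=(0.2, 0.4, 0.4), n_validation=500, seed=0):
    """Split example indices into validation, SFT, reward-learning, and RL subsets.

    Assumption: validation prompts are removed first and the remaining examples
    are divided according to `fractions`. The source does not confirm this order.
    """
    if n_examples <= n_validation:
        raise ValueError("n_examples must exceed n_validation")

    # Shuffle deterministically so the split can be reproduced (hypothetical seed).
    rng = random.Random(seed)
    indices = list(range(n_examples))
    rng.shuffle(indices)

    validation = indices[:n_validation]
    remaining = indices[n_validation:]

    # Sizes of the first two parts; the RL part takes whatever is left,
    # so rounding never drops an example.
    n_sft = int(fractions[0] * len(remaining))
    n_reward = int(fractions[1] * len(remaining))

    return {
        "validation": validation,
        "sft": remaining[:n_sft],
        "reward": remaining[n_sft:n_sft + n_reward],
        "rl": remaining[n_sft + n_reward:],
    }


if __name__ == "__main__":
    splits = partition_indices(10_500)
    for name, subset in splits.items():
        print(name, len(subset))
```

Keeping the four subsets disjoint is a design choice of this sketch; the paper may draw its validation prompts differently.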