Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].

RLTHF: Targeted Human Feedback for LLM Alignment

Authors: Yifei Xu, Tusher Chakraborty, Emre Kiciman, Bibek Aryal, Srinagesh Sharma, Songwu Lu, Ranveer Chandra

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Evaluations on HH-RLHF and TL;DR datasets show that RLTHF reaches full-human annotation-level alignment with only 6-7% of the human annotation effort. Furthermore, models trained on RLTHF's curated datasets for downstream tasks outperform those trained on fully human-annotated datasets, underscoring the effectiveness of RLTHF.
Researcher Affiliation | Collaboration | ¹Microsoft ²University of California, Los Angeles. Correspondence to: Yifei Xu <EMAIL>, Tusher Chakraborty <EMAIL>.
Pseudocode | Yes | Find the corresponding pseudocode in Appendix B.
Open Source Code | No | The paper does not provide an explicit statement about releasing source code for the described methodology or a direct link to a code repository. It mentions 'AlpacaEval' with a GitHub link, but this is a third-party tool used, not the authors' own implementation code.
Open Datasets | Yes | HH-RLHF: We use Anthropic's helpful and harmless human preference dataset (Bai et al., 2022a), which includes 161K training samples. TL;DR: We use the Reddit TL;DR summarization dataset (Völske et al., 2017) filtered by OpenAI along with their human preference dataset (Stiennon et al., 2020), which includes 93K training samples.
Dataset Splits | Yes | Sharding: RLTHF is run on a randomly down-sampled 1/4 shard of the full dataset. In each iteration, human annotation is applied to 4% of the given shard. We use an unseen test set of 4K samples for both HH-RLHF and TL;DR.
Hardware Specification | Yes | All training is done on a node of 8 A100 NVIDIA GPUs with DeepSpeed.
Software Dependencies | No | The paper mentions 'DeepSpeed' as a framework, and specific LLMs like 'Qwen2.5-3B' and 'Llama-3.1-8B-Instruct' as models, but does not provide specific version numbers for any key software libraries, frameworks, or programming languages used for implementation.
Experiment Setup | Yes | SFT: We perform full-parameter fine-tuning on Qwen2.5-3B base model. We use learning rate 2e-5, warm up ratio 0.2, and batch size of 32 for training 4 epochs. Reward Modeling: We train our reward model with Llama-3.1-8B-Instruct. This was a LoRA fine-tuning. We use learning rate 1e-4, warm up ratio 0.1, LoRA rank 32, LoRA alpha 64, and batch size of 128 for training 2 epochs. DPO: We perform DPO on the SFT model with data sanitized by RLTHF. We use learning rate 1e-6, warm up ratio 0.1, beta 0.1 and 0.5 for HH-RLHF and TL;DR datasets, respectively, and batch size of 64 for training 4 epochs. Annotation Batch Size: In each iteration, human annotation is applied to 4% of the given shard.
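
The reported setup can be collected into a small configuration sketch, which may help anyone attempting a reproduction. This is not the authors' code (none is released): the names `StageConfig`, `TRAIN_SIZES`, and `annotation_budget` are hypothetical, and the dataset sizes are the rounded "161K" and "93K" figures quoted above, so the derived counts are approximate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    """Hyperparameters quoted in the Experiment Setup row (hypothetical container)."""
    learning_rate: float
    warmup_ratio: float
    batch_size: int
    epochs: int


# Values as reported for each training stage.
SFT = StageConfig(learning_rate=2e-5, warmup_ratio=0.2, batch_size=32, epochs=4)
REWARD_MODEL = StageConfig(learning_rate=1e-4, warmup_ratio=0.1, batch_size=128, epochs=2)
DPO = StageConfig(learning_rate=1e-6, warmup_ratio=0.1, batch_size=64, epochs=4)

# Stage-specific extras reported alongside the values above.
LORA = {"rank": 32, "alpha": 64}            # reward model only
DPO_BETA = {"HH-RLHF": 0.1, "TL;DR": 0.5}   # per-dataset DPO beta

# Assumption: rounded training-set sizes ("161K", "93K"), not exact counts.
TRAIN_SIZES = {"HH-RLHF": 161_000, "TL;DR": 93_000}

SHARD_DIVISOR = 4        # RLTHF runs on a 1/4 shard of the full dataset
ANNOTATION_PERCENT = 4   # 4% of the shard is human-annotated per iteration


def annotation_budget(train_size: int) -> tuple[int, int]:
    """Return (shard size, samples annotated per iteration) using integer arithmetic."""
    shard = train_size // SHARD_DIVISOR
    per_iteration = shard * ANNOTATION_PERCENT // 100
    return shard, per_iteration


if __name__ == "__main__":
    for name, size in TRAIN_SIZES.items():
        shard, per_iteration = annotation_budget(size)
        print(f"{name}: shard of about {shard} samples, about {per_iteration} annotated per iteration")
```

Under these rounded sizes, each iteration would annotate roughly 1,610 HH-RLHF samples or 930 TL;DR samples. The number of iterations is not stated in this summary, so the sketch does not attempt to derive the 6-7% total annotation figure.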