Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

LASeR: Learning to Adaptively Select Reward Models with Multi-Arm Bandits

Authors: Duy M. H. Nguyen, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirically, we demonstrate the effectiveness of LASER for iteratively training LLMs using multiple RMs on three broad domains: reasoning, instruction-following in text generation, and long-context understanding (Sec. 4.2). We show that on reasoning benchmarks such as StrategyQA [Geva et al., 2021] (testing commonsense reasoning), GSM8K [Cobbe et al., 2021] (testing math reasoning), and MMLU [Hendrycks et al., 2021b] (testing general knowledge reasoning), LASER with Llama3-8B improves absolute accuracy (averaged across 3 datasets) by 1.45% over a baseline that uses best single RM for training and 2.67% over an ensemble of RM scores baseline.
Researcher Affiliation	Academia	¹UNC Chapel Hill, ²The University of Texas at Austin
Pseudocode	Yes	Algorithm 1 Bandit-based Reward Model Selection for LLM Training
Open Source Code	Yes	Code: https://github.com/duykhuongnguyen/LASeR-MAB
Open Datasets	Yes	We train and evaluate on StrategyQA [Geva et al., 2021], MMLU [Hendrycks et al., 2021b,a], and GSM8K [Cobbe et al., 2021]. ... We use user prompts from WildChat dataset [Zhao et al., 2024]... on LongBench [Bai et al., 2023]
Dataset Splits	Yes	For WildChat, the dataset was split into a 70/10/20 ratio for training, development, and testing. ... Each category was split into a 70/10/20 ratio, and the bandit model was trained and validated on the training and development sets and then tested on the test set. We report the detailed number of instances for train, development, and test sets in Appendix A.1.
Hardware Specification	Yes	Our experiments are run on 4 RTX A6000 with 48G memory each.
Software Dependencies	No	The paper mentions models like Llama-3-8B, Mistral-7B, Qwen2.5-32B and a technique called LoRA, but does not provide specific version numbers for software dependencies like Python, PyTorch, or CUDA.
Experiment Setup	Yes	LoRA: For training with LoRA, we set the rank to 16 and alpha to 32. ... Training iterations: ... LASER, Best RM, Avg. RM, Classifier RM, and RM ensemble baselines were trained for 10 iterations. For both the sequential and random RM selection, we found LLM training took longer to converge, and consequently, the model was trained for 25 iterations. Batch size: We fine-tune the model using a learning rate of 5e-6 and a batch size of 16.
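
For readers who want the quoted settings in one place, the following is a minimal Python sketch, not the authors' code. It records the values cited in the Experiment Setup and Dataset Splits rows and shows one way a 70/10/20 split could be produced. The class name, field names, the shuffling step, and the seed are all hypothetical; the excerpts above state only the ratio and the hyperparameter values.

```python
from dataclasses import dataclass
import random


@dataclass(frozen=True)
class ReportedSetup:
    """Values quoted in the table above; all field names are hypothetical."""
    lora_rank: int = 16
    lora_alpha: int = 32
    learning_rate: float = 5e-6
    batch_size: int = 16
    # LASeR, Best RM, Avg. RM, Classifier RM, and RM ensemble baselines
    iterations_default: int = 10
    # Sequential and random RM selection baselines
    iterations_seq_random: int = 25


def split_70_10_20(items, seed=0):
    """Shuffle and split items into train/dev/test with a 70/10/20 ratio.

    Shuffling and the seed are assumptions made for this sketch; the
    quoted text gives the ratio only.
    """
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 7 // 10  # integer arithmetic avoids float rounding
    n_dev = len(items) // 10
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]  # remainder goes to the test set
    return train, dev, test


if __name__ == "__main__":
    setup = ReportedSetup()
    train, dev, test = split_70_10_20(range(1000))
    print(setup)
    print(len(train), len(dev), len(test))
```

The exact per-dataset instance counts are reported in Appendix A.1 of the paper and are not reproduced here.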