Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

LASeR: Learning to Adaptively Select Reward Models with Multi-Arm Bandits

Authors: Duy M. H. Nguyen, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we demonstrate the effectiveness of LASER for iteratively training LLMs using multiple RMs on three broad domains: reasoning, instruction-following in text generation, and long-context understanding (Sec. 4.2). We show that on reasoning benchmarks such as StrategyQA [Geva et al., 2021] (testing commonsense reasoning), GSM8K [Cobbe et al., 2021] (testing math reasoning), and MMLU [Hendrycks et al., 2021b] (testing general knowledge reasoning), LASER with Llama3-8B improves absolute accuracy (averaged across 3 datasets) by 1.45% over a baseline that uses best single RM for training and 2.67% over an ensemble of RM scores baseline.
Researcher Affiliation | Academia | 1: UNC Chapel Hill; 2: The University of Texas at Austin
Pseudocode | Yes | Algorithm 1: Bandit-based Reward Model Selection for LLM Training
Open Source Code | Yes | Code: https://github.com/duykhuongnguyen/LASeR-MAB
Open Datasets | Yes | We train and evaluate on StrategyQA [Geva et al., 2021], MMLU [Hendrycks et al., 2021b,a], and GSM8K [Cobbe et al., 2021]. ... We use user prompts from WildChat dataset [Zhao et al., 2024]... on LongBench [Bai et al., 2023]
Dataset Splits | Yes | For WildChat, the dataset was split into a 70/10/20 ratio for training, development, and testing. ... Each category was split into a 70/10/20 ratio, and the bandit model was trained and validated on the training and development sets and then tested on the test set. We report the detailed number of instances for train, development, and test sets in Appendix A.1.
Hardware Specification | Yes | Our experiments are run on 4 RTX A6000 with 48G memory each.
Software Dependencies | No | The paper mentions models like Llama-3-8B, Mistral-7B, Qwen2.5-32B and a technique called LoRA, but does not provide specific version numbers for software dependencies like Python, PyTorch, or CUDA.
Experiment Setup | Yes | LoRA: For training with LoRA, we set the rank to 16 and alpha to 32. ... Training iterations: ... LASER, Best RM, Avg. RM, Classifier RM, and RM ensemble baselines were trained for 10 iterations. For both the sequential and random RM selection, we found LLM training took longer to converge, and consequently, the model was trained for 25 iterations. Batch size: We fine-tune the model using a learning rate of 5e-6 and a batch size of 16.
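
For readers who want to act on the Dataset Splits and Experiment Setup entries above, the following is a minimal sketch, not the authors' code. It collects the quoted hyperparameters in one place and shows one possible way to produce a 70/10/20 train/development/test split. The names ReportedSetup and split_70_10_20, the random shuffle, and the seed are assumptions made for illustration; this page does not describe how the paper actually drew its splits.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedSetup:
    """Values quoted in the table above; the field names are hypothetical."""
    lora_rank: int = 16
    lora_alpha: int = 32
    learning_rate: float = 5e-6
    batch_size: int = 16
    iterations_default: int = 10        # LASER, Best RM, Avg. RM, Classifier RM, RM ensemble
    iterations_seq_or_random: int = 25  # sequential and random RM selection
    num_gpus: int = 4                   # RTX A6000, 48G memory each


def split_70_10_20(items, seed=0):
    """Shuffle items and cut them into 70% train, 10% dev, 20% test.

    Assumption: a seeded random shuffle; the source does not say how
    instances were assigned to each split.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = n * 7 // 10  # integer arithmetic avoids float rounding
    n_dev = n // 10
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # remainder goes to the test set
    return train, dev, test


if __name__ == "__main__":
    setup = ReportedSetup()
    train, dev, test = split_70_10_20(range(100))
    print(setup)
    print(len(train), len(dev), len(test))
```

With 100 items the helper returns 70, 10, and 20 items respectively. Because the Software Dependencies entry reports no library versions, the sketch uses only the Python standard library and does not attempt to reproduce the training stack.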