Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Mechanism Design for LLM Fine-tuning with Multiple Reward Models

Authors: Haoran Sun, Yurong Chen, Siwei Wang, Xu Chu, Wei Chen, Xiaotie Deng

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on real LLM training results further confirm the practical implications of our results. ... 5 Empirical Study: In this section, we present an empirical evaluation of the proposed mechanism. Our objectives are twofold: first, to demonstrate that in practical LLM settings, agents can benefit from misreporting their preferences and distorting the learning outcomes; and second, to intuitively show how our mechanism incentivizes truthful reporting.
Researcher Affiliation | Collaboration | Haoran Sun1, Yurong Chen2, Siwei Wang3, Xu Chu1, Wei Chen3, Xiaotie Deng1; 1 CFCS, School of Computer Science, Peking University; 2 Inria, École Normale Supérieure, PSL Research University; 3 Microsoft Research Asia
Pseudocode | No | The paper describes mathematical formulations and theoretical mechanisms, but it does not contain explicit pseudocode blocks or algorithm listings.
Open Source Code | Yes | The code for the simulation is available at GitHub.
Open Datasets | Yes | For the Helpful Assistants task, the initial model LLM_{θ_init} is obtained by supervised fine-tuning a Llama-2 7b model on the Anthropic-HH dataset [5]. We then apply two reward models during the RLHF process to measure harmlessness and humor, respectively. For the Reddit Summary task, the model is fine-tuned on the Summarize-from-Feedback dataset [72], with two reward models assessing the summary's quality and faithfulness.
Dataset Splits | No | The paper does not explicitly detail train/test/validation splits for the datasets used for LLM fine-tuning. It mentions "synthetic group size vectors (w1, w2) selected from {(3, 7), (5, 5), (7, 3)}" for the simulation, but this refers to experimental conditions, not traditional dataset splits for model evaluation.
Hardware Specification | No | The paper states that Llama-2 7b is used as the base model and discusses computational costs generally, but it does not provide specific details on the hardware (e.g., GPU models, CPU types, memory) used for conducting the experiments.
Software Dependencies | No | The paper mentions using Llama-2 7b as the base model and techniques like Rewarded Soups, but it does not specify any software libraries or frameworks with their version numbers (e.g., PyTorch version, Python version).
Experiment Setup | Yes | We implement the basic training rule from Definition 4.1, using the KL-divergence as the distance measure f. To balance model optimality with training cost, we simplify the problem by replacing the entire parameter space Θ with a representative finite set Θ . Models are first trained using single reward models and then combined via the Rewarded Soups technique [62] to produce a set of hybrid models, {θ1, θ2, ..., θK}, which constitute Θ . ... Our experiments confirm that both strategies can be profitable. However, the DSIC of our mechanism ensures that truthful reporting yields higher utility than any misreporting strategy. ... As shown in the figure, increasing α or β leads to a higher valuation for the group, confirming that groups can benefit from simple misreporting in the absence of payments.
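
The Experiment Setup excerpt can be illustrated with a small sketch. It is not the authors' code. It assumes that Rewarded Soups combines two single-reward models by linear interpolation of their weights, and that the training rule picks, from the finite candidate set, the model maximizing the group-size-weighted sum of reported values minus a KL penalty. The exact form of Definition 4.1 is not quoted on this page, so that objective is an assumption. The names rewarded_soup, select_model, and reg_strength, the toy numbers, and the scaling factor used for misreporting are all hypothetical.

```python
from typing import List, Sequence


def rewarded_soup(theta_a: Sequence[float], theta_b: Sequence[float],
                  num_models: int) -> List[List[float]]:
    """Interpolate linearly between two parameter vectors (assumed soup rule)."""
    if num_models < 2:
        raise ValueError("need at least two candidate models")
    soups = []
    for k in range(num_models):
        lam = k / (num_models - 1)  # mixing coefficient in [0, 1]
        soups.append([(1 - lam) * a + lam * b for a, b in zip(theta_a, theta_b)])
    return soups


def select_model(reported_values: Sequence[Sequence[float]],
                 group_sizes: Sequence[float],
                 kl_to_init: Sequence[float],
                 reg_strength: float = 1.0) -> int:
    """Return the index k maximizing sum_i w_i * v_i[k] - reg_strength * KL[k].

    reported_values[i][k] is group i's reported value for candidate k.
    This objective is an assumed stand-in for the paper's Definition 4.1.
    """
    best_k, best_score = 0, float("-inf")
    for k in range(len(kl_to_init)):
        welfare = sum(w * vals[k] for w, vals in zip(group_sizes, reported_values))
        score = welfare - reg_strength * kl_to_init[k]
        if score > best_score:
            best_k, best_score = k, score
    return best_k


# Toy candidate set: three hybrids between two hypothetical single-reward models.
candidates = rewarded_soup([0.0, 2.0], [1.0, 0.0], num_models=3)

# Hypothetical true values of each group for the three candidates.
true_values = [[1.0, 0.6, 0.0],   # group 1 prefers candidate 0
               [0.0, 0.6, 1.0]]   # group 2 prefers candidate 2
kl = [0.2, 0.1, 0.2]              # hypothetical KL distance to the initial model

truthful_choice = select_model(true_values, (5, 5), kl)

# Group 1 exaggerates its values by a factor of 2 (hypothetical misreport).
misreport = [[2.0 * v for v in true_values[0]], true_values[1]]
misreport_choice = select_model(misreport, (5, 5), kl)
```

With these toy numbers, truthful reports select the middle hybrid, while group 1's exaggerated report moves the selection to its preferred model and raises its true value from 0.6 to 1.0. That matches the excerpt's qualitative claim that misreporting can pay when no payments are charged. The payment rule that makes the paper's mechanism DSIC is not reproduced here.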