Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

Online Mixture of Experts: No-Regret Learning for Optimal Collective Decision-Making

Authors: Larkin Liu, Jalal Etesami

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We derive theoretical guarantees for the regret properties in the bandit setting under ideal circumstances, and empirical results are provided accordingly. As a modern study on applications, these methods are applied to the online fine-tuning of a set of expert large language models (LLMs), where after each response, the generative LLM dynamically reweighs its set of experts and/or selects the optimal committee of experts to generate the most accurate response. Our results introduce new methodologies and no-regret guarantees for combining multiple experts to improve on the performance of the aggregate model overall.
Researcher Affiliation | Academia | Larkin Liu, Technische Universität München, Riebaki AI, EMAIL; Jalal Etesami, Technische Universität München, EMAIL
Pseudocode | Yes | Algorithm 1 Successive Expert Elimination (SEE) ... Algorithm 2 θ-Weighted Majority Voting with Full Bandit Feedback (θ-WMV) ... Algorithm 3 Greedy Algorithm Construct Optimal Committee with Accuracy Pmaj ... Algorithm 4 Zooming Algorithm for Lipschitz Bandits ... Algorithm 5 Response Parser
Open Source Code | No | Answer: [Yes] Justification: The methods and procedures to reproduce the results are clearly outlined in Algorithm 1 and 2. Moreover, implementation details are outlined in Appendix C.
Open Datasets | Yes | In the data-derived setting, we sample tasks from diverse domains: GSM8K [36] (mathematical reasoning), CommonsenseQA [37] (implicit knowledge), and BoolQ [38] (evidence-based binary inference), each presenting unique challenges. ... Table 3: Overview of selected evaluation datasets.
Dataset Splits | No | The paper mentions using well-known public datasets (GSM8K [36], CommonsenseQA [37], BoolQ [38]) and states that 'questions were sampled' from them for the experiments. However, it does not provide specific details on how these datasets were split into training, validation, or test sets (e.g., percentages, sample counts, or explicit references to predefined splits used in the paper's experiments).
Hardware Specification | Yes | Table 4: Computing hardware and software library specifications. CPU: Intel Core i9-9900K @ 3.60 GHz; Cores/Threads: 16; GPU: NVIDIA GeForce RTX 2080; VRAM: 8 GB GDDR6; CUDA Version: 11.8; System Memory: 64 GB DDR4; Storage: NVMe SSD
Software Dependencies | Yes | Table 4: Computing hardware and software library specifications. Python Version: 3.8.10; Gurobi Optimizer: 9.5.2
Experiment Setup | Yes | Further details including prompting strategies (chain-of-thought), expert LLM specifications, baseline implementations, and ablation studies are provided in Appendix C. ... C.4 Chain-of-Thought Prompting with Constrained Output Formatting ... C.4.1 GSM8K Prompt Template ... C.5 Extended Empirical Results (showing time horizon T = 10,000, 2,000, or 1,000 for various configurations).
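
The Research Type row describes reweighting a set of experts after each response. As a rough illustration of that idea only, the sketch below implements the classic deterministic weighted majority update for binary labels. It is not the paper's SEE or θ-WMV algorithm: it assumes the true label is revealed after every round (full feedback rather than bandit feedback), and the function names and the penalty factor `beta` are hypothetical choices made for this example.

```python
from typing import List, Sequence, Tuple


def weighted_majority_vote(weights: Sequence[float], votes: Sequence[int]) -> int:
    """Return the binary label (0 or 1) backed by the larger total weight."""
    weight_for_one = sum(w for w, v in zip(weights, votes) if v == 1)
    weight_for_zero = sum(w for w, v in zip(weights, votes) if v == 0)
    # Ties are broken in favour of label 1 (an arbitrary choice for this sketch).
    return 1 if weight_for_one >= weight_for_zero else 0


def update_weights(
    weights: Sequence[float], votes: Sequence[int], truth: int, beta: float
) -> List[float]:
    """Multiply the weight of every expert that voted incorrectly by beta."""
    return [w * beta if v != truth else w for w, v in zip(weights, votes)]


def run_weighted_majority(
    rounds: Sequence[Tuple[Sequence[int], int]], n_experts: int, beta: float = 0.5
) -> Tuple[List[int], List[float]]:
    """Run the vote-then-update loop over (expert_votes, true_label) rounds."""
    weights = [1.0] * n_experts  # every expert starts with equal weight
    predictions = []
    for votes, truth in rounds:
        predictions.append(weighted_majority_vote(weights, votes))
        # Assumption: the true label is observed after each prediction.
        weights = update_weights(weights, votes, truth, beta)
    return predictions, weights


if __name__ == "__main__":
    # Expert 0 is always right; experts 1 and 2 are always wrong.
    demo_rounds = [([1, 0, 0], 1), ([1, 0, 0], 1)]
    preds, final_weights = run_weighted_majority(demo_rounds, n_experts=3)
    print(preds, final_weights)
```

In the two-round demo, the two incorrect experts outvote the correct one in the first round, are penalized, and no longer dominate in the second round, which is the basic mechanism behind reweighting experts over time.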