Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Online Mixture of Experts: No-Regret Learning for Optimal Collective Decision-Making

Authors: Larkin Liu, Jalal Etesami

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We derive theoretical guarantees for the regret properties in the bandit setting under ideal circumstances, and empirical results are provided accordingly. As a modern study on applications, these methods are applied to the online fine-tuning of a set of expert large language models (LLMs), where after each response, the generative LLM dynamically reweighs its set of experts and/or selects the optimal committee of experts to generate the most accurate response. Our results introduce new methodologies and no-regret guarantees for combining multiple experts to improve on the performance of the aggregate model overall.
Researcher Affiliation	Academia	Larkin Liu Technische Universität München Riebaki AI EMAIL Jalal Etesami Technische Universität München EMAIL
Pseudocode	Yes	Algorithm 1 Successive Expert Elimination (SEE) ... Algorithm 2 θ-Weighted Majority Voting with Full Bandit Feedback (θ-WMV) ... Algorithm 3 Greedy Algorithm Construct Optimal Committee with Accuracy Pmaj ... Algorithm 4 Zooming Algorithm for Lipschitz Bandits ... Algorithm 5 Response Parser
Open Source Code	No	Answer: [Yes] Justification: The methods and procedures to reproduce the results are clearly outlined in Algorithm 1 and 2. Moreover, implementation details are outlined in Appendix C.
Open Datasets	Yes	In the data-derived setting, we sample tasks from diverse domains: GSM8K [36] (mathematical reasoning), CommonsenseQA [37] (implicit knowledge), and BoolQ [38] (evidence-based binary inference), each presenting unique challenges. ... Table 3: Overview of selected evaluation datasets.
Dataset Splits	No	The paper mentions using well-known public datasets (GSM8K [36], CommonsenseQA [37], BoolQ [38]) and states that 'questions were sampled' from them for the experiments. However, it does not provide specific details on how these datasets were split into training, validation, or test sets (e.g., percentages, sample counts, or explicit references to predefined splits used in the paper's experiments).
Hardware Specification	Yes	Table 4: Computing hardware and software library specifications. CPU: Intel Core i9-9900K @ 3.60 GHz; Cores/Threads: 16; GPU: NVIDIA GeForce RTX 2080; VRAM: 8 GB GDDR6; CUDA Version: 11.8; System Memory: 64 GB DDR4; Storage: NVMe SSD
Software Dependencies	Yes	Table 4: Computing hardware and software library specifications. Python Version: 3.8.10; Gurobi Optimizer: 9.5.2
Experiment Setup	Yes	Further details including prompting strategies (chain-of-thought), expert LLM specifications, baseline implementations, and ablation studies are provided in Appendix C. ... C.4 Chain-of-Thought Prompting with Constrained Output Formatting ... C.4.1 GSM8K Prompt Template ... C.5 Extended Empirical Results (showing time horizon T = 10,000, 2,000, or 1,000 for various configurations).