Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..

Multimodal Bandits: Regret Lower Bounds and Optimal Algorithms

Authors: William Réveillard, Richard Combes

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental 6 Numerical experiments In this section, we conduct numerical experiments to demonstrate the benefit of properly exploiting the multimodal structure. To this end, we implement OSSB from [1] and use our approach to solve PGL. ... The cumulative regret is averaged over 500 trials. The results are presented in Figure 3, where the shaded regions have radius one standard error.
Researcher Affiliation Academia William Réveillard Division of Decision and Control Systems KTH Royal Institute of Technology 11428 Stockholm, Sweden EMAIL Richard Combes Laboratoire des signaux et systèmes Université Paris-Saclay, CNRS, Centrale Supélec 91190 Gif-sur-Yvette, France EMAIL
Pseudocode Yes For completeness, the pseudo-code of OSSB is provided in Appendix A.1.
Open Source Code Yes The code for the proposed algorithms, which are involved, is publicly available at https://github. com/wilrev/Multimodal Bandits.
Open Datasets No We consider instances µ F2 with rewards from arm k [K] drawn from a Gaussian distribution N(µk, 1). The mean rewards µk are generated as a sum of exponential functions centered on the modes in M(µ): 1 + 1j=k (µ) exp ( ρjk/σ) , where ρjk is the shortest path distance between nodes j and k in G. ... The code used is our own and the data is synthetic.
Dataset Splits No We run the experiment up to a horizon of T = 10, 000. To reduce the computational burden of solving PGL at each round, we only update η (t) when t = 2k for k {0, . . . , log2 T }. The cumulative regret is averaged over 500 trials.
Hardware Specification Yes All experiments were run on a single desktop PC equipped with an AMD Ryzen 7 5800X 8-core/16thread CPU @ 3.8 GHz, 16 GB DDR4 RAM.
Software Dependencies Yes In this specific experiment, instead of the penalized subgradient subroutine described in Proposition 9, we used the Sequential Least SQuares Programming (SLSQP) method from [12] and implemented in Sci Py by [23] for faster convergence towards a solution to PGL for each ˆµ(t)
Experiment Setup Yes G is chosen as a fixed binary tree of height two (resulting in K = 7 arms). We consider instances µ F2 with rewards from arm k [K] drawn from a Gaussian distribution N(µk, 1). The mean rewards µk are generated as a sum of exponential functions centered on the modes in M(µ): ... We choose M(µ) = {4, 6} and k (µ) = 6. The parameter σ controls how peaked the reward function is: a small σ leads to sharp peaks with modes well-separated from their neighbors, whereas a large σ creates flatter modes. We consider two instances: σ = 0.5 (easy instance) and σ = 4 (hard instance). We run the experiment up to a horizon of T = 10, 000. To reduce the computational burden of solving PGL at each round, we only update η (t) when t = 2k for k {0, . . . , log2 T }. We used n = 100 discretization points.