Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..
Tightening Regret Lower and Upper Bounds in Restless Rising Bandits
Authors: Cristiano Migali, Marco Mussi, Gianmarco Genalti, Alberto Maria Metelli
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we investigate from a theoretical standpoint two well-known structured subclasses of restless MABs... we establish new lower bounds... Then, we introduce Rising Concave Budgeted Exploration (RC-BEpαq), a new regret minimization algorithm... we show that the expected cumulative regret of RC-BEpαq is in the order of r Op T 7{11q. These results collectively make a step towards closing the gap in rising concave MABs... Numerical simulations are provided in Section 5. ... The empirical cumulative regret suffered by the algorithms is shown in Figure 3b. |
| Researcher Affiliation | Academia | Cristiano Migali Politecnico di Milano, Milan, Italy EMAIL Marco Mussi Politecnico di Milano, Milan, Italy EMAIL Gianmarco Genalti Politecnico di Milano, Milan, Italy EMAIL Alberto Maria Metelli Politecnico di Milano, Milan, Italy EMAIL |
| Pseudocode | Yes | Algorithm 1 RC-BEpαq. |
| Open Source Code | Yes | The code to reproduce the results is available at https://github.com/m1gwings/rcbealpha-experiments. |
| Open Datasets | No | The algorithms are evaluated for T 5 106 rounds on synthetic instances with K 5 arms. The stochasticity is realized by adding Gaussian noise with standard deviation σ 0.1. The curves of the expected rewards have the functional form fptq cp1 expp sat{Tqq for t P JTK where a, c P p0, 1s, s 50, and are reported in Figure 3a. |
| Dataset Splits | No | The paper evaluates algorithms on synthetic instances over a learning horizon T and reports results averaged over multiple runs (e.g., "10 runs, mean 95% C.I."). This type of problem (Multi-Armed Bandits) does not typically involve traditional train/test/validation splits as seen in supervised learning tasks. |
| Hardware Specification | Yes | The simulations were run on a single CPU core with a clock frequency of 2.60 GHz. The system has a 8.0 Gi B RAM. |
| Software Dependencies | No | The paper does not explicitly state specific software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x). |
| Experiment Setup | Yes | The choices of the parameters of the algorithms that we compared are the following: Rexp3: VT K since, as remarked in Section 2, in the rising setting the cumulative increment is always smaller than or equal to K; T rp K lnp Kqq1{3p T{VT q2{3s; γ min ! 1, a K lnp Kq{p T pe 1qq ) as recommended in (Besbes et al., 2014). R-less-UCB: hi,t tϵNi,t 1u where Ni,t 1 is the number of times arm i has been pulled by the agent in the first t 1 rounds, with ϵ P p0, 1{2q; α ą 2 as prescribed in (Metelli et al., 2022). In particular, we choose ϵ 0.25; α 2.1. UCB1: the upper confidence bound interval for arm i at round t is σ a 4 lnptq{Ni,t 1. RC-BEpαq: the parameter α of RC-BEpαq is set to α 8{3, as suggested by Theorem 4.4. |