Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Refining the Confidence Level for Optimistic Bandit Strategies

Authors: Tor Lattimore

JMLR 2018 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type: Experimental. "Preliminary empirical evidence is also promising. Besides this, a conjecture on the optimal form of the regret is shown to be false and a finite-time lower bound on the regret of any strategy is presented that very nearly matches the finite-time upper bound of the newly proposed strategy." Empirical evaluation: ADA-UCB is compared to UCB (Katehakis and Robbins, 1995), MOSS (Ménard and Garivier, 2017), Thompson sampling (Agrawal and Goyal, 2012) and IMED (Honda and Takemura, 2015). Figure 1: the regret of various algorithms as a function of Δ when µ = (Δ, 0, ..., 0) and the number of arms is 2, 10 and 100 respectively; the y-axis shows the regret averaged over N independent samples for each data point. Figure 2: the regret of various algorithms as a function of the horizon for the Gaussian bandit with k = 20 arms and payoff vector µ = (0, -0.03, ...).
Researcher Affiliation: Industry. Tor Lattimore, EMAIL, DeepMind, 5 New Street, London, United Kingdom.
Pseudocode: No. The paper describes the ADA-UCB strategy algorithmically within the text using mathematical expressions, e.g. A_t = argmax_{i∈[k]} γ_i(t), where the index of arm i in round t is γ_i(t) = µ̂_i(t−1) + sqrt( (2/T_i(t−1)) · log( n / H_i(t−1) ) ), with H_i(t) = T_i(t)·K_i(t) and K_i(t) = Σ_{m=1}^{k} min{ 1, sqrt( T_i(t)/T_m(t) ) }. However, it does not present these steps in a clearly labeled 'Pseudocode' or 'Algorithm' block, nor does it use structured formatting commonly associated with pseudocode.
Open Source Code: No. The paper contains no explicit statement about code release and does not link to a source-code repository for the methodology described.
Open Datasets: No. The paper focuses on theoretical analysis and empirical evaluation of bandit strategies in simulated environments. For instance, Figure 2 describes a "Gaussian bandit with k = 20 arms and payoff vector µ = (0, -0.03, ...)". This indicates the use of synthetically generated data rather than a specific, publicly available dataset that would require access information.
Dataset Splits: No. The experiments run on simulated bandit problems rather than pre-existing datasets, so the concept of training/validation/test splits does not apply. The paper specifies simulation parameters such as k (number of arms) and n (time horizon) for its scenarios, but these are not dataset splits.
Hardware Specification: No. The paper gives no details about the hardware (e.g., GPU models, CPU types, memory) used to run the experiments or simulations described in the 'Empirical evaluation' section.
Software Dependencies: No. The paper mentions various bandit algorithms (UCB, MOSS, Thompson sampling, IMED) and theoretical concepts, but it names no software with version numbers (e.g., Python 3.x, or specific libraries like TensorFlow, PyTorch, or scikit-learn with their versions) that would be necessary to reproduce the experimental results.
Experiment Setup: Yes. All algorithms choose each arm once and subsequently:
A_t^IMED = argmin_{i∈[k]} [ (T_i(t−1)/2) · (µ̂_i(t−1) − max_{j∈[k]} µ̂_j(t−1))² + log(T_i(t−1)) ],
A_t^UCB = argmax_{i∈[k]} µ̂_i(t−1) + sqrt( 2·log(t) / T_i(t−1) ),
A_t^MOSS = argmax_{i∈[k]} µ̂_i(t−1) + sqrt( (2/T_i(t−1)) · log_+( (n/(k·T_i(t−1))) · (log_+²( n/(k·T_i(t−1)) ) + 1) ) ),
A_t^TS = argmax_{i∈[k]} θ_i(t), where θ_i(t) ~ N( µ̂_i(t−1), 1/T_i(t−1) ).
In all plots N indicates the number of independent samples per data point, and confidence bands are calculated at the 95% level. The first three plots in Figure 1 show the regret in the worst-case regime where all suboptimal arms have the same suboptimality gap. Unsurprisingly, the relative advantage of policies with well-tuned confidence levels increases with the number of arms. At its worst, UCB suffers about three times the regret of ADA-UCB; coincidentally, sqrt(log(10⁴)) ≈ 3.03. Figure 2 shows the regret as a function of the horizon n on a fixed bandit with k = 20 arms (see the figure caption for the means).
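To make the ADA-UCB index quoted in the 'Pseudocode' row concrete, here is a minimal NumPy sketch. This is our own illustrative reconstruction, not code from the paper; the function name is hypothetical, and we assume unit-variance rewards and that every arm has been played at least once:

```python
import numpy as np

def ada_ucb_index(mu_hat, T, n):
    """ADA-UCB index for each arm (illustrative reconstruction).

    mu_hat -- empirical means, shape (k,)
    T      -- play counts, all >= 1, shape (k,)
    n      -- horizon
    """
    # K_i = sum_m min{1, sqrt(T_i / T_m)}: an "effective number of arms"
    # that is small for under-played arms and equals k for the most-played arm.
    ratios = np.sqrt(T[:, None] / T[None, :])   # ratios[i, m] = sqrt(T_i / T_m)
    K = np.minimum(1.0, ratios).sum(axis=1)
    H = T * K
    # Clip the log at zero so the bonus stays real once H exceeds n.
    bonus = np.sqrt((2.0 / T) * np.maximum(np.log(n / H), 0.0))
    return mu_hat + bonus
```

Note the design: with equal play counts K_i = k, so the confidence level reduces to the MOSS-style log(n/(k·T_i)); an under-played arm gets a smaller K_i and hence a larger exploration bonus.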
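The experiment-setup row describes simulations on Gaussian bandits with the baseline index rules above. As a hedged sketch of that kind of experiment (our own reconstruction, not the paper's code; the means, horizon, and seed are illustrative), the UCB and Thompson-sampling rules can be simulated as follows:

```python
import numpy as np

rng = np.random.default_rng(0)

def run_ucb(mu, n):
    """UCB with index mu_hat_i + sqrt(2 log(t) / T_i) on a unit-variance Gaussian bandit."""
    k = len(mu)
    T, S, regret = np.zeros(k), np.zeros(k), 0.0
    for t in range(n):
        if t < k:                                    # play each arm once first
            a = t
        else:
            a = int(np.argmax(S / T + np.sqrt(2.0 * np.log(t) / T)))
        S[a] += rng.normal(mu[a], 1.0)               # observe a unit-variance reward
        T[a] += 1
        regret += mu.max() - mu[a]
    return regret

def run_thompson(mu, n):
    """Gaussian Thompson sampling: sample theta_i ~ N(mu_hat_i, 1/T_i)."""
    k = len(mu)
    T, S, regret = np.zeros(k), np.zeros(k), 0.0
    for t in range(n):
        if t < k:
            a = t
        else:
            a = int(np.argmax(rng.normal(S / T, 1.0 / np.sqrt(T))))
        S[a] += rng.normal(mu[a], 1.0)
        T[a] += 1
        regret += mu.max() - mu[a]
    return regret

mu = np.array([0.0, -0.3, -0.3])                     # illustrative payoff vector
print(run_ucb(mu, 2000), run_thompson(mu, 2000))
```

In an actual evaluation one would average the regret over N independent runs per data point, as the figure captions describe, rather than report a single sample.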