Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Refining the Confidence Level for Optimistic Bandit Strategies

Authors: Tor Lattimore

JMLR 2018 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type: Experimental. "Preliminary empirical evidence is also promising. Besides this, a conjecture on the optimal form of the regret is shown to be false and a finite-time lower bound on the regret of any strategy is presented that very nearly matches the finite-time upper bound of the newly proposed strategy." Empirical evaluation: ADA-UCB is compared to UCB (Katehakis and Robbins, 1995), MOSS (Ménard and Garivier, 2017), Thompson sampling (Agrawal and Goyal, 2012) and IMED (Honda and Takemura, 2015). Figure 1: the regret of various algorithms as a function of Δ when µ = (Δ, 0, ..., 0) and the number of arms is 2, 10 and 100 respectively; the y-axis shows the regret averaged over N independent samples for each data point. Figure 2: the regret of various algorithms as a function of the horizon for the Gaussian bandit with k = 20 arms and payoff vector µ = (0, -0.03, ...).
Researcher Affiliation: Industry. Tor Lattimore, EMAIL, DeepMind, 5 New Street, London, United Kingdom.
Pseudocode: No. The paper describes the ADA-UCB strategy algorithmically within the text using mathematical expressions, e.g. A_t = argmax_{i∈[k]} γ_i(t), where the index of arm i in round t is γ_i(t) = µ̂_i(t−1) + sqrt( (2/T_i(t−1)) · log( n / H_i(t−1) ) ), with H_i(t) = T_i(t)·K_i(t) and K_i(t) = Σ_{m=1}^{k} min{ 1, sqrt( T_i(t)/T_m(t) ) }. However, it does not present these steps in a clearly labeled 'Pseudocode' or 'Algorithm' block, nor does it use structured formatting commonly associated with pseudocode.
Open Source Code: No. The paper contains no explicit statement about code release and does not link to a source-code repository for the methodology described.
Open Datasets: No. The paper focuses on theoretical analysis and empirical evaluation of bandit strategies in simulated environments. For instance, Figure 2 describes a "Gaussian bandit with k = 20 arms and payoff vector µ = (0, -0.03, ...)". This indicates the use of synthetically generated data rather than a specific, publicly available dataset that would require access information.
Dataset Splits: No. The experiments run on simulated bandit problems rather than pre-existing datasets, so the concept of training/validation/test splits does not apply. The paper specifies simulation parameters such as k (number of arms) and n (time horizon) for its scenarios, but these are not dataset splits.
Hardware Specification: No. The paper gives no details about the hardware (e.g., GPU models, CPU types, memory) used to run the experiments or simulations described in the 'Empirical evaluation' section.
Software Dependencies: No. The paper mentions various bandit algorithms (UCB, MOSS, Thompson sampling, IMED) and theoretical concepts, but it names no software with version numbers (e.g., Python 3.x, or specific libraries like TensorFlow, PyTorch, or scikit-learn with their versions) that would be necessary to reproduce the experimental results.
Experiment Setup: Yes. All algorithms choose each arm once and subsequently:
A_t^IMED = argmin_{i∈[k]} [ (T_i(t−1)/2) · (µ̂_i(t−1) − max_{j∈[k]} µ̂_j(t−1))² + log(T_i(t−1)) ],
A_t^UCB = argmax_{i∈[k]} µ̂_i(t−1) + sqrt( 2·log(t) / T_i(t−1) ),
A_t^MOSS = argmax_{i∈[k]} µ̂_i(t−1) + sqrt( (2/T_i(t−1)) · log_+( (n/(k·T_i(t−1))) · (log_+²( n/(k·T_i(t−1)) ) + 1) ) ),
A_t^TS = argmax_{i∈[k]} θ_i(t), where θ_i(t) ~ N( µ̂_i(t−1), 1/T_i(t−1) ).
In all plots N indicates the number of independent samples per data point, and confidence bands are calculated at the 95% level. The first three plots in Figure 1 show the regret in the worst-case regime where all suboptimal arms have the same suboptimality gap. Unsurprisingly, the relative advantage of policies with well-tuned confidence levels increases with the number of arms. At its worst, UCB suffers about three times the regret of ADA-UCB; coincidentally, sqrt(log(10⁴)) ≈ 3.03. Figure 2 shows the regret as a function of the horizon n on a fixed bandit with k = 20 arms (see the figure caption for the means).
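To make the ADA-UCB index quoted in the 'Pseudocode' row concrete, here is a minimal NumPy sketch. This is our own illustrative reconstruction, not code from the paper; the function name is hypothetical, and we assume unit-variance rewards and that every arm has been played at least once:

```python
import numpy as np

def ada_ucb_index(mu_hat, T, n):
    """ADA-UCB index for each arm (illustrative reconstruction).

    mu_hat -- empirical means, shape (k,)
    T      -- play counts, all >= 1, shape (k,)
    n      -- horizon
    """
    # K_i = sum_m min{1, sqrt(T_i / T_m)}: an "effective number of arms"
    # that is small for under-played arms and equals k for the most-played arm.
    ratios = np.sqrt(T[:, None] / T[None, :])   # ratios[i, m] = sqrt(T_i / T_m)
    K = np.minimum(1.0, ratios).sum(axis=1)
    H = T * K
    # Clip the log at zero so the bonus stays real once H exceeds n.
    bonus = np.sqrt((2.0 / T) * np.maximum(np.log(n / H), 0.0))
    return mu_hat + bonus
```

Note the design: with equal play counts K_i = k, so the confidence level reduces to the MOSS-style log(n/(k·T_i)); an under-played arm gets a smaller K_i and hence a larger exploration bonus.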
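The experiment-setup row describes simulations on Gaussian bandits with the baseline index rules above. As a hedged sketch of that kind of experiment (our own reconstruction, not the paper's code; the means, horizon, and seed are illustrative), the UCB and Thompson-sampling rules can be simulated as follows:

```python
import numpy as np

rng = np.random.default_rng(0)

def run_ucb(mu, n):
    """UCB with index mu_hat_i + sqrt(2 log(t) / T_i) on a unit-variance Gaussian bandit."""
    k = len(mu)
    T, S, regret = np.zeros(k), np.zeros(k), 0.0
    for t in range(n):
        if t < k:                                    # play each arm once first
            a = t
        else:
            a = int(np.argmax(S / T + np.sqrt(2.0 * np.log(t) / T)))
        S[a] += rng.normal(mu[a], 1.0)               # observe a unit-variance reward
        T[a] += 1
        regret += mu.max() - mu[a]
    return regret

def run_thompson(mu, n):
    """Gaussian Thompson sampling: sample theta_i ~ N(mu_hat_i, 1/T_i)."""
    k = len(mu)
    T, S, regret = np.zeros(k), np.zeros(k), 0.0
    for t in range(n):
        if t < k:
            a = t
        else:
            a = int(np.argmax(rng.normal(S / T, 1.0 / np.sqrt(T))))
        S[a] += rng.normal(mu[a], 1.0)
        T[a] += 1
        regret += mu.max() - mu[a]
    return regret

mu = np.array([0.0, -0.3, -0.3])                     # illustrative payoff vector
print(run_ucb(mu, 2000), run_thompson(mu, 2000))
```

In an actual evaluation one would average the regret over N independent runs per data point, as the figure captions describe, rather than report a single sample.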