Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Tsallis-INF: An Optimal Algorithm for Stochastic and Adversarial Bandits
Authors: Julian Zimmert, Yevgeny Seldin
JMLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide empirical evaluation of the algorithm demonstrating that it significantly outperforms Ucb1 and Exp3 in stochastic environments. We also provide examples of adversarial environments, where Ucb1 and Thompson Sampling exhibit almost linear regret, whereas our algorithm suffers only logarithmic regret. To the best of our knowledge, this is the first example demonstrating vulnerability of Thompson Sampling in adversarial environments. Last but not least, we present a general stochastic analysis and a general adversarial analysis of OMD algorithms with Tsallis entropy regularization for α ∈ [0, 1] and explain the reason why α = 1/2 works best. |
| Researcher Affiliation | Academia | Julian Zimmert, Yevgeny Seldin — University of Copenhagen, Copenhagen, Denmark |
| Pseudocode | Yes | Algorithm 1: Online Mirror Descent for bandits... Algorithm 2: Newton's Method approximation of w_t in Tsallis-Inf (α = 1/2) |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code, nor does it provide a link to a code repository. Phrases like 'we release our code' or direct links are absent. |
| Open Datasets | No | The paper describes experiments conducted in simulated 'stochastic MAB' and 'stochastically constrained adversarial' environments, where parameters like mean rewards and gaps are defined and varied. It does not mention using any pre-existing publicly available datasets, nor does it provide access information for any generated data. |
| Dataset Splits | No | The paper describes simulated environments rather than using pre-existing datasets with defined splits. It mentions '100 repetitions' for experiments and varying parameters (K, gaps), but not dataset splits like training, validation, or test sets. |
| Hardware Specification | No | The paper does not provide any specific hardware details such as GPU/CPU models, memory, or cloud computing specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions several algorithms (Ucb1, Thompson Sampling, Exp3, EXP3++, Broad) as baselines and their respective authors/papers, but does not list specific software or library names with version numbers that were used for implementation or experimentation. |
| Experiment Setup | Yes | The first experiment, shown in Figures 1 and 2, is a standard stochastic MAB, where the mean rewards are (1 + ε)/2 for the single optimal arm and (1 - ε)/2 for all the suboptimal arms. The number of arms K and the gaps ε are varied as described in the figures. ...The second experiment, shown in Figures 3 and 4, simulates stochastically constrained adversaries. The mean loss of (optimal arm, all sub-optimal arms) switches between (1-ε, 1) and (0, ε), while staying unchanged for phases that are increasing exponentially in length. ...The pseudo-regret is estimated by 100 repetitions of the corresponding experiments and two standard deviations of the empirical pseudo-regret... Ucb1 (Auer et al., 2002a, with parameter α = 1.5) and Thompson Sampling (Thompson, 1933)... EXP3++ with parametrization proposed by Seldin and Lugosi (2017) and Broad (Wei and Luo, 2018). |
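The pseudocode rows above reference the paper's two algorithms: an OMD loop and a Newton's-method solver for the sampling distribution w_t under 1/2-Tsallis entropy regularization. The following is a hedged sketch, not the paper's reference implementation: the learning-rate schedule `eta = 1/sqrt(t)` and the Newton initialization are assumptions, and the paper derives the exact constants.

```python
import numpy as np

def tsallis_inf_weights(L_hat, eta, n_iter=50, tol=1e-12):
    """Newton's-method approximation of w_t for alpha = 1/2 (in the spirit
    of the paper's Algorithm 2): find x such that
    sum_i 4 / (eta * (L_hat[i] - x))**2 == 1, with x < min(L_hat)."""
    K = len(L_hat)
    # Start just left of min(L_hat) so the weight sum exceeds 1; since
    # f(x) = sum_i w_i(x) - 1 is convex and increasing for x < min(L_hat),
    # Newton's method then moves x monotonically toward the root.
    x = np.min(L_hat) - 1.0 / (eta * np.sqrt(K))
    for _ in range(n_iter):
        w = 4.0 / (eta * (L_hat - x)) ** 2
        f = w.sum() - 1.0
        if abs(f) < tol:
            break
        x -= f / (eta * np.sum(w ** 1.5))  # f'(x) = eta * sum_i w_i^{3/2}
    w = 4.0 / (eta * (L_hat - x)) ** 2
    return w / w.sum()  # renormalize so the probabilities sum to exactly 1

def run_tsallis_inf(loss_fn, K, T, rng):
    """OMD loop with importance-weighted loss estimates (Algorithm 1 flavor).
    loss_fn(t, arm) must return a loss in [0, 1]."""
    L_hat = np.zeros(K)          # cumulative importance-weighted loss estimates
    total_loss = 0.0
    for t in range(1, T + 1):
        eta = 1.0 / np.sqrt(t)   # assumed schedule; the paper fixes the constant
        w = tsallis_inf_weights(L_hat, eta)
        arm = rng.choice(K, p=w)
        loss = loss_fn(t, arm)
        L_hat[arm] += loss / w[arm]  # unbiased importance-weighted estimate
        total_loss += loss
    return total_loss
```

With equal cumulative estimates the solver returns the uniform distribution, and arms with larger estimated losses receive smaller probabilities, as expected of the OMD update.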
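The two simulated environments in the Experiment Setup row can be sketched as Bernoulli loss generators. This is an illustrative reconstruction: arm 0 is taken as the optimal arm, rewards are converted to losses (loss = 1 − reward), and the exponential growth factor for phase lengths is an assumption — the quoted setup states only that phases increase exponentially in length.

```python
import numpy as np

def stochastic_losses(T, K, eps, rng):
    """Standard stochastic MAB from the setup row: mean reward (1 + eps)/2
    for the single optimal arm (index 0) and (1 - eps)/2 for the rest,
    expressed here as Bernoulli losses (loss = 1 - reward)."""
    means = np.full(K, (1 + eps) / 2)  # loss means for suboptimal arms
    means[0] = (1 - eps) / 2           # optimal arm has the smallest mean loss
    return (rng.random((T, K)) < means).astype(float)

def stochastically_constrained_losses(T, K, eps, rng, growth=1.6):
    """Stochastically constrained adversary: the (optimal, suboptimal) loss
    means switch between (1 - eps, 1) and (0, eps), staying fixed within
    phases whose lengths grow exponentially (the factor `growth` is an
    assumed value)."""
    means = np.empty((T, K))
    t, phase_len, regime = 0, 1.0, 0
    while t < T:
        end = min(T, t + int(np.ceil(phase_len)))
        opt, sub = ((1 - eps, 1.0), (0.0, eps))[regime]
        means[t:end] = sub
        means[t:end, 0] = opt          # arm 0 stays optimal in both regimes
        t, phase_len, regime = end, phase_len * growth, 1 - regime
    return (rng.random((T, K)) < means).astype(float)
```

Note that in the adversarial generator the optimal arm's mean loss is always eps below the suboptimal arms', even though the absolute levels jump between phases; this is what makes the sequence "stochastically constrained" rather than fully adversarial.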