Boltzmann Exploration Done Right
Authors: Nicolò Cesa-Bianchi, Claudio Gentile, Gábor Lugosi, Gergely Neu
NeurIPS 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 6 (Experiments): This section concludes by illustrating our theoretical results through some experiments, highlighting the limitations of Boltzmann exploration and contrasting it with the performance of Boltzmann Gumbel exploration. We consider a stochastic multi-armed bandit problem with K = 10 arms, each yielding Bernoulli rewards with mean µi = 1/2 for all suboptimal arms i > 1 and µ1 = 1/2 + ∆ for the optimal arm. We set the horizon to T = 10^6 and the gap parameter to ∆ = 0.01. We compare three variants of Boltzmann exploration with inverse learning rate parameters βt = C² (BE-const), βt = C²/log t (BE-log), and βt = C²/√t (BE-sqrt) for all t, and compare them with Boltzmann Gumbel exploration (BGE) and UCB with exploration bonus √(C² log(t)/Nt,i). We study two different scenarios: (a) all rewards drawn i.i.d. from the Bernoulli distributions with the means given above, and (b) the first T0 = 5,000 rewards set to 0 for arm 1. The latter scenario simulates the situation described in the proof of Theorem 1, and in particular exposes the weakness of Boltzmann exploration with increasing learning rate parameters. The results shown in Figure 1 (a) and (b) indicate that while some variants of Boltzmann exploration may perform reasonably well when initial rewards take typical values and the parameters are chosen luckily, all standard versions fail to identify the optimal arm when the initial draws are not representative of the true mean (which happens with a small constant probability). On the other hand, UCB and Boltzmann Gumbel exploration continue to perform well even under this unlikely event, as predicted by their respective theoretical guarantees. Notably, Boltzmann Gumbel exploration performs comparably to UCB in this example (even slightly outperforming its competitor here), and performs notably well for the recommended parameter setting of C² = σ² = 1/4 (noting that Bernoulli random variables are 1/4-subgaussian). |
| Researcher Affiliation | Academia | Nicolò Cesa-Bianchi (Università degli Studi di Milano, Milan, Italy); Claudio Gentile (INRIA Lille Nord Europe, Villeneuve-d'Ascq, France); Gábor Lugosi (ICREA & Universitat Pompeu Fabra, Barcelona, Spain); Gergely Neu (Universitat Pompeu Fabra, Barcelona, Spain) |
| Pseudocode | No | No, the paper describes the algorithm in text (e.g., 'Our algorithm operates by independently drawing perturbations Zt,i from a standard Gumbel distribution for each arm i, then choosing action It+1 = arg max_i {µ̂t,i + βt,i Zt,i} (Eq. 4)') but does not provide a structured pseudocode or algorithm block. A minimal sketch of this selection rule is given after the table. |
| Open Source Code | No | No, the paper does not contain any statement about releasing source code or a link to a code repository. |
| Open Datasets | No | No, the paper describes a simulated stochastic multi-armed bandit problem ('K = 10 arms each yielding Bernoulli rewards') rather than using a pre-existing publicly available dataset. Therefore, no access information for a dataset is provided or required. |
| Dataset Splits | No | No, the paper describes a sequential decision-making process (multi-armed bandit) with a set horizon (T = 10^6 rounds) and two reward generation scenarios, rather than defining train/validation/test splits for a fixed dataset. |
| Hardware Specification | No | No, the paper describes the experimental setup in terms of problem parameters (e.g., K=10 arms, T=10^6 horizon) and reward distributions, but does not provide any specific hardware details such as CPU/GPU models or memory. |
| Software Dependencies | No | No, the paper mentions the algorithms implemented for comparison (e.g., Boltzmann Gumbel exploration, UCB) but does not list any specific software dependencies or their version numbers required for reproduction. |
| Experiment Setup | Yes | We consider a stochastic multi-armed bandit problem with K = 10 arms, each yielding Bernoulli rewards with mean µi = 1/2 for all suboptimal arms i > 1 and µ1 = 1/2 + ∆ for the optimal arm. We set the horizon to T = 10^6 and the gap parameter to ∆ = 0.01. We compare three variants of Boltzmann exploration with inverse learning rate parameters βt = C² (BE-const), βt = C²/log t (BE-log), and βt = C²/√t (BE-sqrt) for all t, and compare them with Boltzmann Gumbel exploration (BGE) and UCB with exploration bonus √(C² log(t)/Nt,i). ...Notably, Boltzmann Gumbel exploration performs comparably to UCB in this example... and performs notably well for the recommended parameter setting of C² = σ² = 1/4 (noting that Bernoulli random variables are 1/4-subgaussian). A rough simulation sketch of this setup is given after the table. |
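
Since the paper gives the selection rule only in prose (Eq. 4) and releases no code, the following is a minimal sketch of that rule, assuming the per-arm learning rates βt,i = √(C²/Nt,i) used in the paper's analysis; the function name `bge_select` and its arguments are illustrative, not taken from any authors' implementation.

```python
# Minimal sketch of the Boltzmann-Gumbel selection rule quoted above (Eq. 4),
# assuming beta_{t,i} = sqrt(C^2 / N_{t,i}); all names here are illustrative.
import numpy as np

def bge_select(mu_hat, pull_counts, c2, rng):
    """Pick the arm maximising the Gumbel-perturbed empirical mean.

    mu_hat      : empirical mean reward of each arm (hat-mu_{t,i})
    pull_counts : number of pulls of each arm so far (N_{t,i}, assumed >= 1)
    c2          : the constant C^2 (C^2 = sigma^2 = 1/4 for Bernoulli rewards)
    """
    beta = np.sqrt(c2 / pull_counts)          # per-arm learning rate beta_{t,i}
    z = rng.gumbel(size=len(mu_hat))          # i.i.d. standard Gumbel perturbations Z_{t,i}
    return int(np.argmax(mu_hat + beta * z))
```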
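
The experiment setup described in the table can likewise be sketched as a small simulation of scenario (b): K = 10 Bernoulli arms with gap ∆ = 0.01, the first T0 = 5,000 rewards of the optimal arm forced to 0, and BGE compared against UCB with bonus √(C² log(t)/Nt,i). The horizon is reduced from the paper's T = 10^6 for speed, the three Boltzmann-exploration variants are omitted, and all identifiers and the regret bookkeeping are assumptions of this sketch rather than the authors' code.

```python
# Rough sketch of experiment scenario (b): K = 10 Bernoulli arms, gap Delta = 0.01,
# first T0 = 5,000 rewards of the optimal arm forced to 0. Horizon shortened from
# the paper's 10^6 for speed; all identifiers are illustrative.
import numpy as np

K, DELTA, T, T0, C2 = 10, 0.01, 100_000, 5_000, 0.25    # C^2 = sigma^2 = 1/4
means = np.full(K, 0.5)
means[0] += DELTA                                        # arm 0 is the optimal arm
rng = np.random.default_rng(0)

def reward(arm, t):
    """Scenario (b): the optimal arm's first T0 rewards are set to 0."""
    if arm == 0 and t < T0:
        return 0.0
    return float(rng.binomial(1, means[arm]))

def run(policy):
    mu_hat = np.zeros(K)
    counts = np.zeros(K)
    regret = 0.0
    for t in range(T):
        arm = t if t < K else policy(mu_hat, counts, t)  # pull each arm once first
        r = reward(arm, t)
        counts[arm] += 1
        mu_hat[arm] += (r - mu_hat[arm]) / counts[arm]   # running average of rewards
        regret += means[0] - means[arm]                  # per-round pseudo-regret
    return regret

def bge(mu_hat, counts, t):
    # Boltzmann-Gumbel exploration: perturb empirical means with Gumbel noise
    return int(np.argmax(mu_hat + np.sqrt(C2 / counts) * rng.gumbel(size=K)))

def ucb(mu_hat, counts, t):
    # UCB with exploration bonus sqrt(C^2 log(t) / N_{t,i})
    return int(np.argmax(mu_hat + np.sqrt(C2 * np.log(t) / counts)))

for name, pol in [("BGE", bge), ("UCB", ucb)]:
    print(f"{name} pseudo-regret after {T} rounds: {run(pol):.1f}")
```

Per the paper's Figure 1 (b), both of these policies should recover from the misleading initial rewards; the standard Boltzmann variants, which are omitted from this sketch, are the ones reported to fail in that scenario.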