Unreasonable Effectiveness of Greedy Algorithms in Multi-Armed Bandit with Many Arms

Authors: Mohsen Bayati, Nima Hamidi, Ramesh Johari, Khashayar Khosravi

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | However, numerical investigation reveals interesting behaviors. In Figure 1, we simulate several different algorithms over 400 simulations, for two pairs of (T, k) in the many-armed regime. Our extensive simulations in Section 6 and in the longer version of the paper [7] show that these insights are robust to varying reward and prior distributions. Indeed, similar results are obtained with Bernoulli rewards and general beta priors. Further, we observe that the same phenomenon arises in the contextual MAB setting, in simulations with synthetic and real-world data.
Researcher Affiliation | Collaboration | Mohsen Bayati (Stanford University, bayati@stanford.edu); Nima Hamidi (Stanford University, hamidi@stanford.edu); Ramesh Johari (Stanford University, rjohari@stanford.edu); Khashayar Khosravi (Google Research, NYC, khosravi@google.com)
Pseudocode | Yes | Algorithm 1: Asymptotically Optimal UCB... Algorithm 2: Subsampled UCB (SS-UCB)... Algorithm 3: Greedy... Algorithm 4: Subsampled Greedy (SS-Greedy). A minimal sketch of the subsampled-greedy idea appears after the table.
Open Source Code | Yes | Our code is available at http://github.com/khashayarkhv/many-armed-bandit.
Open Datasets | Yes | We use the Letter Recognition Dataset [12] from the UCI repository.
Dataset Splits | No | The paper describes the experimental setup in terms of the time horizon (T) and the number of arms (k) for the multi-armed bandit problem, and the generation of synthetic reward functions and arm parameters. However, it does not provide specific train/validation/test dataset splits (e.g., percentages or sample counts) typically associated with machine learning model training.
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, memory, cloud instance types) used to run the experiments.
Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies, libraries, or solvers used in the experiments.
Experiment Setup | Yes | In Figure 1, we simulate several different algorithms over 400 simulations, for two pairs of (T, k) in the many-armed regime... Rewards are generated according to N(µi, 1), with µi ∼ U[0, 1]. The list of algorithms included is as follows: (1) UCB: Algorithm 1; (2) SS-UCB: Algorithm 2 with m = √T; (3) Greedy: Algorithm 3; (4) SS-Greedy: Algorithm 4 with m = T^(2/3) (see Theorem 5); (5) UCB-F: the UCB-F algorithm of [27] with the choice of confidence set Et = 2 log(10 log t); (6) TS: the Thompson Sampling algorithm [26, 22, 2]; and (7) SS-TS: subsampled TS with m = √T. We use the Letter Recognition Dataset [12] from the UCI repository... We generate k = 300 arms with parameters θ1, θ2, . . . , θk ∈ R^d (d will be specified shortly) and generate the reward of arm i via Y_it = X_t^⊤ θ_i + ε_it. We consider two experiments with d = 2 and d = 6... We generate 50 different instances, where we pick T = 8000 samples at random (from the original 20000 samples) and generate the arm parameters according to the uniform distribution on the ℓ2-ball in R^d, i.e., θ_i ∼ U_d = {u ∈ R^d : ‖u‖_2 ≤ 1}. Illustrative sketches of the synthetic and contextual setups follow the table.
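
To make the Pseudocode row concrete, here is a minimal sketch of the subsampled-greedy idea (Algorithm 4): subsample m of the k arms, pull each subsampled arm once, then always pull the arm with the highest empirical mean. The function name `ss_greedy` and the regret bookkeeping are ours for illustration, not the authors' released implementation.

```python
import numpy as np

def ss_greedy(mu, T, m, rng=None):
    """Minimal SS-Greedy sketch: subsample m arms, then act greedily.

    mu : true mean rewards of the k arms (rewards are drawn from N(mu_i, 1)).
    Returns the cumulative pseudo-regret against the best of all k arms.
    """
    rng = np.random.default_rng() if rng is None else rng
    mu = np.asarray(mu)
    k = len(mu)
    arms = rng.choice(k, size=min(m, k), replace=False)  # subsampling step
    counts = np.zeros(len(arms))
    sums = np.zeros(len(arms))
    regret = 0.0
    best = mu.max()
    for t in range(T):
        if t < len(arms):
            j = t                          # pull each subsampled arm once
        else:
            j = np.argmax(sums / counts)   # greedy: highest empirical mean
        reward = rng.normal(mu[arms[j]], 1.0)
        counts[j] += 1
        sums[j] += reward
        regret += best - mu[arms[j]]
    return regret
```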
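The next sketch mirrors the synthetic setup quoted in the Research Type and Experiment Setup rows (µi ∼ U[0, 1], rewards N(µi, 1), SS-Greedy with m = T^(2/3)), reusing the `ss_greedy` function above. The specific (T, k) pair and the number of runs below are placeholders, not the values behind the paper's Figure 1.

```python
import numpy as np

rng = np.random.default_rng(0)
T, k, n_runs = 2000, 2000, 10           # illustrative values only
m = int(np.ceil(T ** (2 / 3)))          # subsampling size suggested by Theorem 5

regrets = []
for _ in range(n_runs):
    mu = rng.uniform(0.0, 1.0, size=k)  # arm means drawn from the U[0, 1] prior
    regrets.append(ss_greedy(mu, T, m, rng))
print(f"mean per-run regret over {n_runs} runs: {np.mean(regrets):.1f}")
```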
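Finally, a sketch of the contextual reward model quoted in the Experiment Setup row: arm parameters θ_i drawn uniformly from the unit ℓ2-ball in R^d and rewards Y_it = X_t^⊤ θ_i + ε_it. The contexts X_t below are random stand-ins; the paper builds them from the UCI Letter Recognition Dataset, which is not downloaded here, and `sample_unit_ball` is our helper name.

```python
import numpy as np

def sample_unit_ball(n, d, rng):
    """Draw n points uniformly from {u in R^d : ||u||_2 <= 1}."""
    directions = rng.normal(size=(n, d))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    radii = rng.uniform(size=(n, 1)) ** (1.0 / d)   # radius ~ U^(1/d) gives a uniform ball
    return directions * radii

rng = np.random.default_rng(1)
d, k, T = 2, 300, 8000                     # the paper uses d in {2, 6}, k = 300, T = 8000
theta = sample_unit_ball(k, d, rng)        # arm parameters theta_1, ..., theta_k
X = rng.normal(size=(T, d))                # placeholder contexts (real ones come from the dataset)
noise = rng.normal(size=(T, k))
Y = X @ theta.T + noise                    # Y[t, i] = X_t . theta_i + eps_it
```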