From Finite to Countable-Armed Bandits
Authors: Anand Kalvit, Assaf Zeevi
NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We also demonstrate empirically that a key concentration property of UCB1 that is pivotal to the aforementioned regret guarantee, is violated for Thompson Sampling (TS) and therefore, ALG(TS, Θ, 2) suffers linear regret. A formal statement of said concentration property of UCB1 is deferred to Section 4. The regret upper bound of ALG(UCB1, Θ, 2) is stated next in Theorem 3. Following is an auxiliary proposition that is useful towards Theorem 3. Empirical illustration. Figure 2 shows the histogram of the fraction of time a particular arm of a two-armed bandit having statistically identical arms with Bernoulli(0.5) rewards each was played under different algorithms. The leftmost plot corresponds to UCB1 and is evidently in consonance with the concentration property stated in part (i) of Theorem 4. The concentration phenomenon under UCB1 can be understood through the lens of reward stochasticity. Consider the simplest case where the rewards are deterministic. Then, we know from the structure of UCB1 that any arm is played at most twice before the algorithm switches over to the other arm. This results in N1(n)/n converging to 1/2 pathwise, with an arm switch-over time that is at most 2. As the reward stochasticity increases, so does the arm switch-over time, which adversely affects this convergence. While it is a priori unclear whether N1(n)/n would still converge to 1/2 in some mode if the rewards are stochastic, part (ii) of Theorem 4 states that the convergence indeed holds, albeit only in probability. A significant spread around 1/2 in the leftmost plot despite n = 10^4 plays indicates a rather slow convergence. (A UCB1 simulation sketch of this concentration experiment appears below the table.) |
| Researcher Affiliation | Academia | Anand Kalvit and Assaf Zeevi, Graduate School of Business, Columbia University, New York, USA. {akalvit22, assaf}@gsb.columbia.edu |
| Pseudocode | Yes | Algorithm 1 ETC(2): ETC for an infinite population of arms with \|T\| = 2. ... Algorithm 2 ALG(Ξ, Θ, 2): An algorithmic framework for countable-armed bandits with \|T\| = 2. (A generic explore-then-commit reference sketch appears below the table.) |
| Open Source Code | No | The paper does not provide any statements about releasing source code or links to a code repository for their methodology. |
| Open Datasets | No | The paper describes using synthetic reward distributions such as "Bernoulli(µi) rewards", "Gaussian(0.5,1) rewards", and "Bernoulli(0.5) rewards" for simulations. It does not use or provide access to a pre-existing, publicly available dataset in the conventional sense (e.g., CIFAR-10, MNIST). |
| Dataset Splits | No | The paper does not describe specific training, validation, or test dataset splits. It relies on theoretical analysis and simulations where data is generated from distributions, rather than processed from fixed datasets with predefined splits. |
| Hardware Specification | No | The paper does not specify any hardware details (e.g., GPU models, CPU types, memory) used for running the simulations or computations. |
| Software Dependencies | No | The paper mentions algorithms like UCB1 and Thompson Sampling, but does not specify any software libraries, frameworks, or their version numbers that would be needed to reproduce the work. |
| Experiment Setup | Yes | Theorem 3: "where m0 > 0 and γ > 2 are user-defined parameters that ensure Θ satisfies the conditions of Proposition 1 (for example, m0 = 11 and γ = 2.1 is an acceptable configuration)." Figure 1: "Monte-Carlo estimates of β plotted against using (2) with m0 = 4000 and γ = 2.1." Figure 2: "Bernoulli(0.5) rewards: Histogram of the fraction of plays of arm 1 until time n = 10,000, i.e., N1(10^4)/10^4, under three different algorithms. Number of replications under each algorithm ℵ = 20,000." (A Monte Carlo sketch reproducing this Figure 2 setup appears below the table.) |
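The Research Type row quotes the paper's discussion of how N1(n)/n concentrates around 1/2 under UCB1 for two statistically identical arms. The following minimal sketch (not the authors' code) simulates that setting; it assumes the standard UCB1 index of Auer et al. (empirical mean plus sqrt(2 ln t / N_i)), and the function name `ucb1_play_fraction` is illustrative.

```python
# Illustrative sketch, not from the paper: standard UCB1 on two statistically
# identical Bernoulli(0.5) arms, tracking the fraction of plays of arm 1.
import numpy as np


def ucb1_play_fraction(n=10_000, p=(0.5, 0.5), seed=None):
    """Run UCB1 for n rounds on two Bernoulli arms; return N1(n)/n."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(2)  # N_i(n): number of plays of arm i
    sums = np.zeros(2)    # cumulative reward collected from arm i
    for t in range(1, n + 1):
        if t <= 2:
            arm = t - 1  # initialization: play each arm once
        else:
            index = sums / counts + np.sqrt(2.0 * np.log(t) / counts)  # UCB1 index
            arm = int(np.argmax(index))
        counts[arm] += 1
        sums[arm] += float(rng.random() < p[arm])  # Bernoulli(p[arm]) reward
    return counts[0] / n  # fraction of plays of arm 1


if __name__ == "__main__":
    # With identical deterministic rewards, UCB1 plays each arm at most twice
    # before switching, so N1(n)/n sits at 1/2 pathwise; with Bernoulli(0.5)
    # rewards the fraction concentrates around 1/2 only in probability and
    # still shows noticeable spread at n = 10^4, as in the paper's Figure 2.
    print(ucb1_play_fraction(seed=0))
```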
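To mirror the Figure 2 setup quoted in the Experiment Setup row, the sketch below runs many independent replications of the same two-armed Bernoulli(0.5) experiment and histograms N1(10^4)/10^4. It is an illustrative reconstruction under the same UCB1 index assumption as above; the replication count is lowered from the paper's 20,000 for runtime and can be raised to match.

```python
# Illustrative sketch, not the authors' code: vectorized Monte Carlo over R
# replications of a two-armed Bernoulli(0.5) bandit under UCB1, followed by a
# histogram of N1(10^4)/10^4. R is an assumption (reduced from 20,000).
import numpy as np


def ucb1_fraction_histogram(n=10_000, R=2_000, seed=0):
    rng = np.random.default_rng(seed)
    counts = np.ones((R, 2))                          # each arm played once to initialize
    sums = (rng.random((R, 2)) < 0.5).astype(float)   # rewards of the two initial plays
    rows = np.arange(R)
    for t in range(3, n + 1):
        index = sums / counts + np.sqrt(2.0 * np.log(t) / counts)  # UCB1 index
        arm = np.argmax(index, axis=1)                # chosen arm in each replication
        reward = (rng.random(R) < 0.5).astype(float)  # Bernoulli(0.5) reward
        counts[rows, arm] += 1.0
        sums[rows, arm] += reward
    fractions = counts[:, 0] / n                      # N1(10^4)/10^4 per replication
    hist, edges = np.histogram(fractions, bins=20, range=(0.0, 1.0))
    return fractions, hist, edges


if __name__ == "__main__":
    fractions, hist, _ = ucb1_fraction_histogram()
    print("mean fraction of plays of arm 1:", fractions.mean())
    print("histogram counts over [0, 1]:", hist)
```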
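The Pseudocode row refers to the paper's Algorithm 1 (ETC(2)) and Algorithm 2 (ALG(Ξ, Θ, 2)), whose exact specifications are given in the paper and are not reproduced here. As a point of reference only, the sketch below is textbook explore-then-commit on a single pair of arms: explore each arm m times, then commit to the empirically better one. The exploration budget m and the hypothetical arm means are illustrative assumptions and are not the paper's parameters (m0, γ).

```python
# Reference-only sketch: textbook explore-then-commit (ETC) on two arms. This
# is NOT the paper's ETC(2) for an infinite arm population; it only
# illustrates the explore-then-commit template the paper builds on.
import numpy as np


def explore_then_commit(pull, n, m):
    """pull(arm) -> stochastic reward in [0, 1]; n = horizon; m = per-arm exploration budget."""
    rewards = np.zeros(2)
    # Exploration phase: play each arm m times.
    for arm in (0, 1):
        for _ in range(m):
            rewards[arm] += pull(arm)
    best = int(np.argmax(rewards / m))
    total = rewards.sum()
    # Commitment phase: play the empirically better arm for the rest of the horizon.
    for _ in range(n - 2 * m):
        total += pull(best)
    return total


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    means = (0.5, 0.6)  # hypothetical arm means, for illustration only
    pull = lambda arm: float(rng.random() < means[arm])
    print(explore_then_commit(pull, n=10_000, m=100))
```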