Stochastic Multi-Armed-Bandit Problem with Non-stationary Rewards
Authors: Omar Besbes, Yonatan Gur, Assaf Zeevi
NeurIPS 2014
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | The main contribution of this paper lies in fully characterizing the (regret) complexity of a broad class of MAB problems with non-stationary reward structure by establishing a direct link between the extent of reward variation and the minimal achievable regret. More specifically, the paper's contributions are along four dimensions. On the modeling side we formulate a class of non-stationary reward structures that is quite general, and hence can be used to realistically capture a variety of real-world type phenomena, yet is mathematically tractable. ... For a general class of non-stationary reward distributions we establish lower bounds on the performance of any non-anticipating policy relative to the dynamic oracle, and show that these bounds can be achieved, uniformly over the class of admissible reward distributions, by a suitable policy construction. The term "achieved" is meant in the sense of the order of the regret as a function of the time horizon T, the variation budget V_T, and the number of arms K. Our policies are shown to be minimax optimal up to a term that is logarithmic in the number of arms, and the regret is sublinear and is of order (K V_T)^{1/3} T^{2/3}. Our analysis complements previously studied non-stationary instances by treating a broad and flexible class of temporal changes in the reward distributions, yet still establishing optimality results and showing that sublinear regret is achievable. |
| Researcher Affiliation | Academia | Omar Besbes, Columbia University, New York, NY, ob2105@columbia.edu; Yonatan Gur, Stanford University, Stanford, CA, ygur@stanford.edu; Assaf Zeevi, Columbia University, New York, NY, assaf@gsb.columbia.edu |
| Pseudocode | Yes | Rexp3. Inputs: a positive number γ, and a batch size Δ_T. 1. Set batch index j = 1. 2. Repeat while j ≤ ⌈T/Δ_T⌉: (a) Set τ = (j−1)Δ_T. (b) Initialization: for any k ∈ K set w_t^k = 1. (c) Repeat for t = τ+1, ..., min{T, τ+Δ_T}: for each k ∈ K, set p_t^k = (1−γ) w_t^k / Σ_{k'∈K} w_t^{k'} + γ/K; draw an arm k' from K according to the distribution {p_t^k}_{k∈K}; receive a reward X_t^{k'}; for k' set X̂_t^{k'} = X_t^{k'} / p_t^{k'}, and for any k ≠ k' set X̂_t^k = 0; for all k ∈ K update w_{t+1}^k = w_t^k exp(γ X̂_t^k / K). (d) Set j = j + 1, and return to the beginning of step 2. (A runnable sketch of this procedure follows the table.) |
| Open Source Code | No | No, the paper does not provide an explicit statement or link indicating that the source code for the methodology described in this paper is openly available. |
| Open Datasets | No | No, this is a theoretical paper that does not involve empirical experiments with datasets. Therefore, it does not specify a publicly available training dataset. |
| Dataset Splits | No | No, this is a theoretical paper that does not involve empirical experiments with datasets, so it does not provide details on training/validation/test dataset splits. |
| Hardware Specification | No | No, this is a theoretical paper and does not mention any specific hardware used for running experiments. |
| Software Dependencies | No | No, this is a theoretical paper that describes algorithms and proofs, and as such, it does not mention specific software dependencies with version numbers for experimental reproducibility. |
| Experiment Setup | No | No, this is a theoretical paper that does not describe empirical experiments, and therefore does not include specific experimental setup details such as hyperparameter values or training configurations. |
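
As a companion to the Rexp3 pseudocode quoted in the table, here is a minimal Python sketch of the procedure. The paper releases no code, so everything below is an illustrative reconstruction: the function names (`rexp3`, `draw_reward`, `drifting_means`) are hypothetical, and the batch size Δ_T and exploration rate γ are set to the orders the paper suggests, Δ_T = ⌈(K log K)^{1/3}(T/V_T)^{2/3}⌉ and γ = min{1, sqrt(K log K / ((e−1)Δ_T))}.

```python
import math
import random


def rexp3(T, K, V_T, draw_reward):
    """Minimal sketch of the Rexp3 policy (Exp3 restarted in batches).

    T: horizon, K: number of arms, V_T: variation budget,
    draw_reward(t, k): returns a reward in [0, 1] for pulling arm k at time t.
    Returns the list of (arm, reward) pairs played over the horizon.
    """
    # Batch size and exploration rate set to the orders used in the paper's analysis.
    delta_T = max(1, math.ceil((K * math.log(K)) ** (1 / 3) * (T / V_T) ** (2 / 3)))
    gamma = min(1.0, math.sqrt(K * math.log(K) / ((math.e - 1) * delta_T)))

    history = []
    t = 0
    while t < T:
        # Restart Exp3: reset all weights at the start of each batch.
        weights = [1.0] * K
        for _ in range(min(delta_T, T - t)):
            total = sum(weights)
            probs = [(1 - gamma) * w / total + gamma / K for w in weights]
            arm = random.choices(range(K), weights=probs)[0]
            reward = draw_reward(t, arm)
            history.append((arm, reward))
            # Importance-weighted reward estimate: nonzero only for the pulled arm.
            x_hat = reward / probs[arm]
            weights[arm] *= math.exp(gamma * x_hat / K)
            t += 1
    return history


# Example usage: two Bernoulli arms whose mean rewards drift slowly over time.
if __name__ == "__main__":
    def drifting_means(t):
        return (0.5 + 0.2 * math.sin(2 * math.pi * t / 1000), 0.5)

    pulls = rexp3(T=10_000, K=2, V_T=2.0,
                  draw_reward=lambda t, k: float(random.random() < drifting_means(t)[k]))
    print("cumulative reward:", sum(r for _, r in pulls))
```

The restart-per-batch structure is the whole point of Rexp3: within each batch of length Δ_T the rewards are treated as approximately stationary and handled by plain Exp3, while the periodic resets keep the policy from committing to an arm whose mean has drifted away under the variation budget V_T.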