Stochastic Multi-Armed-Bandit Problem with Non-stationary Rewards

Authors: Omar Besbes, Yonatan Gur, Assaf Zeevi

NeurIPS 2014 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | The main contribution of this paper lies in fully characterizing the (regret) complexity of a broad class of MAB problems with non-stationary reward structure by establishing a direct link between the extent of reward variation and the minimal achievable regret. More specifically, the paper's contributions are along four dimensions. On the modeling side, we formulate a class of non-stationary reward structures that is quite general, and hence can be used to realistically capture a variety of real-world phenomena, yet is mathematically tractable. ... For a general class of non-stationary reward distributions we establish lower bounds on the performance of any non-anticipating policy relative to the dynamic oracle, and show that these bounds can be achieved, uniformly over the class of admissible reward distributions, by a suitable policy construction. The term "achieved" is meant in the sense of the order of the regret as a function of the time horizon T, the variation budget V_T, and the number of arms K. Our policies are shown to be minimax optimal up to a term that is logarithmic in the number of arms, and the regret is sublinear and of order (K V_T)^{1/3} T^{2/3}. Our analysis complements previously studied non-stationary instances by treating a broad and flexible class of temporal changes in the reward distributions, yet still establishing optimality results and showing that sublinear regret is achievable. (The regret benchmark and the minimax rate are restated in the sketch after this table.)
Researcher Affiliation | Academia | Omar Besbes, Columbia University, New York, NY (ob2105@columbia.edu); Yonatan Gur, Stanford University, Stanford, CA (ygur@stanford.edu); Assaf Zeevi, Columbia University, New York, NY (assaf@gsb.columbia.edu)
Pseudocode | Yes | Rexp3. Inputs: a positive number γ and a batch size Δ_T. (A runnable sketch follows this table.)
1. Set batch index j = 1.
2. Repeat while j ≤ ⌈T/Δ_T⌉:
   (a) Set τ = (j − 1) Δ_T.
   (b) Initialization: for any k ∈ K set w_t^k = 1.
   (c) Repeat for t = τ + 1, ..., min{T, τ + Δ_T}:
       - For each k ∈ K, set p_t^k = (1 − γ) w_t^k / Σ_{k'∈K} w_t^{k'} + γ/K.
       - Draw an arm k' from K according to the distribution {p_t^k}_{k∈K}.
       - Receive a reward X_t^{k'}.
       - For k' set X̂_t^{k'} = X_t^{k'} / p_t^{k'}, and for any k ≠ k' set X̂_t^k = 0.
       - For all k ∈ K update: w_{t+1}^k = w_t^k exp(γ X̂_t^k / K).
   (d) Set j = j + 1, and return to the beginning of step 2.
Open Source Code | No | No, the paper does not provide an explicit statement or link indicating that the source code for the methodology described in this paper is openly available.
Open Datasets | No | No, this is a theoretical paper that does not involve empirical experiments with datasets. Therefore, it does not specify a publicly available training dataset.
Dataset Splits | No | No, this is a theoretical paper that does not involve empirical experiments with datasets, so it does not provide details on training/validation/test dataset splits.
Hardware Specification | No | No, this is a theoretical paper and does not mention any specific hardware used for running experiments.
Software Dependencies | No | No, this is a theoretical paper that describes algorithms and proofs, and as such, it does not mention specific software dependencies with version numbers for experimental reproducibility.
Experiment Setup | No | No, this is a theoretical paper that does not describe empirical experiments, and therefore does not include specific experimental setup details such as hyperparameter values or training configurations.
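For reference, a minimal LaTeX restatement of the quantities the Research Type response above refers to, reconstructed from the paper's abstract and not verbatim from the text: the regret of a policy π is measured against the dynamic oracle that pulls the arm with the highest mean reward in every round, and the minimax rate over the variation-budget class is of order (K V_T)^{1/3} T^{2/3}, with the upper bound matching the lower bound up to a term logarithmic in K. Here μ_t^k denotes the mean reward of arm k at time t and 𝒱 the set of mean-reward sequences with total variation at most V_T, following the paper's notation.

R^{\pi}(V_T, T) = \sup_{\mu \in \mathcal{V}} \left\{ \sum_{t=1}^{T} \max_{k \in \mathcal{K}} \mu_t^{k} - \mathbb{E}^{\pi}\!\left[ \sum_{t=1}^{T} \mu_t^{\pi_t} \right] \right\},
\qquad
\inf_{\pi} R^{\pi}(V_T, T) \asymp (K V_T)^{1/3}\, T^{2/3} .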
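Below is a minimal Python sketch of the Rexp3 pseudocode in the table above (Exp3 restarted every Δ_T rounds). The function name rexp3 and the environment callable reward_fn are illustrative choices, not from the paper; γ and the batch size are left as free inputs, whereas the paper tunes them as functions of K, V_T, and T.

import numpy as np

def rexp3(reward_fn, T, K, gamma, batch_size, seed=0):
    # reward_fn(t, k) -> observed reward X_t^k in [0, 1]; a stand-in for the
    # non-stationary bandit environment (not part of the paper's pseudocode).
    rng = np.random.default_rng(seed)
    total_reward = 0.0
    t = 0
    while t < T:                                    # step 2: loop over batches
        w = np.ones(K)                              # (b) reset Exp3 weights
        for _ in range(min(batch_size, T - t)):     # (c) run Exp3 within the batch
            p = (1.0 - gamma) * w / w.sum() + gamma / K   # mix in uniform exploration
            k = rng.choice(K, p=p)                  # draw an arm from p_t
            x = reward_fn(t, k)                     # receive reward X_t^k
            total_reward += x
            x_hat = np.zeros(K)
            x_hat[k] = x / p[k]                     # importance-weighted estimate
            w *= np.exp(gamma * x_hat / K)          # multiplicative weight update
            t += 1
        # (d) advance to the next batch; weights are re-initialized above
    return total_reward

The periodic restart is the point of the construction: discarding the weights accumulated in earlier batches lets the policy track changes in the identity of the best arm, which is why the batch size trades off against the variation budget V_T.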