Conservative Contextual Linear Bandits

Authors: Abbas Kazerouni, Mohammad Ghavamzadeh, Yasin Abbasi Yadkori, Benjamin Van Roy

NeurIPS 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we provide simulation results to illustrate the performance of the proposed CLUCB algorithm. We considered a time-independent action set of 100 arms, each having a time-independent feature vector living in R^4. These feature vectors and the parameter θ are randomly drawn from N(0, I_4) such that the mean reward associated with each arm is positive. The observation noise at each time step is also generated independently from N(0, 1), and the mean reward of the baseline policy at any time is taken to be the reward associated with the 10th best action. We have taken λ = 1, δ = 0.001, and the results are averaged over 1,000 realizations. In Figure 1, we plot the per-step regret (i.e., R_t/t) of LUCB and CLUCB for different values of α over a horizon T = 40,000. Figure 1 shows that the per-step regret of CLUCB remains constant at the beginning (the conservative phase). This is because during this phase, CLUCB follows the baseline policy to make sure that the performance constraint (3) is satisfied. (A NumPy sketch of this simulated environment appears below the table.)
Researcher Affiliation | Collaboration | Abbas Kazerouni, Stanford University, abbask@stanford.edu; Mohammad Ghavamzadeh, DeepMind, ghavamza@google.com; Yasin Abbasi-Yadkori, Adobe Research, abbasiya@adobe.com; Benjamin Van Roy, Stanford University, bvr@stanford.edu
Pseudocode | Yes | Algorithm 1 CLUCB (a sketch of the CLUCB decision rule appears below the table)
Open Source Code | No | The paper does not provide any explicit statement about releasing source code or a link to a code repository for the methodology described.
Open Datasets | No | We considered a time-independent action set of 100 arms, each having a time-independent feature vector living in R^4. These feature vectors and the parameter θ are randomly drawn from N(0, I_4) such that the mean reward associated with each arm is positive. The observation noise at each time step is also generated independently from N(0, 1), and the mean reward of the baseline policy at any time is taken to be the reward associated with the 10th best action.
Dataset Splits | No | The paper describes simulation parameters and results averaged over multiple runs, but does not provide specific training/validation/test dataset splits, as it uses a simulated environment.
Hardware Specification | No | The paper does not specify any hardware details (e.g., GPU/CPU models, memory) used for running the simulations.
Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., Python, PyTorch, specific libraries and their versions).
Experiment Setup | Yes | We have taken λ = 1, δ = 0.001, and the results are averaged over 1,000 realizations.
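
The Research Type and Open Datasets rows describe a fully simulated environment, so the data-generation step can be reconstructed from the quoted text alone. Below is a minimal NumPy sketch of that setup, assuming the constants quoted in the paper; the sign-flip used to make every mean reward positive is an assumption (the paper does not say how the positivity condition was enforced), and the function names and seed are illustrative only.

```python
import numpy as np

# Constants quoted in the paper's experiment section.
N_ARMS, D = 100, 4          # 100 arms, feature vectors in R^4
LAM, DELTA = 1.0, 0.001     # regularization and confidence parameters
T = 40_000                  # horizon used for Figure 1

rng = np.random.default_rng(0)

# Arm features and theta drawn from N(0, I_4).
features = rng.standard_normal((N_ARMS, D))
theta = rng.standard_normal(D)

# The paper requires every mean reward <x_a, theta> to be positive but does not
# say how; flipping the sign of offending feature vectors is one plausible way
# to enforce it while keeping each x_a marginally N(0, I_4) (an assumption).
signs = np.sign(features @ theta)
features *= signs[:, None]
mean_rewards = features @ theta

# Baseline policy: always play the 10th best arm, as quoted above.
baseline_arm = int(np.argsort(mean_rewards)[-10])
r_baseline = float(mean_rewards[baseline_arm])

def pull(arm: int) -> float:
    """Observed reward = mean reward + N(0, 1) noise, as in the paper."""
    return float(mean_rewards[arm] + rng.standard_normal())
```

The reported curves would then come from running LUCB and CLUCB on this environment and averaging the per-step regret R_t/t over 1,000 independent realizations.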
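
The Pseudocode row points to Algorithm 1 (CLUCB). As a rough companion to the environment sketch above, the following is a hedged sketch of the conservative decision rule for the case where the baseline reward is known: play the optimistic (LUCB) arm only if a lower confidence bound on the cumulative reward keeps the budget of constraint (3) satisfied, and otherwise fall back to the baseline. The class name, the confidence-width formula, and the choice to skip statistics updates on baseline plays are assumptions, not verbatim from the paper's pseudocode.

```python
import numpy as np

class CLUCB:
    """Sketch of a CLUCB-style rule (known baseline reward); not a verified
    re-implementation of the paper's Algorithm 1."""

    def __init__(self, features, r_baseline, alpha, lam=1.0, delta=0.001,
                 noise_sd=1.0, theta_bound=1.0):
        self.X = features                      # (n_arms, d) fixed action set
        self.r_baseline = r_baseline           # known baseline mean reward
        self.alpha = alpha                     # conservatism level in constraint (3)
        self.lam, self.delta = lam, delta
        self.noise_sd, self.theta_bound = noise_sd, theta_bound
        d = features.shape[1]
        self.V = lam * np.eye(d)               # regularized design matrix
        self.b = np.zeros(d)                   # sum of x_t * reward_t
        self.z = np.zeros(d)                   # sum of features of non-baseline plays
        self.n_baseline = 0                    # number of baseline plays so far
        self.t = 0

    def _beta(self):
        # Standard linear-bandit confidence width; constants may differ
        # from the paper (assumption).
        d = self.X.shape[1]
        L = np.max(np.linalg.norm(self.X, axis=1))
        log_term = np.log((1 + (self.t + 1) * L**2 / self.lam) / self.delta)
        return self.noise_sd * np.sqrt(d * log_term) + np.sqrt(self.lam) * self.theta_bound

    def select(self):
        self.t += 1
        V_inv = np.linalg.inv(self.V)
        theta_hat = V_inv @ self.b
        beta = self._beta()

        # Optimistic (LUCB) action: maximize the upper confidence bound.
        widths = np.sqrt(np.einsum('ij,jk,ik->i', self.X, V_inv, self.X))
        a_opt = int(np.argmax(self.X @ theta_hat + beta * widths))

        # Conservative check: a lower bound on the reward of all non-baseline
        # plays (past ones plus the candidate), plus the reward already earned
        # by baseline plays, must stay above (1 - alpha) times the cumulative
        # baseline reward.
        v = self.z + self.X[a_opt]
        lcb = v @ theta_hat - beta * np.sqrt(v @ V_inv @ v)
        budget_ok = (lcb + self.n_baseline * self.r_baseline
                     >= (1 - self.alpha) * self.t * self.r_baseline)
        if budget_ok:
            return a_opt, False
        return None, True                      # play the baseline action instead

    def update(self, x, reward, played_baseline):
        if played_baseline:
            # With a known baseline reward, baseline plays need not update the
            # least-squares statistics; skipped here for simplicity (assumption).
            self.n_baseline += 1
            return
        self.z += x
        self.V += np.outer(x, x)
        self.b += x * reward
```

Looping `select`/`pull`/`update` for T rounds on the environment above, and averaging R_t/t over many realizations, would produce the qualitative behavior described for Figure 1: a flat per-step regret during the conservative phase (baseline plays) that decays once CLUCB switches to optimistic actions.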