Conservative Contextual Linear Bandits
Authors: Abbas Kazerouni, Mohammad Ghavamzadeh, Yasin Abbasi-Yadkori, Benjamin Van Roy
NeurIPS 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we provide simulation results to illustrate the performance of the proposed CLUCB algorithm. We considered a time-independent action set of 100 arms, each having a time-independent feature vector living in R^4. These feature vectors and the parameter θ are randomly drawn from N(0, I_4) such that the mean reward associated to each arm is positive. The observation noise at each time step is also generated independently from N(0, 1), and the mean reward of the baseline policy at any time is taken to be the reward associated to the 10th best action. We have taken λ = 1, δ = 0.001 and the results are averaged over 1,000 realizations. In Figure 1, we plot per-step regret (i.e., R_t/t) of LUCB and CLUCB for different values of α over a horizon T = 40,000. Figure 1 shows that per-step regret of CLUCB remains constant at the beginning (the conservative phase). This is because during this phase, CLUCB follows the baseline policy to make sure that the performance constraint (3) is satisfied. |
| Researcher Affiliation | Collaboration | Abbas Kazerouni Stanford University abbask@stanford.edu Mohammad Ghavamzadeh DeepMind ghavamza@google.com Yasin Abbasi-Yadkori Adobe Research abbasiya@adobe.com Benjamin Van Roy Stanford University bvr@stanford.edu |
| Pseudocode | Yes | Algorithm 1 CLUCB (a hedged sketch of its decision rule appears after the table) |
| Open Source Code | No | The paper does not provide any explicit statement about releasing source code or a link to a code repository for the methodology described. |
| Open Datasets | No | We considered a time-independent action set of 100 arms, each having a time-independent feature vector living in R^4. These feature vectors and the parameter θ are randomly drawn from N(0, I_4) such that the mean reward associated to each arm is positive. The observation noise at each time step is also generated independently from N(0, 1), and the mean reward of the baseline policy at any time is taken to be the reward associated to the 10th best action. |
| Dataset Splits | No | The paper describes simulation parameters and averaged results over multiple runs but does not provide specific training/validation/test dataset splits as it uses a simulated environment. |
| Hardware Specification | No | The paper does not specify any hardware details (e.g., GPU/CPU models, memory) used for running the simulations. |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., Python, PyTorch, specific libraries and their versions). |
| Experiment Setup | Yes | We have taken λ = 1, δ = 0.001 and the results are averaged over 1,000 realizations. (A simulation driver reproducing this setup is sketched after the table.) |
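
The paper's Algorithm 1 (CLUCB) plays the LinUCB optimistic action only when a high-probability lower bound on the cumulative reward stays above a (1 − α) fraction of the cumulative baseline reward, and falls back to the baseline arm otherwise. Below is a minimal sketch of that decision rule for the known-baseline-reward setting, assuming a standard ridge-regression confidence ellipsoid; the function name `run_clucb`, the specific confidence radius `beta`, and the choice to update the estimator on baseline rounds as well are illustrative assumptions, not the authors' code.

```python
import numpy as np

def run_clucb(features, theta, baseline_idx, alpha=0.05, T=40_000,
              lam=1.0, delta=0.001, noise_sd=1.0, seed=0):
    """One realization of a CLUCB-style policy; returns per-step regret R_t / t."""
    rng = np.random.default_rng(seed)
    K, d = features.shape
    mean_rewards = features @ theta
    r_baseline = mean_rewards[baseline_idx]   # known mean reward of the baseline arm
    best_mean = mean_rewards.max()

    V = lam * np.eye(d)                       # regularized Gram matrix
    b = np.zeros(d)                           # sum of reward-weighted features
    z = np.zeros(d)                           # summed features of rounds where the UCB arm was played
    baseline_total = 0.0                      # cumulative baseline mean reward up to round t
    baseline_collected = 0.0                  # baseline reward secured on conservative rounds
    inst_regret = np.zeros(T)

    for t in range(1, T + 1):
        V_inv = np.linalg.inv(V)
        theta_hat = V_inv @ b
        # Illustrative O(sqrt(d log t)) confidence radius (stand-in for the paper's beta_t).
        beta = np.sqrt(d * np.log((1.0 + t / lam) / delta)) + np.sqrt(lam)

        # Optimistic (LinUCB) arm.
        widths = np.sqrt(np.einsum('ij,jk,ik->i', features, V_inv, features))
        a_ucb = int(np.argmax(features @ theta_hat + beta * widths))

        # Safety check: pessimistic value of all optimistic plays so far plus the
        # candidate arm, plus baseline reward already secured, must not fall below
        # (1 - alpha) times the cumulative baseline reward.
        baseline_total += r_baseline
        z_cand = z + features[a_ucb]
        lower = z_cand @ theta_hat - beta * np.sqrt(z_cand @ V_inv @ z_cand)
        safe = lower + baseline_collected >= (1.0 - alpha) * baseline_total

        arm = a_ucb if safe else baseline_idx
        reward = mean_rewards[arm] + noise_sd * rng.standard_normal()

        if safe:
            z = z_cand
        else:
            baseline_collected += r_baseline

        V += np.outer(features[arm], features[arm])
        b += reward * features[arm]
        inst_regret[t - 1] = best_mean - mean_rewards[arm]

    return np.cumsum(inst_regret) / np.arange(1, T + 1)
```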
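
A small driver reconstructing the simulated environment quoted in the table (100 arms with fixed feature vectors in R^4, features and θ drawn from N(0, I_4), all mean rewards positive, baseline = 10th-best arm) might look as follows, using the `run_clucb` sketch above. Flipping the sign of arms with negative mean reward is one simple way to enforce the positivity condition, since the paper does not spell out how it was imposed; the horizon and number of realizations are reduced here only to keep the sketch quick.

```python
import numpy as np

def make_environment(K=100, d=4, seed=0):
    """Draw fixed arm features and theta from N(0, I_d) with all mean rewards positive."""
    rng = np.random.default_rng(seed)
    features = rng.standard_normal((K, d))
    theta = rng.standard_normal(d)
    # Assumption: flip arms whose mean reward is negative; the paper only states
    # that the draw is such that every mean reward is positive.
    features[features @ theta < 0] *= -1.0
    return features, theta

features, theta = make_environment()
baseline_idx = int(np.argsort(features @ theta)[-10])   # 10th-best action as the baseline

# Paper setting: T = 40,000 and 1,000 realizations; shortened here for speed.
runs = [run_clucb(features, theta, baseline_idx, alpha=0.05, T=5_000, seed=s)
        for s in range(10)]
avg_per_step_regret = np.mean(runs, axis=0)
print("final per-step regret:", avg_per_step_regret[-1])
```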