Contextual Information-Directed Sampling
Authors: Botao Hao, Tor Lattimore, Chao Qin
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We further propose a computationally-efficient version of contextual IDS based on Actor Critic and evaluate it empirically on a neural network contextual bandit. |
| Researcher Affiliation | Collaboration | DeepMind; Columbia University. |
| Pseudocode | No | The paper describes algorithms and derivations in prose and mathematical notation but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code or a link to a code repository for the methodology described. |
| Open Datasets | No | At each round t, the environment independently generates an observation in the form of a d-dimensional contextual vector xt from some distribution. [...] The contextual vector xt ∈ R^10 is sampled from N(0, I_10) and the noise is sampled from a standard Gaussian distribution. |
| Dataset Splits | No | The experiment is conducted in a simulated neural network contextual bandit environment, which continuously generates data. Therefore, traditional train/validation/test dataset splits are not applicable or specified. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments. |
| Software Dependencies | No | We set the generative model fθ being a 2-hidden-layer ReLU MLP with 10 hidden neurons. [...] Both the policy network and value network are using 2-hidden-layer ReLU MLP with 20 hidden neurons and optimized by Adam with learning rate 0.001. |
| Experiment Setup | Yes | We set the generative model fθ being a 2-hidden-layer ReLU MLP with 10 hidden neurons. The number of actions is 5. The contextual vector xt ∈ R^10 is sampled from N(0, I_10) and the noise is sampled from a standard Gaussian distribution. [...] we use 10 ensembles in our experiment. With 200 posterior samples, we use the same way described by Lu et al. (2021) to approximate the one-step regret and information gain for both conditional IDS and contextual IDS. We sample 20 independent contextual vectors at each round. Both the policy network and value network are using 2-hidden-layer ReLU MLP with 20 hidden neurons and optimized by Adam with learning rate 0.001. |
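The environment described in the Experiment Setup row can be sketched as follows. This is a minimal NumPy reconstruction under stated assumptions: the paper gives the context distribution N(0, I_10), a 2-hidden-layer ReLU MLP fθ with 10 hidden neurons, 5 actions, and standard Gaussian reward noise, but it does not specify how actions enter fθ — here we assume one output head per action, and the weight initialization is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

D, H, A = 10, 10, 5  # context dim, hidden width, number of actions (from the paper)

# Randomly initialized 2-hidden-layer ReLU MLP f_theta. Mapping a context to
# one mean reward per action is an assumption; the paper does not describe
# the output head structure.
W1 = rng.normal(size=(H, D)) / np.sqrt(D)
W2 = rng.normal(size=(H, H)) / np.sqrt(H)
W3 = rng.normal(size=(A, H)) / np.sqrt(H)

def f_theta(x):
    """Mean reward of each of the A actions for context x."""
    h = np.maximum(W1 @ x, 0.0)   # ReLU hidden layer 1
    h = np.maximum(W2 @ h, 0.0)   # ReLU hidden layer 2
    return W3 @ h

def step():
    """One environment round: draw x_t ~ N(0, I_10) and noisy rewards."""
    x = rng.normal(size=D)
    rewards = f_theta(x) + rng.normal(size=A)  # standard Gaussian noise
    return x, rewards

x, rewards = step()
```

Because the environment generates fresh contexts every round, there is no fixed dataset to split, which matches the "Dataset Splits: No" entry above.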