Contextual Information-Directed Sampling

Authors: Botao Hao, Tor Lattimore, Chao Qin

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We further propose a computationally-efficient version of contextual IDS based on Actor-Critic and evaluate it empirically on a neural network contextual bandit.
Researcher Affiliation | Collaboration | DeepMind; Columbia University.
Pseudocode | No | The paper describes algorithms and derivations in prose and mathematical notation but does not include structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement about releasing source code or a link to a code repository for the methodology described.
Open Datasets | No | At each round t, the environment independently generates an observation in the form of a d-dimensional contextual vector x_t from some distribution. [...] The contextual vector x_t ∈ R^10 is sampled from N(0, I_10) and the noise is sampled from a standard Gaussian distribution.
Dataset Splits | No | The experiment is conducted in a simulated neural network contextual bandit environment that continuously generates data, so traditional train/validation/test dataset splits are not applicable or specified.
Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments.
Software Dependencies | No | We set the generative model fθ to be a 2-hidden-layer ReLU MLP with 10 hidden neurons. [...] Both the policy network and the value network are 2-hidden-layer ReLU MLPs with 20 hidden neurons, optimized by Adam with learning rate 0.001.
Experiment Setup | Yes | We set the generative model fθ to be a 2-hidden-layer ReLU MLP with 10 hidden neurons. The number of actions is 5. The contextual vector x_t ∈ R^10 is sampled from N(0, I_10) and the noise is sampled from a standard Gaussian distribution. [...] we use 10 ensembles in our experiment. With 200 posterior samples, we use the same approach described by Lu et al. (2021) to approximate the one-step regret and information gain for both conditional IDS and contextual IDS. We sample 20 independent contextual vectors at each round. Both the policy network and the value network are 2-hidden-layer ReLU MLPs with 20 hidden neurons, optimized by Adam with learning rate 0.001. (Hedged sketches of this environment and of the actor-critic networks follow the table.)
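
To make the reported setup concrete, here is a minimal sketch of the simulated neural network contextual bandit environment, assuming PyTorch. The paper releases no code, so the names (GenerativeMLP, sample_round) and the context-in/reward-per-action-out interface are illustrative assumptions; the ensemble posterior approximation (10 ensembles, 200 posterior samples, following Lu et al. (2021)) is omitted.

```python
import torch
import torch.nn as nn

D_CONTEXT, N_ACTIONS, HIDDEN = 10, 5, 10  # dimensions stated in the paper's setup


class GenerativeMLP(nn.Module):
    """2-hidden-layer ReLU MLP f_theta with 10 hidden neurons.

    Assumed here to map a context to one mean reward per action."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(D_CONTEXT, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, N_ACTIONS),
        )

    def forward(self, x):
        return self.net(x)


def sample_round(f_theta, action):
    """One environment round: draw x_t ~ N(0, I_10), return a noisy reward."""
    x_t = torch.randn(D_CONTEXT)                      # context from N(0, I_10)
    mean_rewards = f_theta(x_t)                       # mean reward of each of the 5 actions
    reward = mean_rewards[action] + torch.randn(())   # standard Gaussian noise
    return x_t, reward
```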
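
Likewise, a minimal sketch of the policy and value networks for the computationally-efficient Actor-Critic variant, under the stated configuration (2-hidden-layer ReLU MLPs with 20 hidden neurons, Adam with learning rate 0.001). The input/output dimensions and the softmax-policy action sampling are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


def two_layer_mlp(in_dim, out_dim, hidden=20):
    # 2-hidden-layer ReLU MLP with 20 hidden neurons, as stated in the paper.
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )


policy_net = two_layer_mlp(10, 5)   # context -> action logits (assumed interface)
value_net = two_layer_mlp(10, 1)    # context -> scalar value (assumed interface)

# Both networks optimized by Adam with learning rate 0.001, per the paper.
policy_opt = torch.optim.Adam(policy_net.parameters(), lr=1e-3)
value_opt = torch.optim.Adam(value_net.parameters(), lr=1e-3)


def act(x_t):
    # Draw an action from the softmax policy over the 5 actions.
    probs = torch.softmax(policy_net(x_t), dim=-1)
    return torch.multinomial(probs, num_samples=1).item()
```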