Offline Contextual Bandits with Overparameterized Models

Authors: David Brandfonbrener, William Whitney, Rajesh Ranganath, Joan Bruna

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we demonstrate the gap in both action stability and bandit error between policy-based and value-based algorithms when using large neural network models on synthetic and image-based datasets.
Researcher Affiliation | Academia | Courant Institute of Mathematical Sciences, New York University, New York, New York, USA.
Pseudocode | No | The paper describes algorithms and mathematical formulations (e.g., Equations 1-6) but does not include structured pseudocode blocks or algorithms labeled as such.
Open Source Code | Yes | Code can be found at https://github.com/davidbrandfonbrener/deep-offline-bandits.
Open Datasets | Yes | We will use a bandit version of CIFAR-10 (Krizhevsky, 2009). To turn CIFAR into an offline bandit problem we view each possible label as an action and assign reward of 1 for a correct label/action and 0 for an incorrect label/action. (A code sketch of this conversion appears below the table.)
Dataset Splits | No | For these experiments we set K = 2, d = 10, ϵ = 0.1. We take N = 100 training points and sample an independent test set of 500 points. (The paper specifies training and test sets but does not mention a separate validation set or cross-validation setup.)
Hardware Specification | No | The paper mentions MLP and Resnet-18 models, but it does not specify hardware details such as GPU or CPU models or memory configurations used for the experiments.
Software Dependencies | No | We train Resnet-18 (He et al., 2016) models using Pytorch (Paszke et al., 2019). (The paper mentions PyTorch but does not specify a version number or other software dependencies with their versions.)
Experiment Setup | Yes | For these experiments we set K = 2, d = 10, ϵ = 0.1. We take N = 100 training points and sample an independent test set of 500 points. As our models we use MLPs with one hidden layer of width 512. (...) Full details about the training procedure along with learning curves and further results are in Appendix E. (A sketch of a model at this scale follows the table.)
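As a concrete illustration of the CIFAR-10 bandit conversion quoted in the Open Datasets row, the sketch below builds an offline bandit dataset in which each of the 10 labels is an action and the reward is 1 for the correct action and 0 otherwise. The uniform-random logging policy and the helper name `cifar_to_bandit` are illustrative assumptions, not details taken from the paper.

```python
import numpy as np
import torch
from torchvision import datasets, transforms

def cifar_to_bandit(root="./data", num_actions=10, seed=0):
    """Turn CIFAR-10 into an offline bandit dataset: labels become actions,
    and the reward is 1 for the correct action and 0 otherwise."""
    rng = np.random.default_rng(seed)
    train = datasets.CIFAR10(root=root, train=True, download=True,
                             transform=transforms.ToTensor())
    contexts = torch.stack([x for x, _ in train])   # images serve as contexts
    labels = torch.tensor([y for _, y in train])    # ground-truth labels
    # Assumed logging policy: uniform-random actions (a placeholder, not the paper's).
    actions = torch.tensor(rng.integers(0, num_actions, size=len(train)))
    rewards = (actions == labels).float()           # 1 if the logged action is correct, else 0
    return contexts, actions, rewards
```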
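The Experiment Setup row quotes the synthetic configuration (K = 2 actions, d = 10 dimensional contexts, N = 100 training points, a 500-point test set) and the model class (an MLP with one hidden layer of width 512). The minimal PyTorch sketch below shows a model at that scale; the ReLU activation and default initialization are assumptions, since the paper defers training details to its Appendix E.

```python
import torch
import torch.nn as nn

K, d, N = 2, 10, 100   # actions, context dimension, training points (as quoted above)

# Assumed architecture details: ReLU activation, default PyTorch initialization.
model = nn.Sequential(
    nn.Linear(d, 512),
    nn.ReLU(),
    nn.Linear(512, K),
)

# Even this small MLP has far more parameters than the N = 100 training points,
# i.e. it is overparameterized in the sense the paper studies.
print(sum(p.numel() for p in model.parameters()))  # 6658 parameters
```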