Offline Contextual Bandits with Overparameterized Models
Authors: David Brandfonbrener, William Whitney, Rajesh Ranganath, Joan Bruna
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we demonstrate the gap in both action stability and bandit error between policy-based and value-based algorithms when using large neural network models on synthetic and image-based datasets. |
| Researcher Affiliation | Academia | 1Courant Institute of Mathematical Sciences, New York University, New York, New York, USA. |
| Pseudocode | No | The paper describes algorithms and mathematical formulations (e.g., Equations 1-6) but does not include structured pseudocode blocks or algorithms labeled as such. |
| Open Source Code | Yes | Code can be found at https://github.com/davidbrandfonbrener/deep-offline-bandits. |
| Open Datasets | Yes | We use a bandit version of CIFAR-10 (Krizhevsky, 2009). To turn CIFAR into an offline bandit problem we view each possible label as an action and assign a reward of 1 for a correct label/action and 0 for an incorrect label/action. (A conversion sketch appears after the table.) |
| Dataset Splits | No | For these experiments we set K = 2, d = 10, ϵ = 0.1. We take N = 100 training points and sample an independent test set of 500 points. (The paper specifies training and test sets but does not mention a separate validation set or cross-validation setup.) |
| Hardware Specification | No | The paper mentions using 'MLPs' and 'Resnet-18' models, but it does not specify any hardware details such as particular GPU or CPU models, or memory configurations used for running the experiments. |
| Software Dependencies | No | We train Resnet-18 (He et al., 2016) models using Pytorch (Paszke et al., 2019). (The paper mentions PyTorch but does not specify a version number or list other software dependencies with versions; a model-loading sketch appears after the table.) |
| Experiment Setup | Yes | For these experiments we set K = 2, d = 10, ϵ = 0.1. We take N = 100 training points and sample an independent test set of 500 points. As our models we use MLPs with one hidden layer of width 512. (...) Full details about the training procedure along with learning curves and further results are in Appendix E. (A synthetic-setup sketch appears after the table.) |
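
The Open Datasets row describes converting CIFAR-10 into an offline bandit problem by treating each of the 10 labels as an action with a 0/1 reward. Below is a minimal sketch of that conversion, assuming a `make_bandit_dataset` helper and a uniform behavior policy; the function name, policy, and logging format are illustrative assumptions, not the authors' code (which is available in their repository).

```python
# Sketch: turn CIFAR-10 classification data into logged bandit feedback.
# Reward is 1 if the logged action matches the true label, else 0; only the
# chosen action's reward is observed, as in the paper's offline bandit setup.
import numpy as np
import torchvision

def make_bandit_dataset(n_samples, behavior_policy, seed=0):
    """Log (context, action, reward) tuples from CIFAR-10 training images."""
    rng = np.random.default_rng(seed)
    cifar = torchvision.datasets.CIFAR10(root="./data", train=True, download=True)
    idx = rng.choice(len(cifar), size=n_samples, replace=False)

    contexts, actions, rewards = [], [], []
    for i in idx:
        image, label = cifar[i]
        action = behavior_policy(image, rng)      # logged action in {0, ..., 9}
        reward = 1.0 if action == label else 0.0  # bandit feedback for the chosen action only
        contexts.append(np.asarray(image))
        actions.append(action)
        rewards.append(reward)
    return np.stack(contexts), np.array(actions), np.array(rewards)

# Example behavior policy: uniform over the 10 actions.
uniform_policy = lambda image, rng: int(rng.integers(10))
X, A, R = make_bandit_dataset(n_samples=1000, behavior_policy=uniform_policy)
```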
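
The Software Dependencies row notes that the image experiments train ResNet-18 models with PyTorch but pin no versions. A minimal sketch of instantiating such a model with torchvision is shown below; replacing the final layer with 10 outputs (one per CIFAR-10 action) is an assumption for illustration, and no version is pinned because the paper does not specify one.

```python
# Sketch: instantiate a ResNet-18 (He et al., 2016) in PyTorch/torchvision.
import torch.nn as nn
import torchvision.models as models

resnet = models.resnet18()                         # randomly initialized ResNet-18
resnet.fc = nn.Linear(resnet.fc.in_features, 10)   # one output per CIFAR-10 action (assumption)
```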
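
The Experiment Setup row specifies K = 2 actions, d = 10 context dimensions, N = 100 training points, 500 test points, and MLPs with one hidden layer of width 512. The sketch below builds such a model; the layer arrangement, random placeholder contexts, and the greedy action selection for a value-based policy are assumptions for illustration, not the authors' training code.

```python
# Sketch: overparameterized MLP from the synthetic experiments
# (one hidden layer of width 512, d = 10 inputs, K = 2 outputs).
import torch
import torch.nn as nn

K, d, N, N_test = 2, 10, 100, 500  # actions, context dim, train size, test size

class MLP(nn.Module):
    def __init__(self, in_dim=d, hidden=512, out_dim=K):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)  # one score per action

model = MLP()
contexts = torch.randn(N, d)           # placeholder synthetic contexts
values = model(contexts)               # shape (N, K)
greedy_actions = values.argmax(dim=1)  # value-based policies act greedily on the estimates
```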