Differentiable Meta-Learning of Bandit Policies
Authors: Craig Boutilier, Chih-wei Hsu, Branislav Kveton, Martin Mladenov, Csaba Szepesvari, Manzil Zaheer
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show the versatility of our approach. We also observe that neural network policies can learn implicit biases expressed only through the sampled instances. |
| Researcher Affiliation | Collaboration | Craig Boutilier (Google Research); Chih-Wei Hsu (Google Research); Branislav Kveton (Google Research); Martin Mladenov (Google Research); Csaba Szepesvári (DeepMind / University of Alberta); Manzil Zaheer (Google Research) |
| Pseudocode | Yes | Algorithm 1 (Gradient-based optimization of bandit policies): 1: Inputs: policy parameters w₀ ∈ W, number of iterations L, learning rate α, and batch size m. 2: w ← w₀. 3: for ℓ = 1, …, L do 4: for j = 1, …, m do 5: sample P_j ∼ P; sample Y_j ∼ P_j; apply policy π_w to Y_j to get I_j. 6: Let ĝ(n; π_w) be an estimate of ∇_w r(n; π_w) from (Y_j)_{j=1}^m and (I_j)_{j=1}^m. 7: w ← w + α ĝ(n; π_w). 8: Output: learned policy parameters w. |
| Open Source Code | No | The paper mentions implementing experiments in TensorFlow and PyTorch, which are third-party tools, but does not state that the authors' own code for the described methodology is open-source or provide a link. |
| Open Datasets | No | The paper describes how problem instances and rewards are sampled or drawn from a prior distribution (e.g., 'prior distribution P is over two problem instances, µ = (0.6, 0.4) and µ = (0.4, 0.6), both with probability 0.5'), but it does not provide concrete access information (like a URL, DOI, or specific citation for a downloadable dataset) for a publicly available or open dataset used for training or general experimentation. |
| Dataset Splits | No | The paper describes generating 'sampled problem instances' for optimization and for estimating regret, but it does not specify explicit training/validation/test dataset splits (e.g., exact percentages or sample counts) for a pre-defined dataset. |
| Hardware Specification | Yes | Our experiments are implemented in TensorFlow and PyTorch, on 112 cores and with 392 GB RAM. |
| Software Dependencies | No | The paper mentions 'TensorFlow and PyTorch' but does not specify their version numbers. |
| Experiment Setup | Yes | The policies are optimized by GradBand with w₀ = 1, L = 100 iterations, learning rate α = cL^(−1/2), and batch size m = 1000. |
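
The gradient-based optimization loop quoted in the Pseudocode row can be sketched as follows. This is a minimal illustrative sketch, not the authors' released code: the two-instance Bernoulli prior (means (0.6, 0.4) and (0.4, 0.6), each with probability 0.5) is taken from the Open Datasets row, while the stateless softmax policy and the REINFORCE-style score-function gradient estimator are assumed stand-ins for components the paper treats more generally.

```python
import numpy as np

def softmax(w):
    """Softmax policy: probability of pulling each arm given logits w."""
    z = np.exp(w - w.max())
    return z / z.sum()

def grad_band(w0, L=100, alpha=0.1, m=1000, n=20, seed=0):
    """Sketch of Algorithm 1 (GradBand): gradient ascent on expected
    n-round reward of a parameterized bandit policy pi_w.

    The prior P is over two Bernoulli instances, mu = (0.6, 0.4) and
    mu = (0.4, 0.6), each with probability 0.5 (as in the paper's
    example). The softmax policy and REINFORCE gradient estimate are
    illustrative assumptions, not the paper's exact construction.
    """
    rng = np.random.default_rng(seed)
    instances = [np.array([0.6, 0.4]), np.array([0.4, 0.6])]
    w = np.array(w0, dtype=float)
    for _ in range(L):                            # L optimization iterations
        g = np.zeros_like(w)
        for _ in range(m):                        # batch of m sampled problems
            mu = instances[rng.integers(2)]       # P_j ~ P
            p = softmax(w)
            score = np.zeros_like(w)
            total_reward = 0.0
            for _ in range(n):                    # run pi_w for n rounds
                a = rng.choice(len(w), p=p)       # pulled arm I_t ~ pi_w
                y = float(rng.random() < mu[a])   # reward Y_t ~ Bernoulli(mu[a])
                total_reward += y
                score += np.eye(len(w))[a] - p    # grad of log softmax prob.
            g += total_reward * score / m         # score-function estimate of
                                                  # grad_w r(n; pi_w)
        w = w + alpha * g                         # gradient ascent step
    return w
```

Because the sketched policy is stateless and the prior is symmetric across the two arms, no fixed arm preference is optimal here; the sketch only illustrates the control flow of Algorithm 1 (sample instances, roll out the policy, form a gradient estimate, step), not the learned-exploration behavior the paper studies.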