Multi-Task Off-Policy Learning from Bandit Feedback
Authors: Joey Hong, Branislav Kveton, Manzil Zaheer, Sumeet Katariya, Mohammad Ghavamzadeh
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate the benefits of using the hierarchy over solving each task independently. (Abstract) ... In this section, we empirically compare HierOPO to baselines OracleOPO and FlatOPO (Section 3.4). (Section 7) ... In Figure 2, we show the mean and standard error of the suboptimality of each algorithm averaged over 30 random runs. (Section 7.1) |
| Researcher Affiliation | Collaboration | 1 University of California, Berkeley; 2 Amazon; 3 DeepMind; 4 Google Research. Correspondence to: Joey Hong <joey_hong@berkeley.edu>. |
| Pseudocode | Yes | Algorithm 1 HierOPO: Hierarchical off-policy optimization. (An illustrative sketch follows the table.) |
| Open Source Code | No | The paper does not contain any explicit statement about releasing the source code for the described methodology, nor does it provide a link to a code repository. |
| Open Datasets | Yes | The problem is simulated using the MovieLens 1M dataset (Lam & Herlocker, 2016)... (Section 7.2) ... We consider using Omniglot (Lake et al., 2015), which is a dataset of 1623 handwritten characters from 50 different alphabets and contains 20 examples per character. (Appendix D) |
| Dataset Splits | No | The paper does not provide specific percentages, sample counts, or clear predefined split strategies (e.g., 80/10/10 split or k-fold cross-validation) for dataset partitioning into training, validation, and test sets. It mentions using 'another logged dataset of size 10 000' for evaluation and reserving '20 alphabets' for evaluation in the Omniglot experiment, but lacks the detailed split information required for reproduction. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running its experiments (e.g., exact GPU/CPU models, memory, or cloud instance types). |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies or libraries used in the implementation of the experiments. |
| Experiment Setup | Yes | We set α = 0.1, which led to good performance in our initial experiments. (Section 7) ... Our first experiment is with a synthetic multi-task bandit, with d = 5 features and K = 10 actions. ... The hyper-prior is N(0_d, Σ_q), where Σ_q = σ_q²I_d is its covariance. The task covariance is Σ_0 = σ_0²I_d. We experiment with σ_q ∈ {0.5, 1} and σ_0 = 0.5. (Section 7.1) (A hedged reconstruction of this setup follows the table.) |
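
The pseudocode row points to Algorithm 1, HierOPO. As a reading aid, below is a minimal Python sketch of hierarchical off-policy optimization in a linear-Gaussian bandit: a Gaussian hyper-posterior over the shared parameter is fit from all tasks' logged data, each task's posterior is conditioned on it, and the deployed policy picks actions pessimistically via a lower confidence bound. The function names, the ridge term, and the exact posterior algebra are our assumptions, not the authors' released code (none is linked above).

```python
# Illustrative sketch only: Gaussian hierarchy over per-task linear reward
# models, roughly in the spirit of Algorithm 1 (HierOPO). Not the paper's code.
import numpy as np

def hier_opo(tasks, mu_q, Sigma_q, Sigma_0, noise_var=1.0, alpha=0.1):
    """tasks: list of (X, r, A) with logged features X (n x d), rewards r (n,),
    and candidate action features A (K x d) for the deployment decision.
    Returns one pessimistic (LCB-maximizing) action index per task."""
    d = mu_q.shape[0]
    # Per-task Gram matrices (with a small ridge for invertibility) and
    # reward-weighted feature sums.
    Gs = [X.T @ X / noise_var + 1e-6 * np.eye(d) for X, _, _ in tasks]
    Bs = [X.T @ r / noise_var for X, r, _ in tasks]
    # Hyper-posterior over the shared parameter mu_*: each task contributes
    # through the marginal covariance Sigma_0 + G_s^{-1} of its MLE.
    Q_inv = np.linalg.inv(Sigma_q)
    b = Q_inv @ mu_q
    for G, B in zip(Gs, Bs):
        M = np.linalg.inv(Sigma_0 + np.linalg.inv(G))
        Q_inv += M
        b += M @ np.linalg.solve(G, B)   # np.linalg.solve(G, B) is the task MLE
    Q = np.linalg.inv(Q_inv)
    mu_bar = Q @ b
    # Per-task posterior under the marginal prior N(mu_bar, Sigma_0 + Q),
    # then a pessimistic action choice via a lower confidence bound.
    choices = []
    for (X, r, A), G, B in zip(tasks, Gs, Bs):
        prior_cov = Sigma_0 + Q
        S = np.linalg.inv(np.linalg.inv(prior_cov) + G)
        theta = S @ (np.linalg.solve(prior_cov, mu_bar) + B)
        widths = np.sqrt(np.einsum("kd,dl,kl->k", A, S, A))  # per-action std
        choices.append(int(np.argmax(A @ theta - alpha * widths)))
    return choices
```

The α in the LCB mirrors the α = 0.1 quoted in the experiment-setup row; whether the paper's pessimism term takes exactly this form is an assumption here.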
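
The experiment-setup row quotes the synthetic multi-task bandit of Section 7.1: d = 5 features, K = 10 actions, hyper-prior N(0_d, σ_q²I_d), and task covariance σ_0²I_d. A hedged reconstruction is sketched below; the number of tasks m, the logged sample size n, the uniform logging policy, and the unit-variance reward noise are our guesses, since the quoted text does not pin them down.

```python
# Hedged reconstruction of the Section 7.1 synthetic setup. The values of
# m and n, the uniform logging policy, and the noise scale are assumptions.
import numpy as np

def make_tasks(m=20, d=5, K=10, n=100, sigma_q=0.5, sigma_0=0.5, seed=0):
    rng = np.random.default_rng(seed)
    mu_star = rng.normal(0.0, sigma_q, size=d)            # shared hyper-parameter
    tasks = []
    for _ in range(m):
        theta = mu_star + rng.normal(0.0, sigma_0, size=d)  # task parameter
        A = rng.normal(size=(K, d))                       # action feature vectors
        idx = rng.integers(K, size=n)                     # uniform logging policy
        X = A[idx]
        r = X @ theta + rng.normal(size=n)                # unit-variance noise
        tasks.append((X, r, A))
    return tasks, mu_star
```

For example, `tasks, _ = make_tasks(sigma_q=0.5)` followed by `hier_opo(tasks, np.zeros(5), 0.25 * np.eye(5), 0.25 * np.eye(5))` runs end to end with the sketch above; treat it as an illustration, not the paper's protocol.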