Multi-Task Off-Policy Learning from Bandit Feedback

Authors: Joey Hong, Branislav Kveton, Manzil Zaheer, Sumeet Katariya, Mohammad Ghavamzadeh

ICML 2023

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate the benefits of using the hierarchy over solving each task independently. (Abstract) ... In this section, we empirically compare HierOPO to baselines OracleOPO and FlatOPO (Section 3.4). (Section 7) ... In Figure 2, we show the mean and standard error of the suboptimality of each algorithm averaged over 30 random runs. (Section 7.1) |
| Researcher Affiliation | Collaboration | (1) University of California, Berkeley; (2) Amazon; (3) DeepMind; (4) Google Research. Correspondence to: Joey Hong <joey_hong@berkeley.edu>. |
| Pseudocode | Yes | Algorithm 1 HierOPO: Hierarchical off-policy optimization. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing the source code for the described methodology, nor does it provide a link to a code repository. |
| Open Datasets | Yes | The problem is simulated using the MovieLens 1M dataset (Lam & Herlocker, 2016)... (Section 7.2) ... We consider using Omniglot (Lake et al., 2015), which is a dataset of 1623 handwritten characters from 50 different alphabets and contains 20 examples per character. (Appendix D) |
| Dataset Splits | No | The paper does not provide specific percentages, sample counts, or a clear predefined split strategy (e.g., an 80/10/10 split or k-fold cross-validation) for partitioning the data into training, validation, and test sets. It mentions using 'another logged dataset of size 10 000' for evaluation and reserving '20 alphabets' for evaluation in the Omniglot experiment, but lacks the detailed split information required for reproduction. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running its experiments (e.g., exact GPU/CPU models, memory, or cloud instance types). |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies or libraries used in the implementation of the experiments. |
| Experiment Setup | Yes | We set α = 0.1, which led to good performance in our initial experiments. (Section 7) ... Our first experiment is with a synthetic multi-task bandit, with d = 5 features and K = 10 actions. ... The hyper-prior is N(0_d, Σ_q), where Σ_q = σ_q² I_d is its covariance. The task covariance is Σ_0 = σ_0² I_d. We experiment with σ_q ∈ {0.5, 1} and σ_0 = 0.5. (Section 7.1) |
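
To make the quoted experiment setup concrete, here is a minimal, self-contained sketch of the synthetic multi-task Gaussian bandit from Section 7.1, together with a hierarchical-versus-flat posterior comparison in the spirit of Algorithm 1 (HierOPO). Only d = 5, K = 10, σ_q, and σ_0 are taken from the paper; the reward-noise level, number of tasks, logged rounds per task, uniform logging policy, random action features, greedy (non-pessimistic) action selection, and the textbook closed-form Gaussian updates are assumptions for illustration, not the authors' implementation (none is released, per the table above).

```python
import numpy as np

rng = np.random.default_rng(0)

# Quoted in the table above (Section 7.1): d = 5 features, K = 10 actions,
# hyper-prior N(0_d, σ_q² I_d) with σ_q ∈ {0.5, 1}, task covariance σ_0² I_d
# with σ_0 = 0.5.
d, K = 5, 10
sigma_q, sigma_0 = 0.5, 0.5
Sigma_q = sigma_q ** 2 * np.eye(d)   # hyper-prior covariance
Sigma_0 = sigma_0 ** 2 * np.eye(d)   # task covariance

# ASSUMPTIONS (not quoted in the paper excerpts): reward-noise std, number of
# tasks, and logged rounds per task.
sigma, num_tasks, n_logged = 0.5, 20, 50


def simulate():
    """Sample the hierarchy and a uniformly logged dataset for every task."""
    X = rng.normal(size=(K, d)) / np.sqrt(d)                 # action features
    mu_star = rng.multivariate_normal(np.zeros(d), Sigma_q)  # hyper-parameter
    thetas = rng.multivariate_normal(mu_star, Sigma_0, size=num_tasks)
    logs = []
    for s in range(num_tasks):
        acts = rng.integers(K, size=n_logged)                # uniform logging policy
        rewards = X[acts] @ thetas[s] + sigma * rng.normal(size=n_logged)
        logs.append((X[acts], rewards))
    return X, thetas, logs


def task_stats(logs):
    """Noise-scaled Gram matrix G_s and response vector B_s for each task."""
    return [(feats.T @ feats / sigma ** 2, feats.T @ rew / sigma ** 2)
            for feats, rew in logs]


def flat_estimates(stats):
    """Solve each task independently with the marginal prior N(0, Σ_0 + Σ_q)."""
    prior_prec = np.linalg.inv(Sigma_0 + Sigma_q)
    return [np.linalg.solve(prior_prec + G, B) for G, B in stats]


def hier_estimates(stats):
    """Gaussian hierarchy: hyper-posterior first, then per-task posteriors.

    These are standard closed-form updates for a Gaussian hierarchy; the
    paper's Algorithm 1 may differ in details (e.g., pessimistic estimates).
    """
    hyper_prec = np.linalg.inv(Sigma_q)
    hyper_vec = np.zeros(d)
    for G, B in stats:
        # M @ G equals (Σ_0 + G^{-1})^{-1} without requiring G to be invertible.
        M = np.linalg.inv(G @ Sigma_0 + np.eye(d))
        hyper_prec += M @ G
        hyper_vec += M @ B
    Sigma_bar = np.linalg.inv(hyper_prec)   # hyper-posterior covariance
    mu_bar = Sigma_bar @ hyper_vec          # hyper-posterior mean
    task_prior_prec = np.linalg.inv(Sigma_0 + Sigma_bar)
    return [np.linalg.solve(task_prior_prec + G, task_prior_prec @ mu_bar + B)
            for G, B in stats]


def mean_suboptimality(X, thetas, estimates):
    """Average gap between the best action and the greedily chosen one."""
    gaps = [(X @ theta).max() - (X @ theta)[np.argmax(X @ est)]
            for theta, est in zip(thetas, estimates)]
    return float(np.mean(gaps))


X, thetas, logs = simulate()
stats = task_stats(logs)
print("flat suboptimality:", mean_suboptimality(X, thetas, flat_estimates(stats)))
print("hier suboptimality:", mean_suboptimality(X, thetas, hier_estimates(stats)))
```

On typical draws with scarce per-task data, the hierarchical estimate should incur lower suboptimality than the flat one, which is the qualitative HierOPO-versus-FlatOPO comparison the paper reports in Figure 2; because several quantities above are assumed, the numbers will not match the paper's.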