Multi-Task Off-Policy Learning from Bandit Feedback

Authors: Joey Hong, Branislav Kveton, Manzil Zaheer, Sumeet Katariya, Mohammad Ghavamzadeh

ICML 2023

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate the benefits of using the hierarchy over solving each task independently. (Abstract) ... In this section, we empirically compare HierOPO to baselines OracleOPO and FlatOPO (Section 3.4). (Section 7) ... In Figure 2, we show the mean and standard error of the suboptimality of each algorithm averaged over 30 random runs. (Section 7.1) |
| Researcher Affiliation | Collaboration | (1) University of California, Berkeley; (2) Amazon; (3) DeepMind; (4) Google Research. Correspondence to: Joey Hong <joey_hong@berkeley.edu>. |
| Pseudocode | Yes | Algorithm 1 HierOPO: Hierarchical off-policy optimization. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing the source code for the described methodology, nor does it provide a link to a code repository. |
| Open Datasets | Yes | The problem is simulated using the MovieLens 1M dataset (Lam & Herlocker, 2016)... (Section 7.2) ... We consider using Omniglot (Lake et al., 2015), which is a dataset of 1623 handwritten characters from 50 different alphabets and contains 20 examples per character. (Appendix D) |
| Dataset Splits | No | The paper does not provide specific percentages, sample counts, or a clear predefined split strategy (e.g., an 80/10/10 split or k-fold cross-validation) for partitioning the data into training, validation, and test sets. It mentions using 'another logged dataset of size 10 000' for evaluation and reserving '20 alphabets' for evaluation in the Omniglot experiment, but lacks the detailed split information required for reproduction. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running its experiments (e.g., exact GPU/CPU models, memory, or cloud instance types). |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies or libraries used in the implementation of the experiments. |
| Experiment Setup | Yes | We set α = 0.1, which led to good performance in our initial experiments. (Section 7) ... Our first experiment is with a synthetic multi-task bandit, with d = 5 features and K = 10 actions. ... The hyper-prior is N(0_d, Σ_q), where Σ_q = σ_q² I_d is its covariance. The task covariance is Σ_0 = σ_0² I_d. We experiment with σ_q ∈ {0.5, 1} and σ_0 = 0.5. (Section 7.1) |
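
To make the quoted experiment setup concrete, here is a minimal, self-contained sketch of the synthetic multi-task Gaussian bandit from Section 7.1, together with a hierarchical-versus-flat posterior comparison in the spirit of Algorithm 1 (HierOPO). Only d = 5, K = 10, σ_q, and σ_0 are taken from the paper; the reward-noise level, number of tasks, logged rounds per task, uniform logging policy, random action features, greedy (non-pessimistic) action selection, and the textbook closed-form Gaussian updates are assumptions for illustration, not the authors' implementation (none is released, per the table above).

```python
import numpy as np

rng = np.random.default_rng(0)

# Quoted in the table above (Section 7.1): d = 5 features, K = 10 actions,
# hyper-prior N(0_d, σ_q² I_d) with σ_q ∈ {0.5, 1}, task covariance σ_0² I_d
# with σ_0 = 0.5.
d, K = 5, 10
sigma_q, sigma_0 = 0.5, 0.5
Sigma_q = sigma_q ** 2 * np.eye(d)   # hyper-prior covariance
Sigma_0 = sigma_0 ** 2 * np.eye(d)   # task covariance

# ASSUMPTIONS (not quoted in the paper excerpts): reward-noise std, number of
# tasks, and logged rounds per task.
sigma, num_tasks, n_logged = 0.5, 20, 50


def simulate():
    """Sample the hierarchy and a uniformly logged dataset for every task."""
    X = rng.normal(size=(K, d)) / np.sqrt(d)                 # action features
    mu_star = rng.multivariate_normal(np.zeros(d), Sigma_q)  # hyper-parameter
    thetas = rng.multivariate_normal(mu_star, Sigma_0, size=num_tasks)
    logs = []
    for s in range(num_tasks):
        acts = rng.integers(K, size=n_logged)                # uniform logging policy
        rewards = X[acts] @ thetas[s] + sigma * rng.normal(size=n_logged)
        logs.append((X[acts], rewards))
    return X, thetas, logs


def task_stats(logs):
    """Noise-scaled Gram matrix G_s and response vector B_s for each task."""
    return [(feats.T @ feats / sigma ** 2, feats.T @ rew / sigma ** 2)
            for feats, rew in logs]


def flat_estimates(stats):
    """Solve each task independently with the marginal prior N(0, Σ_0 + Σ_q)."""
    prior_prec = np.linalg.inv(Sigma_0 + Sigma_q)
    return [np.linalg.solve(prior_prec + G, B) for G, B in stats]


def hier_estimates(stats):
    """Gaussian hierarchy: hyper-posterior first, then per-task posteriors.

    These are standard closed-form updates for a Gaussian hierarchy; the
    paper's Algorithm 1 may differ in details (e.g., pessimistic estimates).
    """
    hyper_prec = np.linalg.inv(Sigma_q)
    hyper_vec = np.zeros(d)
    for G, B in stats:
        # M @ G equals (Σ_0 + G^{-1})^{-1} without requiring G to be invertible.
        M = np.linalg.inv(G @ Sigma_0 + np.eye(d))
        hyper_prec += M @ G
        hyper_vec += M @ B
    Sigma_bar = np.linalg.inv(hyper_prec)   # hyper-posterior covariance
    mu_bar = Sigma_bar @ hyper_vec          # hyper-posterior mean
    task_prior_prec = np.linalg.inv(Sigma_0 + Sigma_bar)
    return [np.linalg.solve(task_prior_prec + G, task_prior_prec @ mu_bar + B)
            for G, B in stats]


def mean_suboptimality(X, thetas, estimates):
    """Average gap between the best action and the greedily chosen one."""
    gaps = [(X @ theta).max() - (X @ theta)[np.argmax(X @ est)]
            for theta, est in zip(thetas, estimates)]
    return float(np.mean(gaps))


X, thetas, logs = simulate()
stats = task_stats(logs)
print("flat suboptimality:", mean_suboptimality(X, thetas, flat_estimates(stats)))
print("hier suboptimality:", mean_suboptimality(X, thetas, hier_estimates(stats)))
```

On typical draws with scarce per-task data, the hierarchical estimate should incur lower suboptimality than the flat one, which is the qualitative HierOPO-versus-FlatOPO comparison the paper reports in Figure 2; because several quantities above are assumed, the numbers will not match the paper's.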