Meta-Learning for Simple Regret Minimization

Authors: Javad Azizi, Branislav Kveton, Mohammad Ghavamzadeh, Sumeet Katariya

AAAI 2023

Reproducibility assessment: each entry lists the variable, the assessed result, and the LLM response supporting it.
Research Type: Experimental. "Finally, we complement our theory with experiments (Section 7), which show the benefits of meta-learning and confirm that the Bayesian approaches are superior whenever implementable. In this section, we empirically compare our algorithms by their average meta simple regret over 100 simulation runs." A hedged sketch of this evaluation metric follows the Experiment Setup entry below.
Researcher Affiliation: Collaboration. University of Southern California, Amazon, and Google Research (azizim@usc.edu, bkveton@amazon.com, ghavamza@google.com, katsumee@amazon.com).
Pseudocode: Yes. Algorithm 1 (Bayesian Meta-SRM, B-meta SRM) and Algorithm 2 (Frequentist Meta-SRM, f-meta SRM).
Open Source Code: No. The paper contains no explicit statement about releasing its code and provides no link to a code repository for its methodology.
Open Datasets: No. The paper mentions simulations and refers to a 'real-world dataset in Appendix F.1', but gives no concrete access information (link, DOI, or a citation sufficient for public access) for any dataset used in the experiments.
Dataset Splits: No. The evaluation is based on '100 simulation runs' over 'm bandit problems with arm set A that appear sequentially'; no train/validation/test splits are specified.
Hardware Specification: No. The paper does not state the hardware (e.g., CPU or GPU models, or cloud instance specifications) used to run its experiments.
Software Dependencies: No. The paper does not list ancillary software, such as library names with version numbers, needed to replicate the experiments.
Experiment Setup: Yes. "All experiments have m = 200 tasks with n = 100 rounds in each. Specifically, we assume that A = [K] are K arms with a Gaussian reward distribution ν_s(a; µ_s) = N(µ_s(a), 10^2), so σ = 10. The mean reward is sampled as µ_s ~ P_θ = N(θ, 0.12^2 I_K), so Σ_0 = 0.12^2 I_K. The prior parameter is sampled from the meta-prior as θ ~ Q = N(0_K, I_K), i.e., Σ_q = I_K. We tune m_0 and report the point-wise best performance for each task." A simulation sketch based on this setup appears below.
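
For reference, the quoted setup translates into the minimal simulation sketch below. It is an assumption-laden illustration, not the authors' code: the variable names, the random seed, and the choice K = 10 are hypothetical, while the values of m, n, σ, Σ_0, and Σ_q follow the quoted text.

```python
import numpy as np

# Hedged sketch of the hierarchical sampling described in the quoted setup.
K = 10            # number of arms (the quoted text does not fix K; assumed here)
m = 200           # tasks per simulation run
n = 100           # rounds per task
sigma = 10.0      # reward noise std: rewards ~ N(mu_s(a), 10^2)

rng = np.random.default_rng(0)

# Meta-prior: theta ~ Q = N(0_K, I_K), i.e., Sigma_q = I_K.
theta = rng.normal(loc=0.0, scale=1.0, size=K)

# Task priors: mu_s ~ P_theta = N(theta, 0.12^2 I_K), i.e., Sigma_0 = 0.12^2 I_K.
task_means = rng.normal(loc=theta, scale=0.12, size=(m, K))

def pull(task: int, arm: int) -> float:
    """Sample one reward ~ N(mu_s(arm), sigma^2) for the given task."""
    return float(rng.normal(task_means[task, arm], sigma))
```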
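
The experiments are reported as average meta simple regret over 100 simulation runs. Assuming the standard definition of simple regret (the gap between the best mean reward and the mean reward of the arm recommended after n rounds of exploration), the sketch below reuses the environment above to show how such a metric could be computed; the uniform-exploration agent is only a placeholder, not B-meta SRM or f-meta SRM.

```python
def recommend_uniform(task: int) -> int:
    """Placeholder agent: pull arms round-robin for n rounds, then recommend
    the arm with the highest empirical mean. Not one of the paper's algorithms."""
    totals = np.zeros(K)
    counts = np.zeros(K)
    for t in range(n):
        a = t % K
        totals[a] += pull(task, a)
        counts[a] += 1
    return int(np.argmax(totals / np.maximum(counts, 1)))

def meta_simple_regret() -> float:
    """Average simple regret across the m tasks of one simulation run:
    regret_s = max_a mu_s(a) - mu_s(recommended arm)."""
    regrets = [
        task_means[s].max() - task_means[s, recommend_uniform(s)]
        for s in range(m)
    ]
    return float(np.mean(regrets))

# The reported metric averages this quantity over 100 independent simulation runs.
```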