Modeling and Optimization Trade-off in Meta-learning
Authors: Katelyn Gao, Ozan Sener
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We also empirically study this trade-off for meta-reinforcement learning benchmarks. We empirically study this trade-off in meta-reinforcement learning (Section 2). We compare DRS and MAML on a wide range of meta-RL benchmarks used in the literature. |
| Researcher Affiliation | Industry | We would like to thank the other members of the Intelligent Systems Lab at Intel and Amir Zamir for feedback on the first draft of this paper, and Sona Jeswani for assistance in running an early version of the RL experiments. |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement about open-sourcing its code or a link to a code repository for the methodology described. |
| Open Datasets | Yes | We compare DRS and MAML on a wide range of meta-RL benchmarks used in the literature. We consider... four manipulation environments from Meta-World [33]... All environments utilize the MuJoCo simulator [27]. |
| Dataset Splits | Yes | Meta-learning needs samples for both the inner optimization and outer meta problem; usually half of the samples are used for each. Suppose that during each iteration of meta-training MAML, M tasks are sampled each with 2N calls to their gradient oracles, half of which are used in the inner optimization... (a minimal sketch of this split follows the table) |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions software components like the 'MuJoCo simulator [27]' and algorithms like PPO and TRPO, but it does not specify version numbers for any software dependencies. |
| Experiment Setup | Yes | During evaluation, we first sample a set of meta-test tasks. For each meta-test task, starting at a trained policy, we repeat the following five times: generate a small number of episodes from the current policy and update the policy using the policy gradient algorithm from the inner optimization of ProMP and TRPO-MAML. We compute the average episodic reward after t updates, for t = 0, 1, ..., 5. (the second sketch after the table illustrates this protocol) |
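
The "Dataset Splits" row quotes the paper's convention of giving each sampled task 2N gradient-oracle calls per meta-training iteration, half for the inner adaptation step and half for the outer meta-update. The following is a minimal sketch of that split, assuming toy quadratic tasks, a noisy gradient oracle, a first-order MAML update, and arbitrary step sizes (`alpha`, `beta`); none of these specifics come from the paper.

```python
# Minimal sketch of the 2N-sample split: during each meta-training iteration of
# MAML, every sampled task gets 2N calls to its (stochastic) gradient oracle,
# half used for the inner adaptation step and half for the outer meta-update.
# The quadratic toy tasks, oracle noise, and step sizes are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
M, N = 4, 10             # tasks per iteration, oracle calls per half
alpha, beta = 0.1, 0.01  # inner / outer step sizes (assumed values)
theta = np.zeros(2)      # meta-parameters

def grad_oracle(theta, task_center, n_calls):
    """Noisy gradient of a toy quadratic loss ||theta - task_center||^2 / 2,
    averaged over n_calls oracle queries."""
    grads = (theta - task_center) + rng.normal(scale=0.5, size=(n_calls, 2))
    return grads.mean(axis=0)

for it in range(100):
    meta_grad = np.zeros_like(theta)
    task_centers = rng.normal(size=(M, 2))       # sample M tasks
    for c in task_centers:
        g_inner = grad_oracle(theta, c, N)       # first N oracle calls: inner step
        theta_task = theta - alpha * g_inner     # one-step MAML adaptation
        g_outer = grad_oracle(theta_task, c, N)  # remaining N calls: outer gradient
        meta_grad += g_outer / M                 # first-order MAML: Hessian term ignored
    theta -= beta * meta_grad
```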
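
The "Experiment Setup" row describes the meta-test protocol: starting from the trained policy, the authors alternate episode collection with one policy-gradient update five times and record the average episodic reward after t = 0, 1, ..., 5 updates. Below is a minimal sketch of that evaluation loop; the task sampler, rollout, and update functions are placeholders standing in for the ProMP / TRPO-MAML inner-loop machinery, not the paper's code.

```python
# Minimal sketch of the meta-test evaluation protocol: for each meta-test task,
# start from the trained policy, alternate collecting a few episodes with one
# policy-gradient update, and record the average episodic reward after
# t = 0, 1, ..., 5 updates. All functions below are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(0)

def sample_meta_test_tasks(n_tasks):
    return rng.normal(size=(n_tasks,))           # placeholder task parameters

def collect_episodes(policy, task, n_episodes=2):
    # Placeholder rollout: reward improves as the policy parameter nears the task.
    return -np.abs(policy - task) + rng.normal(scale=0.1, size=n_episodes)

def policy_gradient_update(policy, task, rewards, lr=0.2):
    # Placeholder update standing in for the inner-loop policy-gradient step.
    return policy + lr * np.sign(task - policy)

trained_policy = 0.0                             # stand-in for the meta-trained policy
results = np.zeros((8, 6))                       # tasks x (t = 0..5)
for i, task in enumerate(sample_meta_test_tasks(8)):
    policy = trained_policy
    for t in range(6):
        rewards = collect_episodes(policy, task)
        results[i, t] = rewards.mean()           # average episodic reward after t updates
        if t < 5:
            policy = policy_gradient_update(policy, task, rewards)

print(results.mean(axis=0))                      # mean reward vs. number of updates
```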