Modeling and Optimization Trade-off in Meta-learning
Authors: Katelyn Gao, Ozan Sener
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We also empirically study this trade-off for meta-reinforcement learning benchmarks. We empirically study this trade-off in meta-reinforcement learning (Section 2). We compare DRS and MAML on a wide range of meta-RL benchmarks used in the literature. |
| Researcher Affiliation | Industry | We would like to thank the other members of the Intelligent Systems Lab at Intel and Amir Zamir for feedback on the first draft of this paper, and Sona Jeswani for assistance in running an early version of the RL experiments. |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement about open-sourcing its code or a link to a code repository for the methodology described. |
| Open Datasets | Yes | We compare DRS and MAML on a wide range of meta-RL benchmarks used in the literature. We consider... four manipulation environments from Meta-World [33]... All environments utilize the MuJoCo simulator [27]. |
| Dataset Splits | Yes | Meta-learning needs samples for both the inner optimization and outer meta problem; usually half of the samples are used for each. Suppose that during each iteration of meta-training MAML, M tasks are sampled each with 2N calls to their gradient oracles, half of which are used in the inner optimization... (a minimal sketch of this split follows the table) |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions software components like the 'MuJoCo simulator [27]' and algorithms like PPO and TRPO, but it does not specify version numbers for any software dependencies. |
| Experiment Setup | Yes | During evaluation, we first sample a set of meta-test tasks. For each meta-test task, starting at a trained policy, we repeat the following five times: generate a small number of episodes from the current policy and update the policy using the policy gradient algorithm from the inner optimization of ProMP and TRPO-MAML. We compute the average episodic reward after t updates, for t = 0, 1, ..., 5. (the second sketch after the table illustrates this protocol) |
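
The "Dataset Splits" row quotes the paper's convention of giving each sampled task 2N gradient-oracle calls per meta-training iteration, half for the inner adaptation step and half for the outer meta-update. The following is a minimal sketch of that split, assuming toy quadratic tasks, a noisy gradient oracle, a first-order MAML update, and arbitrary step sizes (`alpha`, `beta`); none of these specifics come from the paper.

```python
# Minimal sketch of the 2N-sample split: during each meta-training iteration of
# MAML, every sampled task gets 2N calls to its (stochastic) gradient oracle,
# half used for the inner adaptation step and half for the outer meta-update.
# The quadratic toy tasks, oracle noise, and step sizes are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
M, N = 4, 10             # tasks per iteration, oracle calls per half
alpha, beta = 0.1, 0.01  # inner / outer step sizes (assumed values)
theta = np.zeros(2)      # meta-parameters

def grad_oracle(theta, task_center, n_calls):
    """Noisy gradient of a toy quadratic loss ||theta - task_center||^2 / 2,
    averaged over n_calls oracle queries."""
    grads = (theta - task_center) + rng.normal(scale=0.5, size=(n_calls, 2))
    return grads.mean(axis=0)

for it in range(100):
    meta_grad = np.zeros_like(theta)
    task_centers = rng.normal(size=(M, 2))       # sample M tasks
    for c in task_centers:
        g_inner = grad_oracle(theta, c, N)       # first N oracle calls: inner step
        theta_task = theta - alpha * g_inner     # one-step MAML adaptation
        g_outer = grad_oracle(theta_task, c, N)  # remaining N calls: outer gradient
        meta_grad += g_outer / M                 # first-order MAML: Hessian term ignored
    theta -= beta * meta_grad
```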
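
The "Experiment Setup" row describes the meta-test protocol: starting from the trained policy, the authors alternate episode collection with one policy-gradient update five times and record the average episodic reward after t = 0, 1, ..., 5 updates. Below is a minimal sketch of that evaluation loop; the task sampler, rollout, and update functions are placeholders standing in for the ProMP / TRPO-MAML inner-loop machinery, not the paper's code.

```python
# Minimal sketch of the meta-test evaluation protocol: for each meta-test task,
# start from the trained policy, alternate collecting a few episodes with one
# policy-gradient update, and record the average episodic reward after
# t = 0, 1, ..., 5 updates. All functions below are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(0)

def sample_meta_test_tasks(n_tasks):
    return rng.normal(size=(n_tasks,))           # placeholder task parameters

def collect_episodes(policy, task, n_episodes=2):
    # Placeholder rollout: reward improves as the policy parameter nears the task.
    return -np.abs(policy - task) + rng.normal(scale=0.1, size=n_episodes)

def policy_gradient_update(policy, task, rewards, lr=0.2):
    # Placeholder update standing in for the inner-loop policy-gradient step.
    return policy + lr * np.sign(task - policy)

trained_policy = 0.0                             # stand-in for the meta-trained policy
results = np.zeros((8, 6))                       # tasks x (t = 0..5)
for i, task in enumerate(sample_meta_test_tasks(8)):
    policy = trained_policy
    for t in range(6):
        rewards = collect_episodes(policy, task)
        results[i, t] = rewards.mean()           # average episodic reward after t updates
        if t < 5:
            policy = policy_gradient_update(policy, task, rewards)

print(results.mean(axis=0))                      # mean reward vs. number of updates
```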