Unifying Gradient Estimators for Meta-Reinforcement Learning via Off-Policy Evaluation

Authors: Yunhao Tang, Tadashi Kozuno, Mark Rowland, Remi Munos, Michal Valko

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We now carry out several empirical studies to complement the framework developed above. In Section 5.1, we use a tabular example to investigate the bias and variance trade-offs of various estimates, to assess the validity of our theoretical insights. In Section 5.2.1 and Section 5.2.2, we apply the new second-order estimate in high-dimensional meta-RL experiments, to assess the potential performance gains in a more practical setup.
Researcher Affiliation | Collaboration | Yunhao Tang* (Columbia University, yt2541@columbia.edu); Tadashi Kozuno* (University of Alberta, tadashi.kozuno@gmail.com); Mark Rowland (DeepMind London, markrowland@deepmind.com); Rémi Munos (DeepMind Paris, munos@deepmind.com); Michal Valko (DeepMind Paris, valkom@deepmind.com)
Pseudocode | Yes | Algorithm 1: Pseudocode for computing meta-gradients for the MAML objective ... Algorithm 2: Example: an off-policy evaluation subroutine for computing the DR estimate (an illustrative sketch of a DR estimate is given after the table below).
Open Source Code | Yes | "We open source the code to reproduce our results." https://github.com/robintyh1/neurips2021-meta-gradient-offpolicy-evaluation
Open Datasets | No | The paper describes using simulated environments (e.g., random MDPs, a 2-D navigation task, MuJoCo locomotion tasks) and sampling trajectories from them, rather than utilizing pre-existing, publicly accessible datasets with direct download links or formal citations.
Dataset Splits | No | The paper describes how trajectories are sampled during experiments ('B = 20 trajectories sampled from the environment'), but it does not specify explicit training, validation, and test dataset splits with percentages or counts for reproducibility.
Hardware Specification | No | The authors mention using 'a cluster' from OIST but do not provide specific details about the hardware components such as GPU models, CPU types, or memory specifications.
Software Dependencies | No | The paper mentions software like 'Tensorflow or PyTorch' and 'MuJoCo' but does not provide specific version numbers for any of these software dependencies.
Experiment Setup | Yes | Experiment setup. We adapt the open source code base by [24] and adopt exactly the same experimental setup as [24]. At each iteration, the agent samples n = 40 task variables. For each task, the agent carries out K = 1 adaptation computed based on B = 20 trajectories sampled from the environment, each of length T = 100.
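
To make the quoted hyperparameters (n, K, B, T) easy to scan, here is a minimal configuration sketch. The field names are hypothetical placeholders and are not taken from the authors' released code.

```python
# Minimal sketch of the hyperparameters quoted in the Experiment Setup row.
# Field names are hypothetical placeholders, not taken from the authors' code base.
from dataclasses import dataclass

@dataclass
class MetaRLSetup:
    tasks_per_iteration: int = 40    # n: task variables sampled at each iteration
    adaptation_steps: int = 1        # K: inner-loop adaptation steps per task
    trajectories_per_step: int = 20  # B: trajectories sampled per adaptation step
    horizon: int = 100               # T: length of each trajectory

setup = MetaRLSetup()
print(setup)
```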
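
The Pseudocode row above refers to Algorithm 2, an off-policy evaluation subroutine for computing the DR estimate. The following is a generic sketch of a doubly robust (DR) off-policy value estimate for a single trajectory, included only as an illustration of the general technique rather than a reproduction of the paper's algorithm; the function name, signature, and exact recursion used here are assumptions.

```python
# Illustrative only: a standard doubly robust (DR) off-policy estimate of the
# target policy's value from one behaviour-policy trajectory. This is NOT the
# paper's Algorithm 2; names and the exact recursion are assumptions.
import numpy as np

def doubly_robust_estimate(rewards, rho, q_hat, v_hat, gamma=0.99):
    """Backward recursion: V_DR(t) = V_hat(s_t) + rho_t * (r_t + gamma * V_DR(t+1) - Q_hat(s_t, a_t)).

    rewards: (T,) rewards r_t along the behaviour trajectory
    rho:     (T,) per-step importance ratios pi(a_t | s_t) / mu(a_t | s_t)
    q_hat:   (T,) critic estimates Q_hat(s_t, a_t)
    v_hat:   (T,) critic estimates V_hat(s_t) under the target policy
    """
    v_dr = 0.0  # value estimate beyond the final step
    for t in reversed(range(len(rewards))):
        v_dr = v_hat[t] + rho[t] * (rewards[t] + gamma * v_dr - q_hat[t])
    return v_dr

# Hypothetical usage with placeholder inputs.
T = 100
estimate = doubly_robust_estimate(
    rewards=np.random.randn(T),
    rho=np.ones(T),    # on-policy case: all importance ratios equal 1
    q_hat=np.zeros(T),
    v_hat=np.zeros(T),
)
```

With on-policy data (all ratios equal to 1) and zero critic estimates, this sketch reduces to the discounted Monte Carlo return, which is the usual sanity check for a DR estimator.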