Unifying Gradient Estimators for Meta-Reinforcement Learning via Off-Policy Evaluation
Authors: Yunhao Tang, Tadashi Kozuno, Mark Rowland, Rémi Munos, Michal Valko
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We now carry out several empirical studies to complement the framework developed above. In Section 5.1, we use a tabular example to investigate the bias and variance trade-offs of various estimates, to assess the validity of our theoretical insights. In Section 5.2.1 and Section 5.2.2, we apply the new second-order estimate in high-dimensional meta-RL experiments, to assess the potential performance gains in a more practical setup. |
| Researcher Affiliation | Collaboration | Yunhao Tang* (Columbia University, yt2541@columbia.edu); Tadashi Kozuno* (University of Alberta, tadashi.kozuno@gmail.com); Mark Rowland (DeepMind London, markrowland@deepmind.com); Rémi Munos (DeepMind Paris, munos@deepmind.com); Michal Valko (DeepMind Paris, valkom@deepmind.com) |
| Pseudocode | Yes | Algorithm 1 Pseudocode for computing meta-gradients for the MAML objective ... Algorithm 2 Example: an off-policy evaluation subroutine for computing the DR estimate |
| Open Source Code | Yes | We open source the code to reproduce our results: https://github.com/robintyh1/neurips2021-meta-gradient-offpolicy-evaluation |
| Open Datasets | No | The paper describes using simulated environments (e.g., random MDPs, 2-D navigation task, MuJoCo locomotion tasks) and sampling trajectories from them, rather than utilizing pre-existing, publicly accessible datasets with direct download links or formal citations. |
| Dataset Splits | No | The paper describes how trajectories are sampled during experiments ('B = 20 trajectories sampled from the environment'), but it does not specify explicit training, validation, and test dataset splits with percentages or counts for reproducibility. |
| Hardware Specification | No | The authors mention using 'a cluster' from OIST but do not provide specific details about the hardware components such as GPU models, CPU types, or memory specifications. |
| Software Dependencies | No | The paper mentions software such as 'TensorFlow or PyTorch' and 'MuJoCo' but does not provide version numbers for any of these dependencies. |
| Experiment Setup | Yes | Experiment setup. We adapt the open source code base by [24] and adopt exactly the same experimental setup as [24]. At each iteration, the agent samples n = 40 task variables. For each task, the agent carries out K = 1 adaptation computed based on B = 20 trajectories sampled from the environment, each of length T = 100. |
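
The values quoted in the Experiment Setup row imply a per-iteration sampling budget for adaptation, which the sketch below works out. The class and method names are hypothetical and are not taken from the paper or its code base. The count covers only the adaptation trajectories described in the row; any further sampling (for example, post-adaptation rollouts) is not stated there and is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetaRLSetup:
    """Hypothetical container for the values quoted in the Experiment Setup row."""

    num_tasks: int = 40              # n: task variables sampled per iteration
    adaptation_steps: int = 1        # K: adaptation steps per task
    trajectories_per_step: int = 20  # B: trajectories used for each adaptation step
    horizon: int = 100               # T: length of each trajectory

    def adaptation_trajectories(self) -> int:
        # Trajectories collected for adaptation across all tasks in one iteration.
        return self.num_tasks * self.adaptation_steps * self.trajectories_per_step

    def adaptation_transitions(self) -> int:
        # Environment steps spent on those trajectories.
        return self.adaptation_trajectories() * self.horizon


setup = MetaRLSetup()
print(setup.adaptation_trajectories())  # 40 * 1 * 20 = 800
print(setup.adaptation_transitions())   # 800 * 100 = 80000
```

Under these assumptions, one iteration uses 800 adaptation trajectories, or 80,000 environment transitions.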