Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Unifying Gradient Estimators for Meta-Reinforcement Learning via Off-Policy Evaluation
Authors: Yunhao Tang, Tadashi Kozuno, Mark Rowland, Rémi Munos, Michal Valko
NeurIPS 2021 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We now carry out several empirical studies to complement the framework developed above. In Section 5.1, we use a tabular example to investigate the bias and variance trade-offs of various estimates, to assess the validity of our theoretical insights. In Section 5.2.1 and Section 5.2.2, we apply the new second-order estimate in high-dimensional meta-RL experiments, to assess the potential performance gains in a more practical setup. |
| Researcher Affiliation | Collaboration | Yunhao Tang* Columbia University EMAIL Tadashi Kozuno* University of Alberta EMAIL Mark Rowland DeepMind London EMAIL Rémi Munos DeepMind Paris EMAIL Michal Valko DeepMind Paris EMAIL |
| Pseudocode | Yes | Algorithm 1 Pseudocode for computing meta-gradients for the MAML objective ... Algorithm 2 Example: an off-policy evaluation subroutine for computing the DR estimate |
| Open Source Code | Yes | We open source the code to reproduce our results (footnote 1: https://github.com/robintyh1/neurips2021-meta-gradient-offpolicy-evaluation). |
| Open Datasets | No | The paper describes using simulated environments (e.g., random MDPs, 2-D navigation task, MuJoCo locomotion tasks) and sampling trajectories from them, rather than utilizing pre-existing, publicly accessible datasets with direct download links or formal citations. |
| Dataset Splits | No | The paper describes how trajectories are sampled during experiments ('B = 20 trajectories sampled from the environment'), but it does not specify explicit training, validation, and test dataset splits with percentages or counts for reproducibility. |
| Hardware Specification | No | The authors mention using 'a cluster' from OIST but do not provide specific details about the hardware components such as GPU models, CPU types, or memory specifications. |
| Software Dependencies | No | The paper mentions software such as 'Tensorflow or PyTorch' and 'MuJoCo' but does not provide version numbers for any of them. |
| Experiment Setup | Yes | Experiment setup. We adapt the open source code base by [24] and adopt exactly the same experimental setup as [24]. At each iteration, the agent samples n = 40 task variables. For each task, the agent carries out K = 1 adaptation computed based on B = 20 trajectories sampled from the environment, each of length T = 100. |
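
To make the scale of the quoted setup concrete, the following minimal sketch multiplies out the four reported quantities (n = 40, K = 1, B = 20, T = 100). The class and field names are hypothetical and do not come from the paper or its code base. The count covers only the trajectories used for adaptation; the quoted text does not say whether further trajectories are sampled for the outer-loop update, so that part is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetaRLSetup:
    """Values quoted in the Experiment Setup row; all names here are illustrative."""

    num_tasks: int = 40              # n: task variables sampled per iteration
    adaptation_steps: int = 1        # K: adaptation steps per task
    trajectories_per_task: int = 20  # B: trajectories sampled per adaptation step
    horizon: int = 100               # T: length of each trajectory

    def adaptation_transitions_per_iteration(self) -> int:
        # Counts only the transitions collected for adaptation (assumption:
        # any additional outer-loop sampling is not included).
        return (
            self.num_tasks
            * self.adaptation_steps
            * self.trajectories_per_task
            * self.horizon
        )


setup = MetaRLSetup()
print(setup.adaptation_transitions_per_iteration())  # 80000
```

Under these assumptions, each iteration collects 40 × 1 × 20 × 100 = 80,000 environment transitions for adaptation.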