Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Unifying Gradient Estimators for Meta-Reinforcement Learning via Off-Policy Evaluation
Authors: Yunhao Tang, Tadashi Kozuno, Mark Rowland, Remi Munos, Michal Valko
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We now carry out several empirical studies to complement the framework developed above. In Section 5.1, we use a tabular example to investigate the bias and variance trade-offs of various estimates, to assess the validity of our theoretical insights. In Section 5.2.1 and Section 5.2.2, we apply the new second-order estimate in high-dimensional meta-RL experiments, to assess the potential performance gains in a more practical setup. |
| Researcher Affiliation | Collaboration | Yunhao Tang* Columbia University EMAIL Tadashi Kozuno* University of Alberta EMAIL Mark Rowland DeepMind London EMAIL Rémi Munos DeepMind Paris EMAIL Michal Valko DeepMind Paris EMAIL |
| Pseudocode | Yes | Algorithm 1 Pseudocode for computing meta-gradients for the MAML objective ... Algorithm 2 Example: an off-policy evaluation subroutine for computing the DR estimate |
| Open Source Code | Yes | We open source the code to reproduce our results (footnote 1: https://github.com/robintyh1/neurips2021-meta-gradient-offpolicy-evaluation). |
| Open Datasets | No | The paper describes using simulated environments (e.g., random MDPs, 2-D navigation task, MuJoCo locomotion tasks) and sampling trajectories from them, rather than utilizing pre-existing, publicly accessible datasets with direct download links or formal citations. |
| Dataset Splits | No | The paper describes how trajectories are sampled during experiments ('B = 20 trajectories sampled from the environment'), but it does not specify explicit training, validation, and test dataset splits with percentages or counts for reproducibility. |
| Hardware Specification | No | The authors mention using 'a cluster' from OIST but do not provide specific details about the hardware components such as GPU models, CPU types, or memory specifications. |
| Software Dependencies | No | The paper mentions software like 'Tensorflow or PyTorch' and 'MuJoCo' but does not provide specific version numbers for any of these software dependencies. |
| Experiment Setup | Yes | Experiment setup. We adapt the open source code base by [24] and adopt exactly the same experimental setup as [24]. At each iteration, the agent samples n = 40 task variables. For each task, the agent carries out K = 1 adaptation computed based on B = 20 trajectories sampled from the environment, each of length T = 100. |
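
The quoted setup can be summarized as a small configuration sketch. This is only an illustration of the reported hyperparameters, not the authors' code: the class name `MetaRLSetup`, its field names, and the helper `adaptation_env_steps` are hypothetical, and the step count covers only the trajectories used for adaptation as described in the quote.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetaRLSetup:
    """Hypothetical container for the hyperparameters quoted in the table."""

    num_tasks: int = 40        # n: task variables sampled per iteration
    adaptation_steps: int = 1  # K: adaptation steps per task
    trajectories: int = 20     # B: trajectories sampled per adaptation step
    horizon: int = 100         # T: length of each trajectory

    def adaptation_env_steps(self) -> int:
        """Environment steps per iteration spent on adaptation data only.

        Assumes each adaptation step draws a fresh batch of B trajectories;
        with K = 1 this assumption does not change the result.
        """
        return (
            self.num_tasks
            * self.adaptation_steps
            * self.trajectories
            * self.horizon
        )


setup = MetaRLSetup()
print(setup.adaptation_env_steps())
```

Under these values the adaptation data amounts to 40 × 1 × 20 × 100 environment steps per iteration; any additional sampling the method performs after adaptation is not covered by the quoted text and is left out here.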