Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Doubly Robust Augmented Transfer for Meta-Reinforcement Learning
Authors: Yuankun Jiang, Nuowen Kan, Chenglin Li, Wenrui Dai, Junni Zou, Hongkai Xiong
NeurIPS 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We implement DRaT on an off-policy meta-RL baseline, and empirically show that it significantly outperforms other hindsight-based approaches on various sparse-reward MuJoCo locomotion tasks with varying dynamics and reward functions. |
| Researcher Affiliation | Academia | ¹Department of Computer Science and Engineering, ²Department of Electronic Engineering, Shanghai Jiao Tong University |
| Pseudocode | Yes | Algorithm 1 Doubly Robust augmented Transfer (DRaT) for Meta-RL |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code or provide a link to a code repository. |
| Open Datasets | No | The paper mentions using 'MuJoCo' environments and generating variations by 'randomly sampling the environment parameters' but does not provide concrete access information (link, DOI, specific citation) for a publicly available or open dataset. |
| Dataset Splits | No | The paper states that a 'test task set' is 'disjoint with the training task set' for evaluation, but it does not specify explicit percentages, sample counts, or specific methodologies for train/validation/test data splits within or across these tasks. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory specifications) used for running the experiments. |
| Software Dependencies | No | The paper mentions 'MuJoCo [13]' as a physics engine, but it does not provide specific version numbers for MuJoCo or any other software libraries or frameworks used in the implementation (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | For the training, we use a batch size of 256 for all environments, except for Humanoid, which uses 512 due to the larger state space. We sample 10 trajectories for meta-training for each task, and each trajectory contains 50 timesteps. For evaluation, we sample 5 trajectories each containing 50 timesteps. For the Adam optimizer, the learning rate for the meta-critic and meta-policy is set to 3e-4, while for the context network, it is set to 3e-4 or 3e-5 depending on the environment. The discount factor γ is set to 0.99. The update frequency for the target network is 1. We also set the Soft Actor-Critic's temperature parameter α to 0.2. |
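
The hyperparameters quoted in the Experiment Setup row could be collected into a single configuration object, as in the minimal sketch below. The class `DRaTConfig`, the helper `make_config`, and all field names are hypothetical and do not come from the paper, which releases no code. The quoted text does not say which environments use a context-network learning rate of 3e-4 and which use 3e-5, so the sketch requires the caller to supply that value. Matching the environment name on the substring "humanoid" is likewise an assumption.

```python
from dataclasses import dataclass

# Hypothetical container for the values quoted in the Experiment Setup row.
# Field names are illustrative; the paper does not publish a config schema.
@dataclass(frozen=True)
class DRaTConfig:
    env_name: str
    context_lr: float                 # 3e-4 or 3e-5, depending on the environment
    batch_size: int = 256             # 512 for Humanoid
    meta_train_trajectories: int = 10 # trajectories sampled per task in meta-training
    eval_trajectories: int = 5        # trajectories sampled for evaluation
    trajectory_length: int = 50       # timesteps per trajectory
    critic_lr: float = 3e-4           # Adam learning rate, meta-critic
    policy_lr: float = 3e-4           # Adam learning rate, meta-policy
    gamma: float = 0.99               # discount factor
    target_update_frequency: int = 1
    sac_alpha: float = 0.2            # Soft Actor-Critic temperature

ALLOWED_CONTEXT_LRS = (3e-4, 3e-5)

def make_config(env_name: str, context_lr: float) -> DRaTConfig:
    """Build a config, applying the Humanoid batch-size exception."""
    if context_lr not in ALLOWED_CONTEXT_LRS:
        raise ValueError(f"context_lr must be one of {ALLOWED_CONTEXT_LRS}")
    # Assumption: the environment is identified by a name containing "humanoid".
    batch_size = 512 if "humanoid" in env_name.lower() else 256
    return DRaTConfig(env_name=env_name, context_lr=context_lr, batch_size=batch_size)
```

For example, `make_config("Humanoid", 3e-5)` would yield a batch size of 512, while any other environment name keeps the default of 256. The pairing of that learning rate with Humanoid is purely illustrative.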