Doubly Robust Augmented Transfer for Meta-Reinforcement Learning

Authors: Yuankun Jiang, Nuowen Kan, Chenglin Li, Wenrui Dai, Junni Zou, Hongkai Xiong

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We implement DRaT on an off-policy meta-RL baseline, and empirically show that it significantly outperforms other hindsight-based approaches on various sparse-reward MuJoCo locomotion tasks with varying dynamics and reward functions.
Researcher Affiliation | Academia | 1 Department of Computer Science and Engineering, 2 Department of Electronic Engineering, Shanghai Jiao Tong University
Pseudocode | Yes | Algorithm 1: Doubly Robust augmented Transfer (DRaT) for Meta-RL
Open Source Code | No | The paper does not contain any explicit statement about releasing source code, nor does it provide a link to a code repository.
Open Datasets | No | The paper mentions using 'MuJoCo' environments and generating variations by 'randomly sampling the environment parameters', but does not provide concrete access information (link, DOI, specific citation) for a publicly available or open dataset.
Dataset Splits | No | The paper states that a 'test task set' is 'disjoint with the training task set' for evaluation, but it does not specify explicit percentages, sample counts, or specific methodologies for train/validation/test data splits within or across these tasks.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory specifications) used for running the experiments.
Software Dependencies | No | The paper mentions 'MuJoCo [13]' as a physics engine, but it does not provide specific version numbers for MuJoCo or for any other software libraries or frameworks used in the implementation (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | For training, we use a batch size of 256 for all environments, except for Humanoid, which uses 512 due to its larger state space. We sample 10 trajectories per task for meta-training, each containing 50 timesteps. For evaluation, we sample 5 trajectories, each containing 50 timesteps. For the Adam optimizer, the learning rate for the meta-critic and meta-policy is set to 3e-4, while for the context network it is set to 3e-4 or 3e-5 depending on the environment. The discount factor γ is set to 0.99. The update frequency for the target network is 1. We also set the Soft Actor-Critic temperature parameter α to 0.2.
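For quick reference, the hyperparameters quoted in the Experiment Setup row can be gathered into a single configuration. The sketch below is a hypothetical Python dictionary assembled only from those reported values; the name DRAT_CONFIG and all key names are illustrative and do not come from any released code, and per-environment choices (the Humanoid batch size, the context-network learning rate) still have to be resolved by the reader.

```python
# Hypothetical consolidation of the hyperparameters reported in the paper.
# Key names and structure are illustrative; no official code release is cited.
DRAT_CONFIG = {
    "batch_size": 256,               # 512 for Humanoid (larger state space)
    "meta_train_trajectories": 10,   # trajectories sampled per task for meta-training
    "train_trajectory_length": 50,   # timesteps per training trajectory
    "eval_trajectories": 5,          # trajectories sampled per task for evaluation
    "eval_trajectory_length": 50,    # timesteps per evaluation trajectory
    "optimizer": "Adam",
    "lr_meta_critic": 3e-4,
    "lr_meta_policy": 3e-4,
    "lr_context_network": 3e-4,      # 3e-5 for some environments
    "discount_gamma": 0.99,
    "target_update_frequency": 1,    # target network updated every step
    "sac_temperature_alpha": 0.2,    # Soft Actor-Critic temperature
}
```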