Doubly Robust Augmented Transfer for Meta-Reinforcement Learning

Authors: Yuankun Jiang, Nuowen Kan, Chenglin Li, Wenrui Dai, Junni Zou, Hongkai Xiong

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We implement DRaT on an off-policy meta-RL baseline, and empirically show that it significantly outperforms other hindsight-based approaches on various sparse-reward MuJoCo locomotion tasks with varying dynamics and reward functions.
Researcher Affiliation | Academia | 1 Department of Computer Science and Engineering, 2 Department of Electronic Engineering, Shanghai Jiao Tong University
Pseudocode | Yes | Algorithm 1: Doubly Robust augmented Transfer (DRaT) for Meta-RL
Open Source Code | No | The paper does not contain any explicit statement about releasing source code, nor does it provide a link to a code repository.
Open Datasets | No | The paper mentions using 'MuJoCo' environments and generating variations by 'randomly sampling the environment parameters', but does not provide concrete access information (link, DOI, specific citation) for a publicly available or open dataset.
Dataset Splits | No | The paper states that a 'test task set' is 'disjoint with the training task set' for evaluation, but it does not specify explicit percentages, sample counts, or specific methodologies for train/validation/test data splits within or across these tasks.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory specifications) used for running the experiments.
Software Dependencies | No | The paper mentions 'MuJoCo [13]' as a physics engine, but it does not provide specific version numbers for MuJoCo or for any other software libraries or frameworks used in the implementation (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | For training, we use a batch size of 256 for all environments, except for Humanoid, which uses 512 due to its larger state space. We sample 10 trajectories per task for meta-training, each containing 50 timesteps. For evaluation, we sample 5 trajectories, each containing 50 timesteps. For the Adam optimizer, the learning rate for the meta-critic and meta-policy is set to 3e-4, while for the context network it is set to 3e-4 or 3e-5 depending on the environment. The discount factor γ is set to 0.99. The update frequency for the target network is 1. We also set the Soft Actor-Critic temperature parameter α to 0.2.
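For quick reference, the hyperparameters quoted in the Experiment Setup row can be gathered into a single configuration. The sketch below is a hypothetical Python dictionary assembled only from those reported values; the name DRAT_CONFIG and all key names are illustrative and do not come from any released code, and per-environment choices (the Humanoid batch size, the context-network learning rate) still have to be resolved by the reader.

```python
# Hypothetical consolidation of the hyperparameters reported in the paper.
# Key names and structure are illustrative; no official code release is cited.
DRAT_CONFIG = {
    "batch_size": 256,               # 512 for Humanoid (larger state space)
    "meta_train_trajectories": 10,   # trajectories sampled per task for meta-training
    "train_trajectory_length": 50,   # timesteps per training trajectory
    "eval_trajectories": 5,          # trajectories sampled per task for evaluation
    "eval_trajectory_length": 50,    # timesteps per evaluation trajectory
    "optimizer": "Adam",
    "lr_meta_critic": 3e-4,
    "lr_meta_policy": 3e-4,
    "lr_context_network": 3e-4,      # 3e-5 for some environments
    "discount_gamma": 0.99,
    "target_update_frequency": 1,    # target network updated every step
    "sac_temperature_alpha": 0.2,    # Soft Actor-Critic temperature
}
```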