Doubly Robust Augmented Transfer for Meta-Reinforcement Learning
Authors: Yuankun Jiang, Nuowen Kan, Chenglin Li, Wenrui Dai, Junni Zou, Hongkai Xiong
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We implement DRaT on an off-policy meta-RL baseline, and empirically show that it significantly outperforms other hindsight-based approaches on various sparse-reward MuJoCo locomotion tasks with varying dynamics and reward functions. |
| Researcher Affiliation | Academia | Department of Computer Science and Engineering and Department of Electronic Engineering, Shanghai Jiao Tong University |
| Pseudocode | Yes | Algorithm 1 Doubly Robust augmented Transfer (DRaT) for Meta-RL (a background note on the doubly robust estimator follows after the table). |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code or provide a link to a code repository. |
| Open Datasets | No | The paper mentions using 'MuJoCo' environments and generating variations by 'randomly sampling the environment parameters' but does not provide concrete access information (link, DOI, specific citation) for a publicly available or open dataset. |
| Dataset Splits | No | The paper states that a 'test task set' is 'disjoint with the training task set' for evaluation, but it does not specify explicit percentages, sample counts, or specific methodologies for train/validation/test data splits within or across these tasks. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory specifications) used for running the experiments. |
| Software Dependencies | No | The paper mentions 'MuJoCo [13]' as a physics engine, but it does not provide specific version numbers for MuJoCo or any other software libraries or frameworks used in the implementation (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | For training, we use a batch size of 256 for all environments, except for Humanoid, which uses 512 due to its larger state space. We sample 10 trajectories per task for meta-training, each containing 50 timesteps. For evaluation, we sample 5 trajectories, each containing 50 timesteps. For the Adam optimizer, the learning rate for the meta-critic and meta-policy is set to 3e-4, while for the context network it is set to 3e-4 or 3e-5 depending on the environment. The discount factor γ is set to 0.99. The update frequency for the target network is 1. We also set the Soft Actor-Critic temperature parameter α to 0.2. |
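
To make the reported experiment setup easier to scan, the hyperparameters quoted in the 'Experiment Setup' row can be collected into a single configuration object. This is a minimal sketch, not code from the paper; the names (`DRaTConfig`, `config_for`, and all field names) are hypothetical, and only the numeric values come from the row above.

```python
from dataclasses import dataclass


@dataclass
class DRaTConfig:
    """Hyperparameters quoted in the 'Experiment Setup' row above.

    Field names are hypothetical; only the values are taken from the reported
    setup. The context-network learning rate is listed as 3e-4 or 3e-5
    'depending on the environment', so it remains a per-environment choice.
    """
    batch_size: int = 256              # 512 for Humanoid (larger state space)
    meta_train_trajectories: int = 10  # trajectories sampled per task for meta-training
    eval_trajectories: int = 5         # trajectories sampled per task for evaluation
    trajectory_length: int = 50        # timesteps per trajectory
    critic_lr: float = 3e-4            # Adam learning rate for the meta-critic
    policy_lr: float = 3e-4            # Adam learning rate for the meta-policy
    context_lr: float = 3e-4           # or 3e-5, depending on the environment
    discount: float = 0.99             # discount factor gamma
    target_update_freq: int = 1        # target-network update frequency
    sac_alpha: float = 0.2             # Soft Actor-Critic temperature


def config_for(env_name: str) -> DRaTConfig:
    """Return a per-environment config; Humanoid uses the larger batch size."""
    cfg = DRaTConfig()
    if env_name.lower().startswith("humanoid"):
        cfg.batch_size = 512
    return cfg
```

A caller would then do, for example, `cfg = config_for("Humanoid")` and pass `cfg` to the training loop; the environment-dependent context learning rate still has to be chosen by hand, since the paper's setup text does not list which environments use 3e-5.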
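
As background for the 'Pseudocode' row, the "doubly robust" in DRaT presumably refers to the standard doubly robust off-policy value estimator, which combines an importance-sampling correction with a learned critic/value model; the following is that general formulation, not a transcription of the paper's Algorithm 1, whose exact form may differ. With behavior policy $\mu$, target policy $\pi$, and learned approximations $\hat{Q}$ and $\hat{V}$, the step-wise recursion is

$$
V^{\mathrm{DR}}_t \;=\; \hat{V}(s_t) \;+\; \rho_t\!\left(r_t + \gamma\, V^{\mathrm{DR}}_{t+1} - \hat{Q}(s_t, a_t)\right),
\qquad
\rho_t = \frac{\pi(a_t \mid s_t)}{\mu(a_t \mid s_t)},
$$

with $V^{\mathrm{DR}}_{H+1} = 0$ at the end of a length-$H$ trajectory. The estimate remains consistent if either the importance weights or the value model is accurate, which is where the name "doubly robust" comes from; per its title and the quoted result row, the paper applies this idea to transferring (hindsight-relabeled) samples across meta-RL tasks.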