DARA: Dynamics-Aware Reward Augmentation in Offline Reinforcement Learning
Authors: Jinxin Liu, Hongyin Zhang, Donglin Wang
ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experimental evaluation demonstrates that DARA, by augmenting rewards in the source offline dataset, can acquire an adaptive policy for the target environment and yet significantly reduce the requirement of target offline data. With only modest amounts of target offline data, our performance consistently outperforms the prior offline RL methods in both simulated and real-world tasks. |
| Researcher Affiliation | Academia | Jinxin Liu (1,2,3), Hongyin Zhang (1), Donglin Wang (1,3). (1) Westlake University; (2) Zhejiang University; (3) Institute of Advanced Technology, Westlake Institute for Advanced Study. {liujinxin, zhanghongyin, wangdonglin}@westlake.edu.cn |
| Pseudocode | Yes | Algorithm 1 Framework for Dynamics-Aware Reward Augmentation (DARA) |
| Open Source Code | Yes | In supplementary material, we upload our source code and the collected offline dataset for the quadruped robot. |
| Open Datasets | Yes | Our experimental evaluation is conducted with publicly available D4RL (Fu et al., 2020) and NeoRL (Qin et al., 2021). ... In the sim2real setting (for the quadruped robot), we use the A1 dog from Unitree (Wang, 2020). |
| Dataset Splits | No | The paper refers to training on fractions of the data (e.g., 10% of D4RL data) and implicitly relies on D4RL's predefined splits, but neither the main text nor the appendices specify validation splits (exact percentages or sample counts) or a cross-validation methodology for its experiments. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU, GPU models, memory, or cloud instance types) used for running its experiments. |
| Software Dependencies | No | With the above prior knowledge, domain randomization and reward function, we train our behavior policy with SAC (Haarnoja et al., 2018) in PyBullet (Coumans & Bai, 2016–2021). |
| Experiment Setup | Yes | In our implementation, we set η = 0.1 for all simulated tasks and set η = 0.01 for the sim2real task. In Table 18, we also report the sensitivity of DARA on the hyper-parameter η. ... Both the behavior policy and value networks are Multilayer Perceptron (MLP) with 3 hidden layers, which have 256, 128 and 64 nodes. The activation function is the Tanh function, and the optimizer is Adam. |
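
The Research Type and Experiment Setup rows describe DARA as augmenting rewards in the source offline dataset with a dynamics-gap term scaled by the coefficient η. The following is a minimal sketch of that augmentation step, assuming a classifier-based gap estimator supplied by the caller; all function and field names are illustrative and are not the authors' released code.

```python
# A minimal sketch (not the authors' released code) of dynamics-aware reward
# augmentation: shift each source-domain reward by an estimated dynamics gap,
# scaled by the trade-off coefficient eta.
import numpy as np

def augment_rewards(source_batch, delta_r_fn, eta=0.1):
    """Return a copy of the source offline batch with dynamics-aware rewards.

    source_batch: dict with arrays "s", "a", "s_next", "r" from the source dataset.
    delta_r_fn:   callable estimating the per-transition dynamics gap, e.g.
                  log p_target(s'|s,a) - log p_source(s'|s,a) from a pair of
                  domain classifiers trained on source and target data (assumption).
    eta:          trade-off coefficient (the paper reports eta = 0.1 for the
                  simulated tasks and eta = 0.01 for the sim2real task).
    """
    gap = np.asarray(delta_r_fn(source_batch["s"], source_batch["a"], source_batch["s_next"]))
    out = dict(source_batch)
    out["r"] = np.asarray(source_batch["r"]) + eta * gap
    return out
```

Any downstream offline RL algorithm can then be trained on the augmented source batch together with the (small) target offline dataset, which is the setting the rows above summarize.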
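The Experiment Setup row also quotes the network configuration (MLPs with 256, 128 and 64 hidden units, Tanh activations, Adam optimizer). A minimal PyTorch sketch of that layout follows; the input/output dimensions and the learning rate are placeholders, not values taken from the paper.

```python
# A minimal PyTorch sketch of the quoted network layout: three hidden layers of
# 256, 128 and 64 units with Tanh activations, optimized with Adam.
import torch
import torch.nn as nn

def build_mlp(in_dim: int, out_dim: int, hidden=(256, 128, 64)) -> nn.Sequential:
    layers, last = [], in_dim
    for width in hidden:
        layers += [nn.Linear(last, width), nn.Tanh()]
        last = width
    layers.append(nn.Linear(last, out_dim))
    return nn.Sequential(*layers)

obs_dim, act_dim = 17, 6                     # placeholder task dimensions (assumption)
policy_net = build_mlp(obs_dim, act_dim)     # behavior policy network
value_net = build_mlp(obs_dim + act_dim, 1)  # value network
optimizer = torch.optim.Adam(
    list(policy_net.parameters()) + list(value_net.parameters()),
    lr=3e-4,                                 # placeholder learning rate (assumption)
)
```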