DARA: Dynamics-Aware Reward Augmentation in Offline Reinforcement Learning

Authors: Jinxin Liu, Hongyin Zhang, Donglin Wang

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The experimental evaluation demonstrates that DARA, by augmenting rewards in the source offline dataset, can acquire an adaptive policy for the target environment and yet significantly reduce the requirement of target offline data. With only modest amounts of target offline data, DARA consistently outperforms prior offline RL methods in both simulated and real-world tasks.
Researcher Affiliation | Academia | Jinxin Liu (1,2,3), Hongyin Zhang (1), Donglin Wang (1,3). 1: Westlake University; 2: Zhejiang University; 3: Institute of Advanced Technology, Westlake Institute for Advanced Study. Contact: {liujinxin, zhanghongyin, wangdonglin}@westlake.edu.cn
Pseudocode | Yes | Algorithm 1: Framework for Dynamics-Aware Reward Augmentation (DARA). (A hedged sketch of the reward-augmentation step is given after this table.)
Open Source Code | Yes | In supplementary material, we upload our source code and the collected offline dataset for the quadruped robot.
Open Datasets | Yes | Our experimental evaluation is conducted with the publicly available D4RL (Fu et al., 2020) and NeoRL (Qin et al., 2021) benchmarks. ... In the sim2real setting (for the quadruped robot), we use the A1 dog from Unitree (Wang, 2020).
Dataset Splits | No | The paper refers to D4RL's predefined splits and to training on percentages of data (e.g., 10% of the D4RL data), but neither the main text nor the appendices explicitly specify validation splits (exact percentages or sample counts) or a cross-validation methodology for its experiments.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU or GPU models, memory, or cloud instance types) used for running its experiments.
Software Dependencies | No | With the above prior knowledge, domain randomization and reward function, we train our behavior policy with SAC (Haarnoja et al., 2018) in PyBullet (Coumans & Bai, 2016–2021).
Experiment Setup | Yes | In our implementation, we set η = 0.1 for all simulated tasks and η = 0.01 for the sim2real task. In Table 18, we also report the sensitivity of DARA to the hyper-parameter η. ... Both the behavior policy and value networks are multilayer perceptrons (MLPs) with 3 hidden layers of 256, 128, and 64 nodes. The activation function is Tanh, and the optimizer is Adam. (An illustrative configuration sketch follows the table.)
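For readers approaching Algorithm 1, the following is a minimal Python/PyTorch sketch of the dynamics-aware reward augmentation idea: source-domain rewards are relabeled with a dynamics-gap term Δr(s, a, s') = log p_target(s'|s, a) − log p_source(s'|s, a), estimated here with two binary domain classifiers in the style of DARC-like estimators that DARA builds on. The classifier architectures, variable names, and the `DynamicsGap` / `augment_rewards` helpers are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of dynamics-aware reward augmentation with domain classifiers.
# Assumptions (not taken verbatim from the paper): classifier architectures,
# variable names, and the use of two binary classifiers q_sas / q_sa to
# estimate the dynamics gap.
import torch
import torch.nn as nn
import torch.nn.functional as F

def mlp(in_dim, out_dim, hidden=(256, 128, 64)):
    """3-hidden-layer MLP with Tanh activations (matching the reported setup)."""
    layers, d = [], in_dim
    for h in hidden:
        layers += [nn.Linear(d, h), nn.Tanh()]
        d = h
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)

class DynamicsGap(nn.Module):
    """Estimates Delta r(s, a, s') = log p_target(s'|s,a) - log p_source(s'|s,a)
    from two binary domain classifiers trained on source/target transitions."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        # q_sas(s, a, s'): logits for (source, target) given the full transition
        self.q_sas = mlp(2 * obs_dim + act_dim, 2)
        # q_sa(s, a): logits for (source, target) given only the state-action pair
        self.q_sa = mlp(obs_dim + act_dim, 2)

    def delta_r(self, s, a, s_next):
        sas_logp = F.log_softmax(self.q_sas(torch.cat([s, a, s_next], -1)), -1)
        sa_logp = F.log_softmax(self.q_sa(torch.cat([s, a], -1)), -1)
        # log q(target|s,a,s') - log q(source|s,a,s')
        #   - log q(target|s,a) + log q(source|s,a)
        return (sas_logp[:, 1] - sas_logp[:, 0]) - (sa_logp[:, 1] - sa_logp[:, 0])

def augment_rewards(batch, gap_model, eta=0.1):
    """Relabel source-domain rewards: r_hat = r + eta * Delta r.

    `batch` is assumed to hold 2D tensors "obs", "act", "next_obs" and a
    1D tensor "rew" of matching batch size.
    """
    with torch.no_grad():
        dr = gap_model.delta_r(batch["obs"], batch["act"], batch["next_obs"])
    return batch["rew"] + eta * dr
```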
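Continuing the sketch above, here is an illustrative instantiation matching the reported experiment setup (3-hidden-layer MLPs with 256/128/64 Tanh units, Adam, η = 0.1 for simulated tasks and η = 0.01 for sim2real). The observation/action dimensions and the learning rate are assumptions for illustration only.

```python
# Illustrative configuration; dims and learning rate are assumptions.
obs_dim, act_dim = 17, 6                 # e.g., a MuJoCo locomotion task (assumed)
policy = mlp(obs_dim, act_dim)           # behavior policy network (256/128/64, Tanh)
q_value = mlp(obs_dim + act_dim, 1)      # value network (256/128/64, Tanh)
gap_model = DynamicsGap(obs_dim, act_dim)

optimizer = torch.optim.Adam(
    list(policy.parameters()) + list(q_value.parameters()), lr=3e-4  # lr assumed
)

eta = 0.1   # simulated tasks (per the paper); eta = 0.01 for the sim2real task
```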