DARA: Dynamics-Aware Reward Augmentation in Offline Reinforcement Learning
Authors: Jinxin Liu, Hongyin Zhang, Donglin Wang
ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experimental evaluation demonstrates that DARA, by augmenting rewards in the source offline dataset, can acquire an adaptive policy for the target environment and yet significantly reduce the requirement of target offline data. With only modest amounts of target offline data, our performance consistently outperforms the prior offline RL methods in both simulated and real-world tasks. |
| Researcher Affiliation | Academia | Jinxin Liu (1,2,3), Hongyin Zhang (1), Donglin Wang (1,3). (1) Westlake University; (2) Zhejiang University; (3) Institute of Advanced Technology, Westlake Institute for Advanced Study. {liujinxin, zhanghongyin, wangdonglin}@westlake.edu.cn |
| Pseudocode | Yes | Algorithm 1 Framework for Dynamics-Aware Reward Augmentation (DARA) |
| Open Source Code | Yes | In supplementary material, we upload our source code and the collected offline dataset for the quadruped robot. |
| Open Datasets | Yes | Our experimental evaluation is conducted with publicly available D4RL (Fu et al., 2020) and NeoRL (Qin et al., 2021). ... In the sim2real setting (for the quadruped robot), we use the A1 dog from Unitree (Wang, 2020). |
| Dataset Splits | No | The paper refers to training on fractions of the data (e.g., 10% of D4RL data) and implicitly relies on D4RL's predefined splits, but neither the main text nor the appendices specify validation splits (exact percentages or sample counts) or a cross-validation methodology for its experiments. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU, GPU models, memory, or cloud instance types) used for running its experiments. |
| Software Dependencies | No | With the above prior knowledge, domain randomization and reward function, we train our behavior policy with SAC (Haarnoja et al., 2018) in PyBullet (Coumans & Bai, 2016–2021). |
| Experiment Setup | Yes | In our implementation, we set η = 0.1 for all simulated tasks and set η = 0.01 for the sim2real task. In Table 18, we also report the sensitivity of DARA on the hyper-parameter η. ... Both the behavior policy and value networks are Multilayer Perceptron (MLP) with 3 hidden layers, which have 256, 128 and 64 nodes. The activation function is the Tanh function, and the optimizer is Adam. |
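
The Research Type and Experiment Setup rows describe DARA as augmenting rewards in the source offline dataset with a dynamics-gap term scaled by the coefficient η. The following is a minimal sketch of that augmentation step, assuming a classifier-based gap estimator supplied by the caller; all function and field names are illustrative and are not the authors' released code.

```python
# A minimal sketch (not the authors' released code) of dynamics-aware reward
# augmentation: shift each source-domain reward by an estimated dynamics gap,
# scaled by the trade-off coefficient eta.
import numpy as np

def augment_rewards(source_batch, delta_r_fn, eta=0.1):
    """Return a copy of the source offline batch with dynamics-aware rewards.

    source_batch: dict with arrays "s", "a", "s_next", "r" from the source dataset.
    delta_r_fn:   callable estimating the per-transition dynamics gap, e.g.
                  log p_target(s'|s,a) - log p_source(s'|s,a) from a pair of
                  domain classifiers trained on source and target data (assumption).
    eta:          trade-off coefficient (the paper reports eta = 0.1 for the
                  simulated tasks and eta = 0.01 for the sim2real task).
    """
    gap = np.asarray(delta_r_fn(source_batch["s"], source_batch["a"], source_batch["s_next"]))
    out = dict(source_batch)
    out["r"] = np.asarray(source_batch["r"]) + eta * gap
    return out
```

Any downstream offline RL algorithm can then be trained on the augmented source batch together with the (small) target offline dataset, which is the setting the rows above summarize.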
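The Experiment Setup row also quotes the network configuration (MLPs with 256, 128 and 64 hidden units, Tanh activations, Adam optimizer). A minimal PyTorch sketch of that layout follows; the input/output dimensions and the learning rate are placeholders, not values taken from the paper.

```python
# A minimal PyTorch sketch of the quoted network layout: three hidden layers of
# 256, 128 and 64 units with Tanh activations, optimized with Adam.
import torch
import torch.nn as nn

def build_mlp(in_dim: int, out_dim: int, hidden=(256, 128, 64)) -> nn.Sequential:
    layers, last = [], in_dim
    for width in hidden:
        layers += [nn.Linear(last, width), nn.Tanh()]
        last = width
    layers.append(nn.Linear(last, out_dim))
    return nn.Sequential(*layers)

obs_dim, act_dim = 17, 6                     # placeholder task dimensions (assumption)
policy_net = build_mlp(obs_dim, act_dim)     # behavior policy network
value_net = build_mlp(obs_dim + act_dim, 1)  # value network
optimizer = torch.optim.Adam(
    list(policy_net.parameters()) + list(value_net.parameters()),
    lr=3e-4,                                 # placeholder learning rate (assumption)
)
```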