When to Trust Your Simulator: Dynamics-Aware Hybrid Offline-and-Online Reinforcement Learning

Authors: Haoyi Niu, Shubham Sharma, Yiwen Qiu, Ming Li, Guyue Zhou, Jianming Hu, Xianyuan Zhan

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we present empirical evaluations of H2O. We start by describing our experimental environment setups and the cross-domain RL baselines for comparison. We then evaluate H2O against the baseline methods in simulation environments and on a real wheel-legged robot. Ablation studies and empirical analyses of H2O are also reported.
Researcher Affiliation | Collaboration | 1 Tsinghua University, Beijing, China; 2 Indian Institute of Technology, Bombay, India; 3 Shanghai Jiaotong University, Shanghai, China; 4 Beijing National Research Center for Information Science and Technology, Beijing, China; 5 Shanghai AI Laboratory, Shanghai, China. {t6.da.thu,shubh.am1107z,qywmei,liming18739796090}@gmail.com, hujm@mail.tsinghua.edu.cn, {zhouguyue,zhanxianyuan}@air.tsinghua.edu.cn. This work is supported by funding from Haomo.AI.
Pseudocode | Yes | Algorithm 1: Dynamics-Aware Hybrid Offline-and-Online Reinforcement Learning (H2O). (A hedged structural sketch of such a hybrid update appears after the table.)
Open Source Code | Yes | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes]
Open Datasets | Yes | As for the offline dataset from the real world (original simulation environment), we use the datasets of the corresponding task from the standard offline RL benchmark D4RL [Fu et al., 2020]. (An illustrative dataset-loading snippet appears after the table.)
Dataset Splits | No | The paper uses standard D4RL datasets as its offline training data and evaluates policies in a separate 'real' environment, but it does not explicitly specify a validation split within these datasets or for its experimental setup.
Hardware Specification | No | The paper mentions software environments such as MuJoCo and Isaac Gym and refers generally to 'GPU-based physics simulation', but it does not provide specific hardware details such as GPU/CPU models or memory used for the experiments in the main text or appendix.
Software Dependencies | No | The paper mentions using the 'PyTorch framework' and the simulators 'MuJoCo' and 'Isaac Gym' with citations, but it does not specify their version numbers or other software dependencies with explicit version details.
Experiment Setup | Yes | For training of all neural networks (Q-network, policy network, and discriminators), we use the Adam optimizer [Kingma and Ba, 2014] with a learning rate of 3e-4. The batch size is 256. For SAC, we use discount factor γ = 0.99, target update coefficient τ = 0.005, and a replay buffer size of 1M. (A short hyperparameter sketch appears after the table.)
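The Open Datasets row points to D4RL as the source of the offline ("real") data. Below is a minimal sketch of how such a dataset is typically loaded with the `d4rl` Python package; the specific task name is an assumption for demonstration, not necessarily the dataset or quality level used in the paper.

```python
# Illustrative only: load a D4RL offline dataset to serve as the "real" data.
# The task name below is an assumption for demonstration, not necessarily the
# exact dataset/quality level used in the paper.
import gym
import d4rl  # importing d4rl registers its environments with gym

env = gym.make("halfcheetah-medium-v2")   # assumed task name
dataset = d4rl.qlearning_dataset(env)     # dict of NumPy arrays

for key in ("observations", "actions", "next_observations", "rewards", "terminals"):
    print(key, dataset[key].shape)
```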
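The Experiment Setup row lists concrete hyperparameters (Adam, learning rate 3e-4, batch size 256, γ = 0.99, τ = 0.005, 1M replay buffer). The PyTorch sketch below collects that configuration in one place; the network architectures and observation/action dimensions are assumptions, since the row does not specify them.

```python
# Sketch of the reported training hyperparameters (Adam, lr 3e-4, batch 256,
# gamma 0.99, tau 0.005, 1M replay buffer). Network sizes and dimensions are
# assumed for illustration, not taken from the paper.
import copy
import torch
import torch.nn as nn

GAMMA = 0.99              # discount factor
TAU = 0.005               # target-network soft-update coefficient
LR = 3e-4                 # Adam learning rate for all networks
BATCH_SIZE = 256
REPLAY_CAPACITY = 1_000_000

obs_dim, act_dim = 17, 6  # assumed dimensions, for illustration only

def mlp(in_dim, out_dim, hidden=256):
    # Simple two-hidden-layer MLP; the architecture is an assumption.
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

q_net = mlp(obs_dim + act_dim, 1)          # Q-network
policy_net = mlp(obs_dim, act_dim)         # deterministic head for brevity; SAC uses a squashed Gaussian
discriminator = mlp(obs_dim + act_dim, 1)  # real-vs-sim discriminator head
q_target = copy.deepcopy(q_net)            # target network

# All networks are trained with Adam at the reported learning rate.
q_opt = torch.optim.Adam(q_net.parameters(), lr=LR)
pi_opt = torch.optim.Adam(policy_net.parameters(), lr=LR)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=LR)

@torch.no_grad()
def soft_update(target, source, tau=TAU):
    # Polyak averaging of target-network parameters (tau = 0.005).
    for t, s in zip(target.parameters(), source.parameters()):
        t.mul_(1 - tau).add_(tau * s)
```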
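The Pseudocode row refers to the paper's Algorithm 1. The sketch below is not that algorithm; it is only a minimal structural outline of one dynamics-aware hybrid offline-and-online update, assuming a SAC/CQL-style critic, an offline 'real' batch mixed with an online simulator batch, and a discriminator whose output serves as a rough dynamics-gap weight. All function names, the deterministic policy head, and the exact weighting scheme are illustrative assumptions.

```python
# Not the paper's Algorithm 1: a hypothetical, minimal outline of one
# dynamics-aware hybrid update step. It assumes a SAC/CQL-style critic and a
# discriminator whose output approximates how "sim-like" a transition is;
# the weighting scheme and all names are illustrative assumptions.
import torch
import torch.nn.functional as F

def hybrid_update_step(q_net, q_target, policy, discriminator,
                       q_opt, pi_opt, d_opt,
                       real_batch, sim_batch, gamma=0.99, alpha=1.0):
    """real_batch / sim_batch: dicts of tensors with keys
    'obs' (B, obs_dim), 'act' (B, act_dim), 'rew' (B, 1),
    'next_obs' (B, obs_dim), 'done' (B, 1)."""

    # 1) Train the discriminator to tell real (offline) transitions from
    #    simulated (online) ones; its logits act as a dynamics-gap proxy.
    real_in = torch.cat([real_batch["obs"], real_batch["act"]], dim=-1)
    sim_in = torch.cat([sim_batch["obs"], sim_batch["act"]], dim=-1)
    real_logits = discriminator(real_in)
    sim_logits = discriminator(sim_in)
    d_loss = (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
              + F.binary_cross_entropy_with_logits(sim_logits, torch.zeros_like(sim_logits)))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # 2) Dynamics-gap weights for simulated samples: transitions the
    #    discriminator flags as clearly non-real get a larger penalty weight.
    with torch.no_grad():
        gap = torch.sigmoid(-discriminator(sim_in))   # roughly P(sim | s, a)
        weight = gap / (gap.mean() + 1e-6)            # normalized weight

    # 3) Critic: Bellman backups on both batches plus a conservative term on
    #    simulated samples, weighted by the dynamics gap (penalty strength alpha
    #    is an assumed placeholder).
    q_loss = 0.0
    for batch in (real_batch, sim_batch):
        with torch.no_grad():
            next_act = policy(batch["next_obs"])
            target_q = batch["rew"] + gamma * (1 - batch["done"]) * q_target(
                torch.cat([batch["next_obs"], next_act], dim=-1))
        q_pred = q_net(torch.cat([batch["obs"], batch["act"]], dim=-1))
        q_loss = q_loss + F.mse_loss(q_pred, target_q)
    q_loss = q_loss + alpha * (weight * q_net(sim_in)).mean() - alpha * q_net(real_in).mean()
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()

    # 4) Actor: maximize Q under the current policy on the combined batch.
    obs = torch.cat([real_batch["obs"], sim_batch["obs"]], dim=0)
    pi_loss = -q_net(torch.cat([obs, policy(obs)], dim=-1)).mean()
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()
```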