When to Trust Your Simulator: Dynamics-Aware Hybrid Offline-and-Online Reinforcement Learning
Authors: Haoyi Niu, Shubham Sharma, Yiwen Qiu, Ming Li, Guyue Zhou, Jianming Hu, Xianyuan Zhan
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we present empirical evaluations of H2O. We start by describing our experimental environment setups and the cross-domain RL baselines for comparison. We then evaluate H2O against the baseline methods in simulation environments and on a real wheel-legged robot. Ablation studies and empirical analyses of H2O are also reported. |
| Researcher Affiliation | Collaboration | 1 Tsinghua University, Beijing, China 2 Indian Institute of Technology, Bombay, India 3 Shanghai Jiaotong University, Shanghai, China 4 Beijing National Research Center for Information Science and Technology, Beijing, China 5 Shanghai AI Laboratory, Shanghai, China {t6.da.thu,shubh.am1107z,qywmei,liming18739796090}@gmail.com hujm@mail.tsinghua.edu.cn {zhouguyue,zhanxianyuan}@air.tsinghua.edu.cn This work is supported by funding from Haomo.AI. |
| Pseudocode | Yes | Algorithm 1: Dynamics-Aware Hybrid Offline-and-Online Reinforcement Learning (H2O) |
| Open Source Code | Yes | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] |
| Open Datasets | Yes | As for the offline dataset from the real world (original simulation environment), we use the datasets of the corresponding task from standard offline RL benchmark D4RL [Fu et al., 2020]. |
| Dataset Splits | No | The paper uses standard D4RL datasets as its offline training data and evaluates policies in a separate 'real' environment, but it does not explicitly specify a validation split within these datasets or for its experimental setup. |
| Hardware Specification | No | The paper mentions software environments like MuJoCo and Isaac Gym, and general 'GPU-based physics simulation', but does not provide specific hardware details such as GPU/CPU models or memory used for experiments in the main text or appendix. |
| Software Dependencies | No | The paper mentions using the 'PyTorch framework' and the simulators 'MuJoCo' and 'Isaac Gym' with citations, but does not specify their version numbers or other software dependencies with explicit version details. |
| Experiment Setup | Yes | For training all neural networks (Q-network, policy network, and discriminators), we use the Adam optimizer [Kingma and Ba, 2014] with a learning rate of 3e-4. The batch size is 256. For SAC, we use discount factor γ = 0.99, target update coefficient τ = 0.005, and a replay buffer size of 1M. |
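
The hyperparameters quoted in the Experiment Setup row translate directly into a training configuration. Below is a minimal sketch, assuming a PyTorch setup consistent with the paper's description; the values (learning rate, batch size, γ, τ, buffer size) come from the paper, while the config structure, function names, and the use of one Adam optimizer per network are illustrative assumptions, not the authors' released code.

```python
# Hedged sketch of the reported training hyperparameters.
# Network architectures and the H2O-specific losses are omitted; QNetwork,
# PolicyNetwork, and the discriminator modules are assumed to be defined elsewhere.
from dataclasses import dataclass

import torch
import torch.nn as nn


@dataclass
class TrainConfig:
    lr: float = 3e-4                     # Adam learning rate for all networks (from the paper)
    batch_size: int = 256                # minibatch size (from the paper)
    gamma: float = 0.99                  # SAC discount factor (from the paper)
    tau: float = 0.005                   # target-network soft-update coefficient (from the paper)
    replay_buffer_size: int = 1_000_000  # online replay buffer capacity (from the paper)


def build_optimizers(q_net: nn.Module,
                     policy_net: nn.Module,
                     discriminators: list[nn.Module],
                     cfg: TrainConfig) -> dict[str, torch.optim.Optimizer]:
    """Create one Adam optimizer per network; one-per-network is an assumption."""
    opts = {
        "q": torch.optim.Adam(q_net.parameters(), lr=cfg.lr),
        "policy": torch.optim.Adam(policy_net.parameters(), lr=cfg.lr),
    }
    for i, disc in enumerate(discriminators):
        opts[f"disc_{i}"] = torch.optim.Adam(disc.parameters(), lr=cfg.lr)
    return opts


def soft_update(target: nn.Module, source: nn.Module, tau: float) -> None:
    """SAC-style Polyak averaging of target-network parameters with coefficient tau."""
    with torch.no_grad():
        for t_param, s_param in zip(target.parameters(), source.parameters()):
            t_param.mul_(1.0 - tau).add_(tau * s_param)
```

This sketch only captures the optimizer and SAC bookkeeping stated in the paper; the dynamics-gap estimation and hybrid offline-and-online update that define H2O (Algorithm 1) are not reproduced here.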