A Policy-Guided Imitation Approach for Offline Reinforcement Learning

Authors: Haoran Xu, Li Jiang, Jianxiong Li, Xianyuan Zhan

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We test POR in widely-used D4RL offline RL benchmarks and demonstrate state-of-the-art performance. We also highlight the benefits of POR in terms of improving with supplementary suboptimal data and easily adapting to new tasks by only changing the guide-policy. Code is available at https://github.com/ryanxhr/POR." From Section 5 (Experiments): "We present empirical evaluations of POR in this section. We first evaluate POR against other baseline algorithms on the D4RL [12] benchmark datasets. We then look more closely at the guide-policy and the benefits of the decoupled training process. Finally, we conduct ablation studies on the execute-policy."
Researcher Affiliation | Collaboration | Haoran Xu, Li Jiang, Jianxiong Li, Xianyuan Zhan; affiliations: JD Technology, Beijing, China; Tsinghua University, Beijing, China; Shanghai AI Laboratory, Shanghai, China
Pseudocode | Yes | Algorithm 1: Policy-Guided Offline RL
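For orientation, the sketch below shows one plausible reading of the decoupled training that Algorithm 1 describes: a state-value function fit by expectile regression (the quoted τ = 0.7), a guide-policy that proposes next states, and an execute-policy that imitates dataset actions conditioned on (s, s'). All class and function names are ours, the deterministic policies are a simplification, and the guide-policy objective (value maximization plus an α-weighted behavior-cloning term) is our interpretation of the quoted "behavior cloning weight α"; the authors' exact objectives are in the paper and the linked repository.

```python
# Hedged sketch of the decoupled POR-style objectives; not the authors' exact code.
import torch
import torch.nn as nn
import torch.nn.functional as F

def mlp(in_dim, out_dim, hidden=256, depth=3):
    """Simple MLP matching the quoted "256 for 3 layers" sizing."""
    layers, d = [], in_dim
    for _ in range(depth):
        layers += [nn.Linear(d, hidden), nn.ReLU()]
        d = hidden
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)

class PORSketch(nn.Module):
    """Value / guide / execute networks trained with separate losses (names are ours)."""
    def __init__(self, obs_dim, act_dim, tau=0.7, alpha=0.1, gamma=0.99):
        super().__init__()
        self.v = mlp(obs_dim, 1)                  # state value V(s)
        self.guide = mlp(obs_dim, obs_dim)        # guide-policy: s -> proposed next state
        self.execute = mlp(2 * obs_dim, act_dim)  # execute-policy: (s, s') -> action
        self.tau, self.alpha, self.gamma = tau, alpha, gamma

    def value_loss(self, s, r, s_next, done):
        # Expectile regression toward the one-step target (IQL-style);
        # r and done are (batch, 1) tensors.
        with torch.no_grad():
            target = r + self.gamma * (1.0 - done) * self.v(s_next)
        diff = target - self.v(s)
        weight = torch.abs(self.tau - (diff < 0).float())  # tau if diff >= 0, else 1 - tau
        return (weight * diff.pow(2)).mean()

    def guide_loss(self, s, s_next):
        # Propose high-value next states while staying close to dataset next states.
        # Optimize only the guide-policy parameters with this loss.
        proposed = self.guide(s)
        bc = (proposed - s_next).pow(2).sum(dim=-1).mean()
        return -self.v(proposed).mean() + self.alpha * bc

    def execute_loss(self, s, a, s_next):
        # Supervised imitation of dataset actions conditioned on (s, s');
        # a deterministic simplification of the paper's execute-policy.
        pred = self.execute(torch.cat([s, s_next], dim=-1))
        return F.mse_loss(pred, a)
```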
Open Source Code | Yes | "Code is available at https://github.com/ryanxhr/POR."
Open Datasets | Yes | "We first evaluate our approach on D4RL MuJoCo and AntMaze datasets [12]." [12] Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint, 2020.
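As a usage note, these datasets are loadable through the d4rl package; the snippet below shows the standard access pattern (the task name halfcheetah-medium-v2 is only an illustrative example, not one singled out by this row).

```python
# Standard D4RL access pattern; requires the d4rl package (which pulls in gym and MuJoCo).
import gym
import d4rl  # noqa: F401  (importing registers the offline environments with gym)

env = gym.make("halfcheetah-medium-v2")    # example MuJoCo locomotion task
dataset = d4rl.qlearning_dataset(env)      # dict of NumPy arrays with transition tuples

print(dataset["observations"].shape,       # (N, obs_dim)
      dataset["actions"].shape,            # (N, act_dim)
      dataset["rewards"].shape,            # (N,)
      dataset["next_observations"].shape,  # (N, obs_dim)
      dataset["terminals"].shape)          # (N,)
```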
Dataset Splits | No | The paper uses D4RL datasets and evaluates models periodically during training ("evaluate every 5000 time steps"), but it does not explicitly specify a distinct validation split (e.g., percentages or sample counts for training, validation, and test sets).
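Because no validation split is given, the quoted "evaluate every 5000 time steps" points to the common stand-in of periodic online evaluation of the current policy. A minimal sketch, assuming a gym-style environment and a generic `policy(obs) -> action` callable (both placeholders, not names from the paper):

```python
import numpy as np

def evaluate(env, policy, episodes=10):
    """Average undiscounted return of `policy` over a few rollouts."""
    returns = []
    for _ in range(episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            obs, reward, done, _ = env.step(policy(obs))
            total += reward
        returns.append(total)
    return float(np.mean(returns))

# Inside the training loop (sketch): score the policy every 5000 gradient steps.
# for step in range(1, total_steps + 1):
#     ...train on a sampled batch...
#     if step % 5000 == 0:
#         print(step, evaluate(env, policy))
```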
Hardware Specification | Yes | "All experiments were performed on a cluster equipped with NVIDIA GeForce RTX 3090 GPUs."
Software Dependencies | Yes | "Our code is built upon PyTorch and Stable-Baselines3, using Python 3.8."
Experiment Setup | Yes | "Full experimental details are included in Appendix D.1 and the learning curves can be found in Appendix E." Appendix D (General Experimental Setup): "We use the Adam optimizer with learning rate 3e-4, batch size 256, and update every 100 gradient steps. The discount factor γ is set to 0.99 and the target update coefficient λ is set to 0.995. The expectile τ is set to 0.7 and the behavior cloning weight α is set to 0.1 for all locomotion datasets. The dropout rate is 0.0 for all layers. The latent dimension is 256 for all networks, and the network size is 256 for 3 layers."
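Collected in one place, the quoted hyperparameters amount to a configuration like the following. This is only a convenience transcription; the field names are ours, and the role of "update every 100 gradient steps" is left as quoted.

```python
from dataclasses import dataclass

@dataclass
class PORConfig:
    # Values as quoted from Appendix D for the locomotion datasets.
    learning_rate: float = 3e-4        # Adam optimizer
    batch_size: int = 256
    update_interval: int = 100         # "update every 100 gradient steps" (exact role not specified in the quote)
    discount: float = 0.99             # gamma
    target_update_coef: float = 0.995  # lambda (soft target update)
    expectile: float = 0.7             # tau
    bc_weight: float = 0.1             # alpha (behavior cloning weight)
    dropout: float = 0.0
    hidden_dim: int = 256              # "latent dimension is 256 for all networks"
    num_layers: int = 3                # "network size is 256 for 3 layers"
    eval_every: int = 5000             # "evaluate every 5000 time steps"
```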