A Policy-Guided Imitation Approach for Offline Reinforcement Learning
Authors: Haoran Xu, Li Jiang, Jianxiong Li, Xianyuan Zhan
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We test POR on the widely used D4RL offline RL benchmarks and demonstrate state-of-the-art performance. We also highlight the benefits of POR in terms of improving with supplementary suboptimal data and easily adapting to new tasks by changing only the guide-policy. Code is available at https://github.com/ryanxhr/POR. Section 5 (Experiments): We present empirical evaluations of POR in this section. We first evaluate POR against other baseline algorithms on the D4RL [12] benchmark datasets. We then examine the guide-policy more closely to show the benefits of the decoupled training process. We finally conduct ablation studies on the execute-policy. |
| Researcher Affiliation | Collaboration | Haoran Xu, Li Jiang, Jianxiong Li, Xianyuan Zhan; JD Technology, Beijing, China; Tsinghua University, Beijing, China; Shanghai AI Laboratory, Shanghai, China |
| Pseudocode | Yes | Algorithm 1: Policy-Guided Offline RL (a hedged sketch of this training loop is given after the table) |
| Open Source Code | Yes | Code is available at https://github.com/ryanxhr/POR. |
| Open Datasets | Yes | We first evaluate our approach on the D4RL MuJoCo and AntMaze datasets [12]. [12] Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint, 2020. (a D4RL loading sketch is given after the table) |
| Dataset Splits | No | The paper uses D4RL datasets and evaluates models periodically during training ("evaluate every 5000 time steps"), but it does not explicitly specify a distinct validation dataset split (e.g., in terms of percentages or sample counts for training, validation, and test sets). |
| Hardware Specification | Yes | All experiments were performed on a cluster equipped with NVIDIA GeForce RTX 3090 GPUs. |
| Software Dependencies | Yes | Our code is built upon PyTorch and Stable-Baselines3, using Python 3.8. |
| Experiment Setup | Yes | Full experimental details are included in Appendix D.1 and the learning curves can be found in Appendix E. Appendix D (General Experimental Setup): ... We use the Adam optimizer with learning rate 3e-4, batch size 256, and update every 100 gradient steps. The discount factor γ is set to 0.99 and the target update coefficient λ is set to 0.995. The expectile τ is set to 0.7 and the behavior cloning weight α is set to 0.1 for all locomotion datasets. The dropout rate is 0.0 for all layers. The latent dimension is 256 for all networks, with 3 hidden layers of width 256 each. (these hyperparameters are collected into a config sketch after the table) |
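
For readers who want a concrete picture of the decoupled training that the Pseudocode row points to (Algorithm 1), below is a minimal PyTorch sketch of the three per-batch losses it describes: a state-value function learned by expectile regression, a guide-policy regressed toward high-value next states, and an execute-policy that imitates the dataset action given a state and target state. The function and argument names, the advantage temperature, and the MSE parameterizations are assumptions made for illustration, not the authors' implementation.

```python
import torch

def por_style_losses(value_net, target_value_net, guide_net, execute_net,
                     batch, gamma=0.99, expectile=0.7, adv_temperature=3.0):
    """Per-batch losses following the high-level structure of Algorithm 1 (sketch only).

    value_net/target_value_net : state -> V(s), shape (B, 1)
    guide_net                  : state -> predicted target next state, shape (B, state_dim)
    execute_net                : concat(state, target state) -> action, shape (B, action_dim)
    batch                      : dict of tensors 's', 'a', 'r', 's_next', 'done'
    adv_temperature is an illustrative assumption, not a value from the paper.
    """
    s, a, r, s_next, done = (batch[k] for k in ("s", "a", "r", "s_next", "done"))

    # 1) State-value function via expectile regression on the TD target.
    with torch.no_grad():
        td_target = r + gamma * (1.0 - done) * target_value_net(s_next).squeeze(-1)
    diff = td_target - value_net(s).squeeze(-1)
    w_expectile = torch.where(diff > 0,
                              torch.full_like(diff, expectile),
                              torch.full_like(diff, 1.0 - expectile))
    value_loss = (w_expectile * diff.pow(2)).mean()

    # 2) Guide-policy: regress toward observed next states, weighted by the
    #    exponentiated advantage so that high-value transitions dominate.
    with torch.no_grad():
        adv = td_target - target_value_net(s).squeeze(-1)
        adv_weight = torch.exp(adv_temperature * adv).clamp(max=100.0)
    guide_loss = (adv_weight * (guide_net(s) - s_next).pow(2).sum(-1)).mean()

    # 3) Execute-policy: imitate the dataset action given the (state, next state) pair.
    execute_loss = (execute_net(torch.cat([s, s_next], dim=-1)) - a).pow(2).mean()

    return value_loss, guide_loss, execute_loss
```

At evaluation time the two halves compose: the guide-policy proposes a target state from the current state, and the execute-policy outputs the action that should reach it, which is what makes it possible to swap the guide-policy for a new task without retraining the execute-policy.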
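The Open Datasets row cites D4RL; a minimal loading sketch using the public `d4rl` package is shown below. The task name `halfcheetah-medium-v2` is only an example and can be swapped for any D4RL MuJoCo or AntMaze dataset.

```python
import gym
import d4rl  # registers the D4RL environments with gym

# Example dataset; substitute any D4RL MuJoCo or AntMaze task name.
env = gym.make("halfcheetah-medium-v2")
dataset = d4rl.qlearning_dataset(env)  # dict of NumPy arrays

observations = dataset["observations"]            # (N, state_dim)
actions = dataset["actions"]                      # (N, action_dim)
rewards = dataset["rewards"]                      # (N,)
next_observations = dataset["next_observations"]  # (N, state_dim)
terminals = dataset["terminals"]                  # (N,)

print(observations.shape, actions.shape)
```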
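Finally, the hyperparameters quoted in the Experiment Setup and Dataset Splits rows can be collected into a single configuration for reference. The dictionary layout and key names below are illustrative and do not mirror the repository's actual config files.

```python
# Hyperparameters quoted in the report (locomotion datasets); the dict layout is illustrative.
POR_CONFIG = {
    "optimizer": "Adam",
    "learning_rate": 3e-4,
    "batch_size": 256,
    "discount_gamma": 0.99,
    "target_update_coef": 0.995,   # soft target-network update coefficient λ
    "expectile_tau": 0.7,
    "bc_weight_alpha": 0.1,
    "dropout_rate": 0.0,
    "hidden_size": 256,
    "num_layers": 3,
    "eval_interval_steps": 5000,   # from the Dataset Splits row: evaluate every 5000 time steps
}
```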