Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
A Policy-Guided Imitation Approach for Offline Reinforcement Learning
Authors: Haoran Xu, Li Jiang, Jianxiong Li, Xianyuan Zhan
NeurIPS 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We test POR in widely-used D4RL offline RL benchmarks and demonstrates the state-of-the-art performance. We also highlight the benefits of POR in terms of improving with supplementary suboptimal data and easily adapting to new tasks by only changing the guide-policy. Code is available at https://github.com/ryanxhr/POR. 5 Experiments We present empirical evaluations of POR in this section. We first evaluate POR against other baseline algorithms on D4RL [12] benchmark datasets. We then explore deeper on the guide-policy about the benefits of the decoupled training process. We finally establish ablation studies on the execute-policy. |
| Researcher Affiliation | Collaboration | Haoran Xu, Li Jiang, Jianxiong Li, Xianyuan Zhan; JD Technology, Beijing, China; Tsinghua University, Beijing, China; Shanghai AI Laboratory, Shanghai, China |
| Pseudocode | Yes | Algorithm 1 Policy Guided Offline RL |
| Open Source Code | Yes | Code is available at https://github.com/ryanxhr/POR. |
| Open Datasets | Yes | We first evaluate our approach on D4RL MuJoCo and AntMaze datasets [12]. [12] Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint, 2020. |
| Dataset Splits | No | The paper uses D4RL datasets and evaluates models periodically during training ("evaluate every 5000 time steps"), but it does not explicitly specify a distinct validation dataset split (e.g., in terms of percentages or sample counts for training, validation, and test sets). |
| Hardware Specification | Yes | All experiments were performed on a cluster equipped with NVIDIA GeForce RTX 3090 GPUs. |
| Software Dependencies | Yes | Our code is built upon PyTorch and Stable-Baselines3, using Python 3.8. |
| Experiment Setup | Yes | Full experimental details are included in Appendix D.1 and the learning curve can be found in Appendix E. Appendix D: General Experimental Setup ... We use Adam optimizer with learning rate 3e-4, batch size 256, and update every 100 gradient steps. The discount factor γ is set to 0.99 and the target update coefficient λ is set to 0.995. The expectile τ is set to 0.7 and the behavior cloning weight α is set to 0.1 for all locomotion datasets. The dropout rate is 0.0 for all layers. The latent dimension is 256 for all networks, and the network size is 256 for 3 layers. |
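
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration object, as in the sketch below. It is illustrative only and is not taken from the authors' repository: the class name `SetupSketch` and all field and function names are hypothetical. The `expectile_loss` helper assumes the standard asymmetric squared loss used in expectile regression, and `soft_update` assumes that the coefficient λ weights the previous target value; this summary does not state the paper's exact objective or update convention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SetupSketch:
    """Values quoted in the Experiment Setup row; field names are hypothetical."""
    learning_rate: float = 3e-4   # Adam learning rate
    batch_size: int = 256
    discount: float = 0.99        # gamma
    target_coef: float = 0.995    # lambda, target update coefficient
    expectile: float = 0.7        # tau
    bc_weight: float = 0.1        # alpha, quoted for locomotion datasets
    dropout: float = 0.0
    hidden_dim: int = 256
    num_layers: int = 3


def expectile_loss(diff: float, tau: float) -> float:
    """Standard expectile loss |tau - 1(diff < 0)| * diff**2 (assumed form)."""
    weight = (1.0 - tau) if diff < 0 else tau
    return weight * diff * diff


def soft_update(target: float, online: float, coef: float) -> float:
    """Polyak averaging, assuming coef multiplies the old target value."""
    return coef * target + (1.0 - coef) * online


cfg = SetupSketch()
# With tau = 0.7, positive residuals are penalized more than negative ones.
positive_penalty = expectile_loss(2.0, cfg.expectile)
negative_penalty = expectile_loss(-2.0, cfg.expectile)
```

Under these assumptions, a residual of +2 is weighted by τ = 0.7 and a residual of -2 by 1 - τ = 0.3, which is what makes the fitted value lean toward the upper part of the target distribution.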