Variational Delayed Policy Optimization
Authors: Qingyuan Wu, Simon Zhan, Yixuan Wang, Yuhui Wang, Chung-Wei Lin, Chen Lv, Qi Zhu, Chao Huang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We not only provide a theoretical analysis of VDPO in terms of sample complexity and performance, but also empirically demonstrate that VDPO achieves performance consistent with SOTA methods while significantly improving sample efficiency (approximately 50% fewer samples) on the MuJoCo benchmark. |
| Researcher Affiliation | Academia | Qingyuan Wu (University of Southampton); Simon Sinong Zhan (Northwestern University); Yixuan Wang (Northwestern University); Yuhui Wang (King Abdullah University of Science and Technology); Chung-Wei Lin (National Taiwan University); Chen Lv (Nanyang Technological University); Qi Zhu (Northwestern University); Chao Huang (University of Southampton) |
| Pseudocode | Yes | The pseudocode of VDPO is summarized in Alg. 1 |
| Open Source Code | Yes | Code is available at https://github.com/QingyuanWuNothing/VDPO. |
| Open Datasets | Yes | We evaluate our VDPO in the MuJoCo benchmark [35]. |
| Dataset Splits | No | The paper does not provide specific dataset split information (e.g., exact percentages, sample counts, or detailed splitting methodology) for a validation set. |
| Hardware Specification | Yes | Each run of VDPO takes approximately 6 hours on 1 NVIDIA A100 GPU and 8 Intel Xeon CPUs. |
| Software Dependencies | No | The implementation of VDPO is based on CleanRL [16], and we also provide the code and guidelines to reproduce our results in the supplemental material. |
| Experiment Setup | Yes | The hyper-parameter settings are presented in Appendix A. We investigate sample efficiency (Sec. 4.2.1), followed by a performance comparison under different delay settings (Sec. 4.2.2). We also conduct an ablation study on the representation of VDPO (Sec. 4.2.3). Each method was run over 10 random seeds. The training curves can be found in Appendix E. |
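
The experiment-setup row above reports a 10-seed evaluation on MuJoCo tasks with a CleanRL-based implementation. As a rough illustration of that protocol only, the sketch below shows a seeded multi-run loop; `train_vdpo` is a hypothetical placeholder (here it just rolls out a random policy), and the task subset is an assumption, not the authors' configuration.

```python
# Hedged sketch of a multi-seed MuJoCo evaluation loop, assuming gymnasium[mujoco].
# train_vdpo is a hypothetical placeholder, NOT the authors' VDPO implementation.
import random

import gymnasium as gym
import numpy as np


def train_vdpo(env: gym.Env, seed: int) -> float:
    """Placeholder training/evaluation run; returns the episodic return of a random policy."""
    obs, _ = env.reset(seed=seed)
    total_reward, done = 0.0, False
    while not done:
        action = env.action_space.sample()  # stand-in for the learned delayed policy
        obs, reward, terminated, truncated, _ = env.step(action)
        total_reward += float(reward)
        done = terminated or truncated
    return total_reward


if __name__ == "__main__":
    # Illustrative subset of MuJoCo tasks; the paper evaluates the full benchmark.
    for task in ["HalfCheetah-v4", "Walker2d-v4"]:
        returns = []
        for seed in range(10):  # 10 random seeds, matching the reported setup
            random.seed(seed)
            np.random.seed(seed)
            env = gym.make(task)
            returns.append(train_vdpo(env, seed))
            env.close()
        print(f"{task}: mean={np.mean(returns):.1f}, std={np.std(returns):.1f}")
```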