Variational Delayed Policy Optimization

Authors: Qingyuan Wu, Simon Zhan, Yixuan Wang, Yuhui Wang, Chung-Wei Lin, Chen Lv, Qi Zhu, Chao Huang

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We not only provide a theoretical analysis of VDPO in terms of sample complexity and performance, but also empirically demonstrate that VDPO can achieve consistent performance with SOTA methods, with a significant enhancement of sample efficiency (approximately 50% fewer samples) in the MuJoCo benchmark.
Researcher Affiliation | Academia | Qingyuan Wu (University of Southampton); Simon Sinong Zhan (Northwestern University); Yixuan Wang (Northwestern University); Yuhui Wang (King Abdullah University of Science and Technology); Chung-Wei Lin (National Taiwan University); Chen Lv (Nanyang Technological University); Qi Zhu (Northwestern University); Chao Huang (University of Southampton)
Pseudocode | Yes | The pseudocode of VDPO is summarized in Alg. 1.
Open Source Code | Yes | Code is available at https://github.com/QingyuanWuNothing/VDPO.
Open Datasets | Yes | We evaluate our VDPO in the MuJoCo benchmark [35].
Dataset Splits | No | The paper does not provide specific dataset split information (e.g., exact percentages, sample counts, or detailed splitting methodology) for a validation set.
Hardware Specification | Yes | Each run of VDPO takes approximately 6 hours on 1 NVIDIA A100 GPU and 8 Intel Xeon CPUs.
Software Dependencies | No | The implementation of VDPO is based on CleanRL [16], and we also provide the code and guidelines to reproduce our results in the supplemental material. (The paper names CleanRL but does not pin specific versions of its software dependencies.)
Experiment Setup | Yes | The setting of hyper-parameters is presented in Appendix A. We investigate the sample efficiency (Sec. 4.2.1), followed by performance comparison under different settings of delays (Sec. 4.2.2). We also conduct an ablation study on the representation of VDPO (Sec. 4.2.3). Each method was run over 10 random seeds. The training curves can be found in Appendix E.
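
For readers planning a reproduction, the sketch below shows how the paper's protocol of 10 random seeds per method could be scripted. It is a minimal illustration only: the entry-point name vdpo.py, the --env-id/--seed flags, and the task list are assumptions modeled on common CleanRL conventions, not details confirmed by the released repository.

import subprocess

# Illustrative MuJoCo tasks; the paper's exact task set is given in its experiments section.
ENV_IDS = ["HalfCheetah-v4", "Walker2d-v4", "Hopper-v4"]
NUM_SEEDS = 10  # matches the paper's "Each method was run over 10 random seeds."

for env_id in ENV_IDS:
    for seed in range(NUM_SEEDS):
        # One training run per (task, seed) pair; vdpo.py is a hypothetical entry point.
        subprocess.run(
            ["python", "vdpo.py", "--env-id", env_id, "--seed", str(seed)],
            check=True,
        )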