Reflective Policy Optimization

Authors: Yaozhong Gan, Renye Yan, Zhe Wu, Junliang Xing

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Theoretical analysis confirms that policy performance is monotonically improved and contracts the solution space, consequently expediting the convergence procedure. Empirical results demonstrate RPO's feasibility and efficacy in two reinforcement learning benchmarks, culminating in superior sample efficiency. To verify the effectiveness of the proposed RPO algorithm, we utilize several continuous and discrete environments from the MuJoCo (Todorov et al., 2012) and Atari games in OpenAI Gym (Brockman et al., 2016), extensively adopted in previous works.
Researcher Affiliation | Academia | Qiyuan Lab. Correspondence to: Junliang Xing <xingjunliang@qiyuanlab.com>.
Pseudocode | Yes | Algorithm 1 Reflective Policy Optimization (RPO)
Open Source Code | Yes | The source code of this work is available at https://github.com/Edgargan/RPO.
Open Datasets | Yes | We utilize several continuous and discrete environments from the MuJoCo (Todorov et al., 2012) and Atari games in OpenAI Gym (Brockman et al., 2016), extensively adopted in previous works. (See the environment-setup sketch after the table.)
Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits, which are typical for static datasets in supervised learning. In reinforcement learning, data is generated through interaction with environments, and the paper describes evaluation procedures rather than dataset split percentages or counts.
Hardware Specification | No | The paper does not explicitly describe the specific hardware used to run its experiments, such as GPU or CPU models or cloud computing instance types.
Software Dependencies | No | The paper mentions using the Adam optimizer and basing experiments on existing codebases (Queeney et al., 2021; Zhang, 2018) but does not provide specific version numbers for key software components like Python, PyTorch/TensorFlow, or CUDA.
Experiment Setup | Yes | For the experimental parameters, we use the default parameters from (Dhariwal et al., 2017; Henderson et al., 2018); for example, the discount factor is γ = 0.995, and we use the Adam optimizer (Kingma & Ba, 2015) throughout the training process. The learning rate is 3e-4, except for Humanoid, where it is 1e-5. For RPO, the clipping parameters are ϵ = 0.2 and ϵ1 = 0.1, and the weighted parameter is β = 0.3 on MuJoCo environments; we do not fine-tune them. (Configuration and surrogate-loss sketches based on these values follow the table.)
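
To illustrate the Open Datasets row, here is a minimal sketch of instantiating the benchmark environments with OpenAI Gym. The specific environment IDs and version suffixes are assumptions; the paper only names the MuJoCo and Atari suites.

```python
import gym  # assumes gym plus the MuJoCo and Atari dependencies are installed

# Representative tasks from the two suites named in the paper.
# The exact IDs and version suffixes below are assumptions, not taken from the paper.
mujoco_ids = ["Hopper-v3", "Walker2d-v3", "HalfCheetah-v3", "Humanoid-v3"]
atari_ids = ["BreakoutNoFrameskip-v4", "PongNoFrameskip-v4"]

for env_id in mujoco_ids + atari_ids:
    env = gym.make(env_id)
    obs = env.reset()  # classic Gym API; newer versions return (obs, info)
    print(env_id, env.observation_space, env.action_space)
    env.close()
```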
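The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration. The sketch below assumes a PyTorch training loop; the key names and the make_optimizer helper are illustrative, not taken from the authors' repository.

```python
import torch

# Hyperparameters quoted from the paper; the dictionary keys are illustrative.
rpo_config = {
    "gamma": 0.995,     # discount factor
    "lr": 3e-4,         # learning rate (1e-5 for Humanoid)
    "clip_eps": 0.2,    # clipping parameter epsilon
    "clip_eps1": 0.1,   # second clipping parameter epsilon_1
    "beta": 0.3,        # weighting parameter used on MuJoCo environments
}

def make_optimizer(policy_parameters, env_name):
    """Hypothetical helper: Adam optimizer with the per-environment learning rate."""
    lr = 1e-5 if "Humanoid" in env_name else rpo_config["lr"]
    return torch.optim.Adam(policy_parameters, lr=lr)
```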
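For reference on what the clipping parameter ϵ controls, the standard PPO clipped surrogate is sketched below. This is not the RPO objective of Algorithm 1, which additionally involves the second clipping parameter ϵ1 and the weight β; see the paper for its exact form.

```python
import torch

def ppo_clip_surrogate(ratio, advantage, eps=0.2):
    """Standard PPO clipped surrogate (to be maximized), shown only for reference.

    ratio:     pi_new(a|s) / pi_old(a|s), a tensor of probability ratios
    advantage: advantage estimates for the same state-action pairs
    """
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return torch.min(unclipped, clipped).mean()
```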