Reflective Policy Optimization
Authors: Yaozhong Gan, Renye Yan, Zhe Wu, Junliang Xing
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Theoretical analysis confirms that policy performance is monotonically improved and contracts the solution space, consequently expediting the convergence procedure. Empirical results demonstrate RPO's feasibility and efficacy in two reinforcement learning benchmarks, culminating in superior sample efficiency. To verify the effectiveness of the proposed RPO algorithm, we utilize several continuous and discrete environments from MuJoCo (Todorov et al., 2012) and Atari games in OpenAI Gym (Brockman et al., 2016) extensively adopted in previous works. |
| Researcher Affiliation | Academia | Qiyuan Lab. Correspondence to: Junliang Xing <xingjunliang@qiyuanlab.com>. |
| Pseudocode | Yes | Algorithm 1 Reflective Policy Optimization (RPO) |
| Open Source Code | Yes | The source code of this work is available at https://github.com/Edgargan/RPO. |
| Open Datasets | Yes | We utilize several continuous and discrete environments from MuJoCo (Todorov et al., 2012) and Atari games in OpenAI Gym (Brockman et al., 2016) extensively adopted in previous works. |
| Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits, which are typical of static datasets in supervised learning. In reinforcement learning, data is generated through interaction with environments, and the paper describes evaluation procedures rather than dataset splitting percentages or counts. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware used to run its experiments, such as GPU or CPU models, or cloud computing instance types. |
| Software Dependencies | No | The paper mentions using the Adam optimizer and basing experiments on existing codebases (Queeney et al., 2021; Zhang, 2018) but does not provide specific version numbers for key software components like Python, PyTorch/TensorFlow, or CUDA. |
| Experiment Setup | Yes | For the experimental parameters, we use the default parameters from (Dhariwal et al., 2017; Henderson et al., 2018); for example, the discount factor is γ = 0.995, and we use the Adam optimizer (Kingma & Ba, 2015) throughout the training process. The learning rate is ϕ = 3e-4, except for Humanoid where it is 1e-5. For RPO, the clipping parameters are ϵ = 0.2 and ϵ1 = 0.1, and the weighting parameter is β = 0.3 on MuJoCo environments; we do not fine-tune them. |
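
For readers who want a concrete picture of how the reported settings fit together, below is a minimal sketch in Python. It assumes a standard PPO-style clipped surrogate as a stand-in for RPO's full objective, whose reflective second-step term (governed by ϵ1 = 0.1 and β = 0.3) is defined in the paper's Algorithm 1 and is not reproduced in this section. The environment IDs, network architecture, and function names are illustrative assumptions; only the hyperparameter values come from the table above.

```python
"""Sketch of the reported experimental setup (assumptions noted inline)."""
import gym
import torch
import torch.nn as nn

# Hyperparameters reported in the Experiment Setup row.
GAMMA = 0.995      # discount factor
LR = 3e-4          # learning rate (1e-5 for Humanoid)
EPS_CLIP = 0.2     # clipping parameter epsilon
EPS_1 = 0.1        # RPO's second clipping parameter (not used in this sketch)
BETA = 0.3         # RPO's weighting parameter (not used in this sketch)

# MuJoCo and Atari environments from OpenAI Gym; these specific IDs are
# illustrative examples, not a claim about the paper's full benchmark list.
mujoco_env = gym.make("Humanoid-v2")
atari_env = gym.make("BreakoutNoFrameskip-v4")

# A small Gaussian-mean policy network for continuous control (illustrative).
obs_dim = mujoco_env.observation_space.shape[0]
act_dim = mujoco_env.action_space.shape[0]
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))

# Adam optimizer, as reported in the paper.
optimizer = torch.optim.Adam(policy.parameters(), lr=LR)


def clipped_surrogate(new_logp, old_logp, adv):
    """Standard PPO clipped surrogate loss; RPO's reflective term is omitted."""
    ratio = torch.exp(new_logp - old_logp)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - EPS_CLIP, 1.0 + EPS_CLIP) * adv
    return -torch.min(unclipped, clipped).mean()
```

In an actual reproduction, the loss above would be replaced by the RPO objective from the released code at https://github.com/Edgargan/RPO, which incorporates the additional clipping parameter ϵ1 and weight β on subsequent-state terms.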