Reflective Policy Optimization

Authors: Yaozhong Gan, Renye Yan, Zhe Wu, Junliang Xing

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Theoretical analysis confirms that policy performance is monotonically improved and contracts the solution space, consequently expediting the convergence procedure. Empirical results demonstrate RPO's feasibility and efficacy in two reinforcement learning benchmarks, culminating in superior sample efficiency. To verify the effectiveness of the proposed RPO algorithm, we utilize several continuous and discrete environments from the MuJoCo (Todorov et al., 2012) and Atari games in OpenAI Gym (Brockman et al., 2016), extensively adopted in previous works.
Researcher Affiliation | Academia | Qiyuan Lab. Correspondence to: Junliang Xing <xingjunliang@qiyuanlab.com>.
Pseudocode | Yes | Algorithm 1 Reflective Policy Optimization (RPO)
Open Source Code | Yes | The source code of this work is available at https://github.com/Edgargan/RPO.
Open Datasets | Yes | We utilize several continuous and discrete environments from the MuJoCo (Todorov et al., 2012) and Atari games in OpenAI Gym (Brockman et al., 2016), extensively adopted in previous works. (See the environment-setup sketch after the table.)
Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits, which are typical for static datasets in supervised learning. In reinforcement learning, data is generated through interaction with environments, and the paper describes evaluation procedures rather than dataset split percentages or counts.
Hardware Specification | No | The paper does not explicitly describe the specific hardware used to run its experiments, such as GPU or CPU models or cloud computing instance types.
Software Dependencies | No | The paper mentions using the Adam optimizer and basing experiments on existing codebases (Queeney et al., 2021; Zhang, 2018) but does not provide specific version numbers for key software components like Python, PyTorch/TensorFlow, or CUDA.
Experiment Setup | Yes | For the experimental parameters, we use the default parameters from (Dhariwal et al., 2017; Henderson et al., 2018); for example, the discount factor is γ = 0.995, and we use the Adam optimizer (Kingma & Ba, 2015) throughout the training process. The learning rate is 3e-4, except for Humanoid, where it is 1e-5. For RPO, the clipping parameters are ϵ = 0.2 and ϵ1 = 0.1, and the weighted parameter is β = 0.3 on MuJoCo environments; we do not fine-tune them. (Configuration and surrogate-loss sketches based on these values follow the table.)
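
To illustrate the Open Datasets row, here is a minimal sketch of instantiating the benchmark environments with OpenAI Gym. The specific environment IDs and version suffixes are assumptions; the paper only names the MuJoCo and Atari suites.

```python
import gym  # assumes gym plus the MuJoCo and Atari dependencies are installed

# Representative tasks from the two suites named in the paper.
# The exact IDs and version suffixes below are assumptions, not taken from the paper.
mujoco_ids = ["Hopper-v3", "Walker2d-v3", "HalfCheetah-v3", "Humanoid-v3"]
atari_ids = ["BreakoutNoFrameskip-v4", "PongNoFrameskip-v4"]

for env_id in mujoco_ids + atari_ids:
    env = gym.make(env_id)
    obs = env.reset()  # classic Gym API; newer versions return (obs, info)
    print(env_id, env.observation_space, env.action_space)
    env.close()
```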
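The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration. The sketch below assumes a PyTorch training loop; the key names and the make_optimizer helper are illustrative, not taken from the authors' repository.

```python
import torch

# Hyperparameters quoted from the paper; the dictionary keys are illustrative.
rpo_config = {
    "gamma": 0.995,     # discount factor
    "lr": 3e-4,         # learning rate (1e-5 for Humanoid)
    "clip_eps": 0.2,    # clipping parameter epsilon
    "clip_eps1": 0.1,   # second clipping parameter epsilon_1
    "beta": 0.3,        # weighting parameter used on MuJoCo environments
}

def make_optimizer(policy_parameters, env_name):
    """Hypothetical helper: Adam optimizer with the per-environment learning rate."""
    lr = 1e-5 if "Humanoid" in env_name else rpo_config["lr"]
    return torch.optim.Adam(policy_parameters, lr=lr)
```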
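For reference on what the clipping parameter ϵ controls, the standard PPO clipped surrogate is sketched below. This is not the RPO objective of Algorithm 1, which additionally involves the second clipping parameter ϵ1 and the weight β; see the paper for its exact form.

```python
import torch

def ppo_clip_surrogate(ratio, advantage, eps=0.2):
    """Standard PPO clipped surrogate (to be maximized), shown only for reference.

    ratio:     pi_new(a|s) / pi_old(a|s), a tensor of probability ratios
    advantage: advantage estimates for the same state-action pairs
    """
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return torch.min(unclipped, clipped).mean()
```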