Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Provably Convergent Policy Optimization via Metric-aware Trust Region Methods

Authors: Jun Song, Niao He, Lijun Ding, Chaoyue Zhao

TMLR 2023 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments across tabular domains, robotic locomotion, and continuous control tasks further demonstrate the performance improvement of both approaches, more robustness of WPO to sample insufficiency, and faster convergence of SPO, over state-of-art policy gradient methods.
Researcher Affiliation | Academia | Jun Song (EMAIL), Department of Industrial and Systems Engineering, University of Washington; Niao He (EMAIL), Department of Computer Science, ETH Zürich; Lijun Ding (EMAIL), Wisconsin Institute for Discovery, University of Wisconsin-Madison; Chaoyue Zhao (EMAIL), Department of Industrial and Systems Engineering, University of Washington
Pseudocode | Yes | Algorithm 1: On-policy WPO/SPO algorithm
Open Source Code | Yes | The code of our WPO/SPO can be found here (footnote 1: https://github.com/efficientwpo/EfficientWPO)
Open Datasets | Yes | Our experiments include (1) ablation study that focuses on sensitivity analysis of WPO and SPO; (2) tabular domain tasks with discrete state and action including the Taxi, Chain, and Cliff Walking environments; (3) locomotion tasks with continuous state and discrete action including the CartPole, Acrobot environments; (4) comparison of KL and Wasserstein trust regions under tabular domain and locomotion tasks; and (5) extension to continuous control tasks with continuous action including HalfCheetah, Hopper, Walker, and Ant environments from MuJoCo.
Dataset Splits | No | The paper uses standard RL environments (e.g., Taxi, CartPole, MuJoCo tasks) where data is generated through interaction rather than being drawn from a fixed dataset with predefined splits. The text mentions 'Collect trajectory set Dk on policy πk' but does not specify traditional train/test/validation splits for reproduction.
Hardware Specification | No | The paper discusses 'training wall-clock time' in Section 7.3, but does not provide specific hardware details (e.g., CPU, GPU models, or memory specifications) used for running the experiments. It only mentions runtimes without specifying the machines on which they were obtained.
Software Dependencies | No | We adopt the implementations of TRPO, PPO and A2C from OpenAI Baselines (Dhariwal et al., 2017) for MuJoCo tasks and Stable Baselines (Hill et al., 2018) for other tasks. For BGPG, we adopt the same implementation as (Pacchiano et al., 2020). This text mentions third-party software but does not specify versions for the dependencies used in the authors' own WPO/SPO implementation.
Experiment Setup | Yes | Our main experimental results are reported in Section 7. In addition, we provide the setting of hyperparameters and network sizes of our WPO/SPO algorithms in Table 3, and a summary of performance in Table 4. Table 3: Hyperparameters and network sizes (e.g., γ = 0.9, lr_value = 10^-2, π size 2D array, kβ = 250 for Taxi-v3)
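The report quotes the step "Collect trajectory set Dk on policy πk" from Algorithm 1 without further detail. As an illustration of what that on-policy data-collection step typically involves, here is a minimal, self-contained Python sketch. The toy chain environment, the uniform placeholder policy, and all function names are assumptions for illustration only; this is not the authors' implementation. Only γ = 0.9 is taken from the paper's Table 3.

```python
import random

GAMMA = 0.9  # discount factor, matching gamma = 0.9 in the paper's Table 3


class ToyChainEnv:
    """Illustrative 5-state chain: move left/right, reward 1 at the last state."""

    def __init__(self, n_states=5):
        self.n_states = n_states
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):  # action: 0 = left, 1 = right
        delta = 1 if action == 1 else -1
        self.state = max(0, min(self.n_states - 1, self.state + delta))
        done = self.state == self.n_states - 1
        return self.state, (1.0 if done else 0.0), done


def policy(state, rng):
    """Placeholder stochastic policy pi_k (uniform over the two actions)."""
    return rng.choice([0, 1])


def collect_trajectories(env, num_traj, max_steps, seed=0):
    """Roll out pi_k to build an on-policy trajectory set D_k.

    Each trajectory is a list of (state, action, reward) tuples.
    """
    rng = random.Random(seed)
    dataset = []
    for _ in range(num_traj):
        s = env.reset()
        traj = []
        for _ in range(max_steps):
            a = policy(s, rng)
            s_next, r, done = env.step(a)
            traj.append((s, a, r))
            s = s_next
            if done:
                break
        dataset.append(traj)
    return dataset


def discounted_return(traj, gamma=GAMMA):
    """Monte Carlo discounted return of one trajectory."""
    g = 0.0
    for _, _, r in reversed(traj):
        g = r + gamma * g
    return g
```

In the paper's actual algorithm, the collected set D_k then feeds a metric-aware trust-region update (Wasserstein for WPO, Sinkhorn for SPO); that policy-improvement subproblem is omitted from this sketch.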