Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Provably Convergent Policy Optimization via Metric-aware Trust Region Methods

Authors: Jun Song, Niao He, Lijun Ding, Chaoyue Zhao

TMLR 2023 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments across tabular domains, robotic locomotion, and continuous control tasks further demonstrate the performance improvement of both approaches, more robustness of WPO to sample insufficiency, and faster convergence of SPO, over state-of-art policy gradient methods.
Researcher Affiliation | Academia | Jun Song (EMAIL), Department of Industrial and Systems Engineering, University of Washington; Niao He (EMAIL), Department of Computer Science, ETH Zürich; Lijun Ding (EMAIL), Wisconsin Institute for Discovery, University of Wisconsin-Madison; Chaoyue Zhao (EMAIL), Department of Industrial and Systems Engineering, University of Washington
Pseudocode | Yes | Algorithm 1: On-policy WPO/SPO algorithm
Open Source Code | Yes | The code of our WPO/SPO can be found here (footnote 1: https://github.com/efficientwpo/EfficientWPO)
Open Datasets | Yes | Our experiments include (1) ablation study that focuses on sensitivity analysis of WPO and SPO; (2) tabular domain tasks with discrete state and action including the Taxi, Chain, and Cliff Walking environments; (3) locomotion tasks with continuous state and discrete action including the CartPole, Acrobot environments; (4) comparison of KL and Wasserstein trust regions under tabular domain and locomotion tasks; and (5) extension to continuous control tasks with continuous action including HalfCheetah, Hopper, Walker, and Ant environments from MuJoCo.
Dataset Splits | No | The paper uses standard RL environments (e.g., Taxi, CartPole, MuJoCo tasks) where data is generated through interaction rather than being drawn from a fixed dataset with predefined splits. The text mentions 'Collect trajectory set Dk on policy πk' but does not specify traditional train/test/validation splits for reproduction.
Hardware Specification | No | The paper discusses 'training wall-clock time' in Section 7.3, but does not provide specific hardware details (e.g., CPU, GPU models, or memory specifications) used for running the experiments. It only mentions runtimes without specifying the machines on which they were obtained.
Software Dependencies | No | We adopt the implementations of TRPO, PPO and A2C from OpenAI Baselines (Dhariwal et al., 2017) for MuJoCo tasks and Stable Baselines (Hill et al., 2018) for other tasks. For BGPG, we adopt the same implementation as (Pacchiano et al., 2020). This text mentions third-party software but does not specify versions for the dependencies used in the authors' own WPO/SPO implementation.
Experiment Setup | Yes | Our main experimental results are reported in Section 7. In addition, we provide the setting of hyperparameters and network sizes of our WPO/SPO algorithms in Table 3, and a summary of performance in Table 4. Table 3: Hyperparameters and network sizes (e.g., γ = 0.9, lr_value = 10^-2, π size 2D array, kβ = 250 for Taxi-v3)
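The report quotes the step "Collect trajectory set Dk on policy πk" from Algorithm 1 without further detail. As an illustration of what that on-policy data-collection step typically involves, here is a minimal, self-contained Python sketch. The toy chain environment, the uniform placeholder policy, and all function names are assumptions for illustration only; this is not the authors' implementation. Only γ = 0.9 is taken from the paper's Table 3.

```python
import random

GAMMA = 0.9  # discount factor, matching gamma = 0.9 in the paper's Table 3


class ToyChainEnv:
    """Illustrative 5-state chain: move left/right, reward 1 at the last state."""

    def __init__(self, n_states=5):
        self.n_states = n_states
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):  # action: 0 = left, 1 = right
        delta = 1 if action == 1 else -1
        self.state = max(0, min(self.n_states - 1, self.state + delta))
        done = self.state == self.n_states - 1
        return self.state, (1.0 if done else 0.0), done


def policy(state, rng):
    """Placeholder stochastic policy pi_k (uniform over the two actions)."""
    return rng.choice([0, 1])


def collect_trajectories(env, num_traj, max_steps, seed=0):
    """Roll out pi_k to build an on-policy trajectory set D_k.

    Each trajectory is a list of (state, action, reward) tuples.
    """
    rng = random.Random(seed)
    dataset = []
    for _ in range(num_traj):
        s = env.reset()
        traj = []
        for _ in range(max_steps):
            a = policy(s, rng)
            s_next, r, done = env.step(a)
            traj.append((s, a, r))
            s = s_next
            if done:
                break
        dataset.append(traj)
    return dataset


def discounted_return(traj, gamma=GAMMA):
    """Monte Carlo discounted return of one trajectory."""
    g = 0.0
    for _, _, r in reversed(traj):
        g = r + gamma * g
    return g
```

In the paper's actual algorithm, the collected set D_k then feeds a metric-aware trust-region update (Wasserstein for WPO, Sinkhorn for SPO); that policy-improvement subproblem is omitted from this sketch.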