Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Provably Convergent Policy Optimization via Metric-aware Trust Region Methods
Authors: Jun Song, Niao He, Lijun Ding, Chaoyue Zhao
TMLR 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments across tabular domains, robotic locomotion, and continuous control tasks further demonstrate the performance improvement of both approaches, the greater robustness of WPO to sample insufficiency, and the faster convergence of SPO, over state-of-the-art policy gradient methods. |
| Researcher Affiliation | Academia | Jun Song, Department of Industrial and Systems Engineering, University of Washington; Niao He, Department of Computer Science, ETH Zürich; Lijun Ding, Wisconsin Institute for Discovery, University of Wisconsin–Madison; Chaoyue Zhao, Department of Industrial and Systems Engineering, University of Washington |
| Pseudocode | Yes | Algorithm 1: On-policy WPO/SPO algorithm |
| Open Source Code | Yes | The code of our WPO/SPO can be found at https://github.com/efficientwpo/EfficientWPO |
| Open Datasets | Yes | Our experiments include (1) an ablation study that focuses on sensitivity analysis of WPO and SPO; (2) tabular domain tasks with discrete state and action, including the Taxi, Chain, and Cliff Walking environments; (3) locomotion tasks with continuous state and discrete action, including the CartPole and Acrobot environments; (4) comparison of KL and Wasserstein trust regions under tabular domain and locomotion tasks; and (5) extension to continuous control tasks with continuous action, including the HalfCheetah, Hopper, Walker, and Ant environments from MuJoCo. |
| Dataset Splits | No | The paper uses standard RL environments (e.g., Taxi, CartPole, MuJoCo tasks) where data is generated through interaction rather than being drawn from a fixed dataset with predefined splits. The text mentions 'Collect trajectory set Dk on policy πk' but does not specify traditional train/test/validation splits for reproduction. |
| Hardware Specification | No | The paper discusses 'training wall-clock time' in Section 7.3, but does not provide specific hardware details (e.g., CPU, GPU models, or memory specifications) used for running the experiments. It only mentions runtimes without specifying the machines on which they were obtained. |
| Software Dependencies | No | We adopt the implementations of TRPO, PPO and A2C from OpenAI Baselines (Dhariwal et al., 2017) for MuJoCo tasks and Stable Baselines (Hill et al., 2018) for other tasks. For BGPG, we adopt the same implementation as (Pacchiano et al., 2020). This text mentions third-party software but does not specify versions for the dependencies used in the authors' own WPO/SPO implementation. |
| Experiment Setup | Yes | Our main experimental results are reported in Section 7. In addition, we provide the settings of hyperparameters and network sizes of our WPO/SPO algorithms in Table 3, and a summary of performance in Table 4. Table 3 lists hyperparameters and network sizes (e.g., γ = 0.9, value-function learning rate 10^-2, a tabular (2D-array) policy, and k_β = 250 for Taxi-v3). |
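The Pseudocode row above points to "Algorithm 1: On-policy WPO/SPO algorithm." As a rough illustration of the *shape* of such an on-policy trust-region loop (not the paper's actual Wasserstein or Sinkhorn machinery), the following is a minimal, hypothetical sketch: a two-armed softmax bandit updated by a policy-gradient step that backtracks until a KL trust-region constraint is satisfied. All names (`trust_region_step`, `delta`, the bandit rewards) are illustrative assumptions, not from the paper.

```python
import math

def softmax(logits):
    """Convert logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl(p, q):
    """KL divergence KL(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def trust_region_step(logits, rewards, lr=0.5, delta=0.05):
    """One update: gradient step on expected reward, then backtrack
    (halve the step) until the new policy stays inside the KL trust
    region around the old policy.  Illustrative only; WPO/SPO use a
    Wasserstein/Sinkhorn trust region instead of KL."""
    old = softmax(logits)
    baseline = sum(p * r for p, r in zip(old, rewards))
    # Policy-gradient direction for a softmax bandit.
    grad = [p * (r - baseline) for p, r in zip(old, rewards)]
    step = lr
    while True:
        new_logits = [l + step * g for l, g in zip(logits, grad)]
        if kl(softmax(new_logits), old) <= delta:
            return new_logits
        step *= 0.5  # shrink until inside the trust region

# Run the loop on a toy bandit where action 0 pays 1 and action 1 pays 0.
logits = [0.0, 0.0]
rewards = [1.0, 0.0]
for _ in range(50):
    logits = trust_region_step(logits, rewards)
print(softmax(logits)[0])  # probability mass shifts toward the better action
```

The backtracking line search is one standard way to enforce a trust-region constraint without solving the constrained subproblem exactly; the paper's Algorithm 1 solves its metric-aware subproblem differently.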