Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Off-Policy Proximal Policy Optimization
Authors: Wenjia Meng, Qian Zheng, Gang Pan, Yilong Yin
AAAI 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, the experimental results on representative continuous control tasks validate that our method outperforms the state-of-the-art methods on most tasks. |
| Researcher Affiliation | Academia | Wenjia Meng1, Qian Zheng2,3, Gang Pan2,3, Yilong Yin1 1 School of Software, Shandong University, Jinan, China 2 The State Key Lab of Brain-Machine Intelligence, Zhejiang University, Hangzhou, China 3 College of Computer Science and Technology, Zhejiang University, Hangzhou, China |
| Pseudocode | Yes | Algorithm 1: Off-Policy PPO |
| Open Source Code | No | The paper mentions using existing open-source implementations for comparative methods but does not provide a link or explicit statement for the open-source code of their proposed Off-Policy PPO method. |
| Open Datasets | Yes | Experimental tasks consist of six representative continuous control tasks from OpenAI Gym (Brockman et al. 2016) and MuJoCo (Todorov, Erez, and Tassa 2012), which cover simple and complex tasks: Swimmer, Hopper, HalfCheetah, Walker2d, Ant, and Humanoid. |
| Dataset Splits | No | The paper describes collecting transitions and sampling off-policy data for training and updating networks, but does not explicitly mention or specify a validation dataset split. |
| Hardware Specification | Yes | The experiments are performed on a GPU server that has four Nvidia RTX 3090. |
| Software Dependencies | No | The paper mentions using the Adam optimizer and the ChainerRL implementation for DDPG, but does not provide specific version numbers for these or other key software components or libraries. |
| Experiment Setup | Yes | For hyperparameters, the trace-decay parameter λ is 0.95 and the discount factor γ is 0.99. The length of transitions (K) is set to be 1024. We use the Adam optimizer with learning rate α = 3 × 10⁻⁴. The epoch number N is 10. The minibatch size M is set to be 32. |
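The hyperparameters quoted in the Experiment Setup row can be collected into a single configuration sketch. The field names below are illustrative; the paper does not specify a configuration format:

```python
# Hyperparameters reported in the paper's experiment setup section.
# Key names are hypothetical; only the values are taken from the paper.
OFF_POLICY_PPO_CONFIG = {
    "trace_decay_lambda": 0.95,   # trace-decay parameter λ
    "discount_gamma": 0.99,       # discount factor γ
    "transition_length_K": 1024,  # length of collected transitions K
    "optimizer": "Adam",
    "learning_rate": 3e-4,        # α = 3 × 10⁻⁴
    "epochs_N": 10,               # epoch number N
    "minibatch_size_M": 32,       # minibatch size M
}

if __name__ == "__main__":
    for name, value in OFF_POLICY_PPO_CONFIG.items():
        print(f"{name}: {value}")
```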