Off-Policy Proximal Policy Optimization
Authors: Wenjia Meng, Qian Zheng, Gang Pan, Yilong Yin
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, the experimental results on representative continuous control tasks validate that our method outperforms the state-of-the-art methods on most tasks. |
| Researcher Affiliation | Academia | Wenjia Meng1, Qian Zheng2,3, Gang Pan2,3, Yilong Yin1 1 School of Software, Shandong University, Jinan, China 2 The State Key Lab of Brain-Machine Intelligence, Zhejiang University, Hangzhou, China 3 College of Computer Science and Technology, Zhejiang University, Hangzhou, China |
| Pseudocode | Yes | Algorithm 1: Off-Policy PPO |
| Open Source Code | No | The paper mentions using existing open-source implementations for comparative methods but does not provide a link or explicit statement for the open-source code of their proposed Off-Policy PPO method. |
| Open Datasets | Yes | Experimental tasks consist of six representative continuous control tasks from OpenAI Gym (Brockman et al. 2016) and MuJoCo (Todorov, Erez, and Tassa 2012), which cover simple and complex tasks: Swimmer, Hopper, HalfCheetah, Walker2d, Ant, and Humanoid. |
| Dataset Splits | No | The paper describes collecting transitions and sampling off-policy data for training and updating networks, but does not explicitly mention or specify a validation dataset split. |
| Hardware Specification | Yes | The experiments are performed on a GPU server that has four Nvidia RTX 3090. |
| Software Dependencies | No | The paper mentions using the Adam optimizer and the ChainerRL implementation for DDPG, but does not provide specific version numbers for these or other key software components or libraries. |
| Experiment Setup | Yes | For hyperparameters, the trace-decay parameter λ is 0.95 and the discount factor γ is 0.99. The length of transitions (K) is set to be 1024. We use the Adam optimizer with learning rate α = 3 × 10⁻⁴. The epoch number N is 10. The minibatch size M is set to be 32. |
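
For quick reference, the reported hyperparameters from the Experiment Setup row can be collected into a single configuration object, as in the minimal Python sketch below. The class and field names (e.g. `OffPolicyPPOConfig`) are illustrative assumptions, not taken from the paper; only the numeric values come from the quoted setup.

```python
from dataclasses import dataclass

# Hypothetical container for the hyperparameters reported in the paper's
# experiment setup; field names are illustrative, not the authors' own.
@dataclass
class OffPolicyPPOConfig:
    trace_decay: float = 0.95      # λ, trace-decay parameter
    discount: float = 0.99         # γ, discount factor
    transition_length: int = 1024  # K, number of collected transitions
    learning_rate: float = 3e-4    # α, Adam learning rate
    epochs: int = 10               # N, update epochs per batch
    minibatch_size: int = 32       # M, minibatch size

config = OffPolicyPPOConfig()
print(config)
```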