PPO-Clip Attains Global Optimality: Towards Deeper Understandings of Clipping

Authors: Nai-Chieh Huang, Ping-Chun Hsieh, Kuo-Hao Ho, I-Chen Wu

AAAI 2024

Reproducibility Variable Result LLM Response
Research Type Experimental We demonstrate the empirical evaluation of these variants in Section 6. Specifically, we evaluate Neural PPO-Clip, Neural PPO-Clip-sub (as introduced in Section 3), and two additional classifiers, log(πθ(a|s)/πθt(a|s)) and √ρs,a(θ) − 1 (termed Neural PPO-Clip-log and Neural PPO-Clip-root), against benchmark approaches in several RL benchmark environments. Our implementations of Neural PPO-Clip are based on the RL Baseline3 Zoo framework (Raffin 2020). We test the algorithms in both MinAtar (Young and Tian 2019) and OpenAI Gym environments (Brockman et al. 2016). In addition, the algorithms are compared against popular baselines, including A2C and Rainbow. A2C follows the implementation and default settings from RL Baseline3 Zoo. For Rainbow, we adopt the configuration from (Ceron and Castro 2021). Please refer to Appendix G for more details about our experiment settings. Variants of Neural PPO-Clip Achieve Comparable Empirical Performance. Figure 1 shows the training curves of Neural PPO-Clip with various classifiers and the benchmark methods.
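The variants quoted above differ only in the "classifier" applied to the probability ratio ρs,a(θ) = πθ(a|s)/πθt(a|s) before clipping. The sketch below illustrates that idea under stated assumptions: the function names, the ε default, and the exact way the classifier enters the clipped surrogate are illustrative guesses, not the paper's verified implementation (the authors' code is at the linked GitHub repository).

```python
import numpy as np

# Illustrative classifiers on the probability ratio rho = pi_theta / pi_theta_t.
# Names mirror the variants quoted from the paper; forms are assumptions.

def classifier_standard(rho):
    # Standard PPO-Clip deviation of the ratio from 1.
    return rho - 1.0

def classifier_log(rho):
    # "Neural PPO-Clip-log": log of the probability ratio.
    return np.log(rho)

def classifier_root(rho):
    # "Neural PPO-Clip-root": square root of the ratio, recentred at 0.
    return np.sqrt(rho) - 1.0

def clipped_surrogate(rho, advantage, eps=0.2, classifier=classifier_standard):
    # Generalized clipped surrogate (assumed form): clip the classifier output
    # to [-eps, eps] and take the pessimistic elementwise minimum, analogous
    # to the standard PPO-Clip objective.
    score = classifier(rho)
    return np.minimum(score * advantage,
                      np.clip(score, -eps, eps) * advantage)
```

With `classifier_standard`, a ratio of 2.0 and a positive advantage is capped at the clipped value 0.2 * A, reproducing the usual PPO-Clip pessimism; swapping in `classifier_log` or `classifier_root` changes only how far the ratio can drift before clipping bites.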
Researcher Affiliation Academia Nai-Chieh Huang, Ping-Chun Hsieh, Kuo-Hao Ho, I-Chen Wu Department of Computer Science, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
Pseudocode Yes Algorithm 1: Neural PPO-Clip; Algorithm 2: EMDA
Open Source Code Yes All the code is available at https://github.com/NYCU-RL-BanditsLab/Neural-PPO-Clip and the full version is provided at https://arxiv.org/abs/2312.12065.
Open Datasets Yes We test the algorithms in both MinAtar (Young and Tian 2019) and OpenAI Gym environments (Brockman et al. 2016).
Dataset Splits No The paper mentions using Min Atar and Open AI Gym environments, which are common benchmarks, but does not specify exact dataset splits (e.g., percentages for training, validation, or testing) or refer to standard predefined splits for these environments within the main text.
Hardware Specification No The paper does not explicitly describe the hardware used for running the experiments, such as specific GPU or CPU models, or cloud computing instance types.
Software Dependencies No The paper states: "Our implementations of Neural PPO-Clip are based on the RL Baseline3 Zoo framework (Raffin 2020)." While it names a framework, it does not provide specific version numbers for this framework or other key software dependencies like Python, PyTorch, etc.
Experiment Setup No The paper mentions adopting configurations from other works (e.g., "For Rainbow, we adopt the configuration from (Ceron and Castro 2021)") and refers to Appendix G for more details. However, it does not explicitly list specific hyperparameter values (e.g., learning rate, batch size, number of epochs) or system-level training settings within the main text.