Behavior Proximal Policy Optimization
Authors: Zifeng Zhuang, Kun Lei, Jinxin Liu, Donglin Wang, Yilang Guo
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on the D4RL benchmark empirically show this extremely succinct method outperforms state-of-the-art offline RL algorithms. |
| Researcher Affiliation | Academia | Zifeng Zhuang¹², Kun Lei², Jinxin Liu², Donglin Wang²³, Yilang Guo⁴. ¹Zhejiang University; ²School of Engineering, Westlake University; ³Institute of Advanced Technology, Westlake Institute for Advanced Study; ⁴School of Software Engineering, Beijing Jiaotong University. |
| Pseudocode | Yes | Algorithm 1 Behavior Proximal Policy Optimization (BPPO); a hedged sketch of the corresponding policy-improvement loss appears after this table. |
| Open Source Code | Yes | Our implementation is available at https://github.com/Dragon-Zhuang/BPPO. |
| Open Datasets | Yes | Extensive experiments on the D4RL benchmark (Fu et al., 2020) empirically show that BPPO outperforms state-of-the-art offline RL algorithms. A minimal example of loading a D4RL dataset is sketched after this table. |
| Dataset Splits | No | The paper mentions using the D4RL benchmark but does not explicitly provide specific details on training, validation, or test dataset splits (e.g., percentages, sample counts, or explicit statements of standard splits used). |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions 'Our method is constructed by Pytorch (Paszke et al., 2019)' but does not provide specific version numbers for PyTorch or other ancillary software components. |
| Experiment Setup | Yes | Table 6: The selections of part of hyperparameters during policy improvement phase. This table provides specific hyperparameter values such as the initial policy learning rate, initial clip ratio ϵ, and asymmetric coefficient ω; an illustrative grouping of these hyperparameters is sketched below. |
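
As a rough illustration of the pseudocode referenced above (Algorithm 1), here is a minimal sketch of the PPO-style clipped surrogate loss that BPPO optimizes during its policy improvement phase, where the "old" policy is initialized from behavior cloning. The function name and tensor arguments are illustrative assumptions, not the authors' implementation:

```python
import torch

def bppo_policy_loss(log_prob, old_log_prob, advantages, clip_ratio=0.25):
    """PPO-style clipped surrogate objective (illustrative sketch).

    log_prob / old_log_prob: log pi(a|s) under the current policy and the
    frozen previous policy (in BPPO, initialized via behavior cloning).
    advantages: estimated advantages for the sampled state-action pairs.
    """
    ratio = torch.exp(log_prob - old_log_prob)
    clipped = torch.clamp(ratio, 1.0 - clip_ratio, 1.0 + clip_ratio)
    # Take the pessimistic minimum of the two surrogates and negate it,
    # so minimizing this loss maximizes the clipped objective.
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```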
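Since the experiments use the open D4RL benchmark, a dataset can be loaded with the publicly available `d4rl` package; the task name `hopper-medium-v2` below is an illustrative choice, not a claim about which tasks the paper evaluates:

```python
import gym
import d4rl  # importing d4rl registers its environments with gym

# 'hopper-medium-v2' is one illustrative task from the D4RL suite.
env = gym.make('hopper-medium-v2')
data = d4rl.qlearning_dataset(env)  # dict of observations, actions, rewards, terminals
print(data['observations'].shape, data['actions'].shape)
```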
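Finally, the hyperparameters named in Table 6 might be collected in a configuration like the following; the numeric values are placeholders for illustration, not the paper's reported settings:

```python
# Placeholder values only -- consult Table 6 of the paper for the actual settings.
bppo_hyperparameters = {
    "policy_lr": 1e-4,    # initial policy learning rate (placeholder)
    "clip_ratio": 0.25,   # initial clip ratio, epsilon (placeholder)
    "omega": 0.9,         # asymmetric coefficient, omega (placeholder)
}
```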