Behavior Proximal Policy Optimization

Authors: Zifeng Zhuang, Kun Lei, Jinxin Liu, Donglin Wang, Yilang Guo

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on the D4RL benchmark empirically show this extremely succinct method outperforms state-of-the-art offline RL algorithms.
Researcher Affiliation | Academia | Zifeng Zhuang (1,2), Kun Lei (2), Jinxin Liu (2), Donglin Wang (2,3), Yilang Guo (4). 1: Zhejiang University; 2: School of Engineering, Westlake University; 3: Institute of Advanced Technology, Westlake Institute for Advanced Study; 4: School of Software Engineering, Beijing Jiaotong University.
Pseudocode | Yes | Algorithm 1 Behavior Proximal Policy Optimization (BPPO); a policy-improvement sketch follows the table.
Open Source Code | Yes | Our implementation is available at https://github.com/Dragon-Zhuang/BPPO.
Open Datasets | Yes | Extensive experiments on the D4RL benchmark (Fu et al., 2020) empirically show that BPPO outperforms state-of-the-art offline RL algorithms; a D4RL loading sketch follows the table.
Dataset Splits | No | The paper mentions using the D4RL benchmark but does not explicitly provide details on training, validation, or test splits (e.g., percentages, sample counts, or a statement that standard splits were used).
Hardware Specification | No | The paper does not report the hardware (e.g., GPU/CPU models, memory) used to run its experiments.
Software Dependencies | No | The paper states "Our method is constructed by Pytorch (Paszke et al., 2019)" but does not give version numbers for PyTorch or other software components.
Experiment Setup | Yes | Table 6 lists part of the hyperparameters for the policy improvement phase, including the initial policy learning rate, the initial clip ratio ε, and the asymmetric coefficient ω.
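Since the paper evaluates on D4RL but does not state an explicit split, the following is a minimal loading sketch using the public d4rl API (gym.make on a registered task name and d4rl.qlearning_dataset); the specific task "hopper-medium-v2", the preprocessing, and the 5% validation fraction are assumptions for illustration, not details from the paper.

```python
# Load one D4RL task and build an (assumed) manual train/validation split.
import gym
import d4rl  # importing d4rl registers the offline environments with gym
import numpy as np

env = gym.make("hopper-medium-v2")        # assumed example D4RL MuJoCo task
dataset = d4rl.qlearning_dataset(env)     # dict of numpy arrays

observations = dataset["observations"]          # shape (N, obs_dim)
actions = dataset["actions"]                    # shape (N, act_dim)
rewards = dataset["rewards"]                    # shape (N,)
next_observations = dataset["next_observations"]
terminals = dataset["terminals"]

# D4RL ships a single offline dataset per task; any validation split
# (not specified in the paper) has to be made manually.
num_transitions = observations.shape[0]
valid_fraction = 0.05                           # assumed, not from the paper
split = int(num_transitions * (1.0 - valid_fraction))
train_idx = np.arange(split)
valid_idx = np.arange(split, num_transitions)
```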
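The pseudocode and Table 6 rows reference a clipped policy-improvement step with an initial clip ratio ε and an asymmetric coefficient ω. Below is a minimal PyTorch sketch of such a step, assuming a standard PPO-style clipped surrogate in which the behavior-cloned policy plays the role of the "old" policy; the asymmetric weighting of positive versus negative advantages by ω, the clip-ratio decay schedule, and the policy objects' log_prob(obs, actions) interface are all assumptions for illustration, not the authors' exact implementation.

```python
# Sketch of a BPPO-style clipped policy-improvement step (assumptions noted above).
import torch


def bppo_policy_loss(policy, old_policy, obs, actions, advantages,
                     clip_ratio=0.25, omega=0.9):
    """Clipped-surrogate loss on a batch of offline data (hypothetical interface)."""
    with torch.no_grad():
        old_log_prob = old_policy.log_prob(obs, actions)
        # Assumed asymmetric weighting: emphasize positive advantages via omega.
        advantages = torch.where(advantages > 0,
                                 omega * advantages,
                                 (1.0 - omega) * advantages)

    log_prob = policy.log_prob(obs, actions)
    ratio = torch.exp(log_prob - old_log_prob)

    surrogate = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_ratio, 1.0 + clip_ratio) * advantages
    return -torch.min(surrogate, clipped).mean()


def improve(policy, old_policy, optimizer, batches,
            clip_ratio=0.25, clip_decay=0.96):
    """Outer loop: start from the behavior-cloned policy and decay the clip ratio."""
    for obs, actions, advantages in batches:
        loss = bppo_policy_loss(policy, old_policy, obs, actions,
                                advantages, clip_ratio)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        clip_ratio *= clip_decay  # assumed decay schedule, not from the paper
```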