Probabilistic Offline Policy Ranking with Approximate Bayesian Computation

Authors: Longchao Da, Porter Jenkins, Trevor Schwantes, Jeffrey Dotson, Hua Wei

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically demonstrate that POPR-EABC is adequate for evaluating policies in both discrete and continuous action spaces across various experiment environments, and facilitates probabilistic comparisons of candidate policies before deployment. We perform extensive evaluations comparing six baselines under different RL tasks covering both discrete and continuous action spaces. The results prove the effectiveness of POPR-EABC in offline policy evaluation.
Researcher Affiliation | Academia | Longchao Da (1), Porter Jenkins (2), Trevor Schwantes (2), Jeffrey Dotson (2), Hua Wei (1)*; (1) Arizona State University, (2) Brigham Young University. {longchao,hua.wei}@asu.edu, {pjenkins,jeff dotson}@cs.byu.edu, Schwantes2@gmail.com
Pseudocode | Yes | Algorithm 1: POPR-EABC Algorithm
Open Source Code | Yes | Detailed descriptions of the experiment and code can be found in the repository: https://github.com/LongchaoDa/POPR-EABC.git
Open Datasets | Yes | Then, we use POPR-EABC and baseline OPE algorithms to solve the OPR problem on widely-used complex environments with discrete or continuous action spaces in the Gym environment. The implementation of the policies is based on a public codebase (Raffin et al. 2021, https://github.com/DLR-RM/stable-baselines3). All the policies are publicly available and well-trained by various RL algorithms, including DQN, QRDQN, TRPO, PPO, A2C, and ARS (Raffin 2020, https://github.com/DLR-RM/rl-baselines3-zoo).
Dataset Splits | No | The paper does not provide the dataset split information (exact percentages, sample counts, citations to predefined splits, or a detailed splitting methodology) needed to reproduce the train/validation partitioning.
Hardware Specification | No | The paper does not provide the hardware details (exact GPU/CPU models, processor types and speeds, or memory amounts) used to run its experiments.
Software Dependencies | No | The implementation of the policies is based on a public codebase (Raffin et al. 2021, https://github.com/DLR-RM/stable-baselines3), and all the policies are publicly available and well-trained by various RL algorithms, including DQN, QRDQN, TRPO, PPO, A2C, and ARS (Raffin 2020, https://github.com/DLR-RM/rl-baselines3-zoo). The paper names these tools and frameworks but gives no explicit version numbers beyond the year in the stable-baselines3 citation.
Experiment Setup | Yes | We execute the POPR-EABC algorithm with a burn-in period of B = 10 iterations and N = 500 sampling iterations. Additionally, we set M = 5 for the number of bootstrapped samples at each iteration. We use a Beta(0.5, 0.5) prior and a Beta proposal distribution with parameters α = 4.0 and β = 1e-3.
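
The "Experiment Setup" row above reports the sampler's hyperparameters (burn-in B = 10, N = 500 sampling iterations, M = 5 bootstrapped samples, a Beta(0.5, 0.5) prior, and a Beta proposal with α = 4.0, β = 1e-3). The sketch below is a minimal, hypothetical reconstruction of an ABC-style Metropolis-Hastings loop using those values; it is not the paper's exact Algorithm 1. In particular, the `pseudo_likelihood` energy function, the independence-style proposal, and the dataset handling are placeholders and assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hyperparameters reported in the paper's experiment setup.
B = 10       # burn-in iterations
N = 500      # sampling iterations
M = 5        # bootstrapped mini-batches per iteration
PRIOR = stats.beta(0.5, 0.5)       # Beta(0.5, 0.5) prior over theta
PROPOSAL = stats.beta(4.0, 1e-3)   # Beta proposal; treated here as an independence sampler (assumption)


def pseudo_likelihood(theta, policy, batch):
    """Placeholder for POPR-EABC's energy-based pseudo-likelihood.

    The real method scores how well the candidate policy's actions match the
    logged behavior in the bootstrapped batch; a dummy energy statistic is
    used here only to keep the sketch runnable.
    """
    energy = rng.random()  # stand-in for the energy between policy and data
    return stats.beta(10 * theta + 1e-6, 10 * (1 - theta) + 1e-6).pdf(energy)


def popr_eabc(policy, dataset, burn_in=B, n_iter=N, n_boot=M):
    """Draw posterior samples of the performance parameter theta."""
    theta = float(np.clip(PRIOR.rvs(random_state=rng), 1e-6, 1 - 1e-6))
    samples = []
    for it in range(burn_in + n_iter):
        theta_new = float(np.clip(PROPOSAL.rvs(random_state=rng), 1e-6, 1 - 1e-6))
        # Bootstrap M mini-batches from the offline dataset.
        batches = [dataset[rng.integers(0, len(dataset), size=len(dataset))]
                   for _ in range(n_boot)]
        lik_new = np.mean([pseudo_likelihood(theta_new, policy, b) for b in batches])
        lik_old = np.mean([pseudo_likelihood(theta, policy, b) for b in batches])
        # Metropolis-Hastings acceptance ratio for an independence proposal.
        ratio = (lik_new * PRIOR.pdf(theta_new) * PROPOSAL.pdf(theta)) / (
            lik_old * PRIOR.pdf(theta) * PROPOSAL.pdf(theta_new) + 1e-300)
        if rng.random() < min(1.0, ratio):
            theta = theta_new
        if it >= burn_in:
            samples.append(theta)
    return np.array(samples)


# Example usage with a dummy offline dataset of logged transitions.
dataset = rng.random((1000, 4))          # placeholder (s, a, r, s') records
posterior = popr_eabc(policy=None, dataset=dataset)
print(posterior.mean(), np.quantile(posterior, [0.05, 0.95]))
```

Similarly, the "Open Datasets" and "Software Dependencies" rows point to stable-baselines3 and rl-baselines3-zoo for the candidate policies. A minimal sketch of loading one such pretrained policy and rolling it out in a Gym environment to collect episodic returns might look as follows; the checkpoint path, environment name, and the SB3 2.x / Gymnasium API are assumptions, not details taken from the paper.

```python
import gymnasium as gym
from stable_baselines3 import PPO

# Illustrative checkpoint path and environment; substitute whichever zoo
# policy (DQN, QRDQN, TRPO, PPO, A2C, or ARS) and task you want to rank.
env = gym.make("CartPole-v1")
model = PPO.load("rl-trained-agents/ppo/CartPole-v1_1/CartPole-v1.zip")

returns = []
for _ in range(10):                      # a handful of evaluation episodes
    obs, _ = env.reset()
    done, ep_return = False, 0.0
    while not done:
        action, _ = model.predict(obs, deterministic=True)
        obs, reward, terminated, truncated, _ = env.step(action)
        ep_return += float(reward)
        done = terminated or truncated
    returns.append(ep_return)
print(returns)
```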