Probabilistic Offline Policy Ranking with Approximate Bayesian Computation
Authors: Longchao Da, Porter Jenkins, Trevor Schwantes, Jeffrey Dotson, Hua Wei
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically demonstrate that POPR-EABC is adequate for evaluating policies in both discrete and continuous action spaces across various experiment environments, and facilitates probabilistic comparisons of candidate policies before deployment. We perform extensive evaluations comparing six baselines under different RL tasks covering both discrete and continuous action spaces. The results prove the effectiveness of POPR-EABC in offline policy evaluation. |
| Researcher Affiliation | Academia | Longchao Da (1), Porter Jenkins (2), Trevor Schwantes (2), Jeffrey Dotson (2), Hua Wei (1)*; (1) Arizona State University, (2) Brigham Young University; {longchao,hua.wei}@asu.edu, {pjenkins,jeff_dotson}@cs.byu.edu, Schwantes2@gmail.com |
| Pseudocode | Yes | Algorithm 1: POPR-EABC Algorithm |
| Open Source Code | Yes | Detailed descriptions of the experiment and code can be found in the repository: https://github.com/LongchaoDa/POPR-EABC.git |
| Open Datasets | Yes | Then, we use POPR-EABC and baseline OPE algorithms to solve the OPR problem on widely-used complex environments with discrete or continuous action spaces in the Gym environment. The implementation of the policies is based on a public codebase (Raffin et al. 2021): https://github.com/DLR-RM/stable-baselines3. All the policies are publicly available and well-trained by various RL algorithms, including DQN, QRDQN, TRPO, PPO, A2C, and ARS (Raffin 2020): https://github.com/DLR-RM/rl-baselines3-zoo |
| Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) needed to reproduce the data partitioning in terms of train/validation splits. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The implementation of the policies is based on a public codebase (Raffin et al. 2021): https://github.com/DLR-RM/stable-baselines3. All the policies are publicly available and well-trained by various RL algorithms, including DQN, QRDQN, TRPO, PPO, A2C, and ARS (Raffin 2020): https://github.com/DLR-RM/rl-baselines3-zoo. Specific tools and frameworks are named, but without explicit version numbers for reproducibility beyond the citation year for stable-baselines3 (a hedged example of loading such pretrained policies appears after this table). |
| Experiment Setup | Yes | We execute the POPR-EABC algorithm with a burn-in period of B = 10 iterations and N = 500 sampling iterations. Additionally, we set M = 5 for the number of bootstrapped samples at each iteration. We use a Beta(0.5, 0.5) prior, and a Beta proposal distribution with parameters α = 4.0 and β = 1e-3 (a hedged sketch of how these settings could drive an ABC-style sampler follows this table). |
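
The "Pseudocode" and "Experiment Setup" rows describe an Approximate Bayesian Computation sampler with a burn-in of B = 10, N = 500 sampling iterations, M = 5 bootstrapped samples per iteration, a Beta(0.5, 0.5) prior, and a Beta(4.0, 1e-3) proposal. The following is a minimal sketch of how such hyperparameters could drive a generic ABC-MCMC loop; the energy statistic, the soft ABC kernel, the independence-proposal acceptance rule, and all function names are illustrative assumptions, not the paper's Algorithm 1.

```python
# A minimal ABC-MCMC sketch using the hyperparameters from the "Experiment
# Setup" row (B = 10, N = 500, M = 5, Beta(0.5, 0.5) prior, Beta(4.0, 1e-3)
# proposal). The energy statistic and acceptance rule are assumptions for
# illustration only, not the paper's exact Algorithm 1 (POPR-EABC).
import numpy as np
from scipy.stats import beta as beta_dist


def abc_mcmc_sketch(candidate_returns, behavior_returns,
                    burn_in=10, n_iters=500, n_bootstrap=5,
                    prior=(0.5, 0.5), proposal=(4.0, 1e-3),
                    kernel_scale=0.1, seed=0):
    """Sample a latent 'policy quality' score theta in (0, 1) for a
    candidate policy, given its observed returns and behavior data."""
    rng = np.random.default_rng(seed)
    a0, b0 = prior      # Beta prior on theta
    ap, bp = proposal   # fixed Beta independence proposal

    def energy(theta):
        # Distance between the theta-weighted candidate mean return and
        # M bootstrap resamples of the behavior returns (assumed statistic).
        dists = []
        for _ in range(n_bootstrap):
            boot = rng.choice(behavior_returns, size=len(behavior_returns),
                              replace=True)
            dists.append(abs(theta * np.mean(candidate_returns) - np.mean(boot)))
        return float(np.mean(dists))

    def log_target(theta):
        # Soft ABC pseudo-likelihood exp(-energy / kernel_scale) times prior.
        return beta_dist.logpdf(theta, a0, b0) - energy(theta) / kernel_scale

    theta = float(np.clip(rng.beta(a0, b0), 1e-6, 1 - 1e-6))
    samples = []
    for t in range(burn_in + n_iters):
        theta_new = float(np.clip(rng.beta(ap, bp), 1e-6, 1 - 1e-6))
        log_accept = (log_target(theta_new) - log_target(theta)
                      + beta_dist.logpdf(theta, ap, bp)
                      - beta_dist.logpdf(theta_new, ap, bp))
        if np.log(rng.uniform()) < log_accept:
            theta = theta_new
        if t >= burn_in:
            samples.append(theta)
    return np.array(samples)
```

Policy ranking would then compare posterior samples across candidates, e.g. estimating P(theta_A > theta_B) from paired draws.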
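
The "Open Datasets" and "Software Dependencies" rows point to pretrained Gym policies from stable-baselines3 and rl-baselines3-zoo. The sketch below shows how such a policy could be loaded and rolled out to collect per-episode returns for offline ranking; it assumes a recent stable-baselines3 release built on the Gymnasium API, and the checkpoint path, algorithm class, and environment id are placeholders rather than values taken from the paper.

```python
# Hedged sketch: loading a pretrained stable-baselines3 policy and collecting
# episode returns in a Gymnasium environment. The checkpoint path, the
# algorithm class (PPO), and the environment id are placeholder assumptions.
import gymnasium as gym
import numpy as np
from stable_baselines3 import PPO


def collect_returns(model_path="ppo_cartpole.zip", env_id="CartPole-v1",
                    n_episodes=20, seed=0):
    env = gym.make(env_id)
    model = PPO.load(model_path)  # any SB3 algorithm class works the same way
    returns = []
    for ep in range(n_episodes):
        obs, _ = env.reset(seed=seed + ep)
        done, total = False, 0.0
        while not done:
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, terminated, truncated, _ = env.step(action)
            total += float(reward)
            done = terminated or truncated
        returns.append(total)
    env.close()
    return np.array(returns)
```

Returns collected this way for each candidate policy could then be fed to the ABC sketch above to obtain posterior quality estimates for ranking.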