Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Policy Gradient Bayesian Robust Optimization for Imitation Learning
Authors: Zaynah Javed, Daniel S Brown, Satvik Sharma, Jerry Zhu, Ashwin Balakrishna, Marek Petrik, Anca Dragan, Ken Goldberg
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate PG-BROIL, we consider settings where there is uncertainty over the true reward function. We first examine the setting where we have an a priori distribution over reward functions and find that PG-BROIL is able to optimize policies that effectively trade-off between expected and worst-case performance. Then, we leverage recent advances in efficient Bayesian reward inference (Brown et al., 2020a) to infer a posterior over reward functions from preferences over demonstrated trajectories. ... Results suggest that PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse and outperforms state-of-the-art imitation learning algorithms when learning from ambiguous demonstrations by hedging against uncertainty, rather than seeking to uniquely identify the demonstrator's reward function. 5. Experiments |
| Researcher Affiliation | Academia | ¹EECS Department, University of California, Berkeley; ²CS Department, University of New Hampshire. Correspondence to: Daniel Brown <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Policy Gradient BROIL |
| Open Source Code | Yes | Code and videos are available at https://sites.google.com/view/pg-broil. |
| Open Datasets | Yes | We study 3 domains: the classical CartPole benchmark (Brockman et al., 2016), a pointmass navigation task inspired by (Thananjeyan et al., 2020b) and a robotic reaching task from the DM Control Suite (Tassa et al., 2020). |
| Dataset Splits | No | No explicit mention of specific train/validation/test dataset splits (percentages, counts, or predefined citations) was found. The experimental setup describes policy training via rollouts and subsequent testing. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments are provided. |
| Software Dependencies | No | The paper mentions OpenAI Spinning Up, REINFORCE, and PPO, but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | For PG-BROIL, we set α = 0.95 and report results for the best λ (λ = 0.8). For PG-BROIL, we set α = 0.9 and report results for λ = 0.15. For PG-BROIL, we set α = 0.9 and report results for the best λ (λ = 0.3). |
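
The α and λ values in the Experiment Setup row are easier to interpret with an objective in view. The sketch below assumes a BROIL-style objective of the form λ·E[ρ] + (1 − λ)·CVaR_α[ρ], where ρ is the policy's return under each reward hypothesis, which matches the trade-off between expected and worst-case performance quoted above. The function names, the discrete posterior, and the sample numbers are illustrative assumptions and are not taken from the authors' code.

```python
def cvar(returns, probs, alpha):
    """CVaR_alpha of a discrete return distribution (higher returns are better).

    Uses the form max over sigma of: sigma - E[(sigma - X)_+] / (1 - alpha).
    For a discrete distribution the maximum is attained at one of the
    return values, so it is enough to try each of them as sigma.
    """
    best = float("-inf")
    for sigma in returns:
        shortfall = sum(p * max(sigma - r, 0.0) for r, p in zip(returns, probs))
        best = max(best, sigma - shortfall / (1.0 - alpha))
    return best


def broil_objective(returns, probs, alpha, lam):
    """Assumed blend: lam * expected return + (1 - lam) * CVaR_alpha."""
    expected = sum(p * r for r, p in zip(returns, probs))
    return lam * expected + (1.0 - lam) * cvar(returns, probs, alpha)


# Hypothetical example: the return of one policy under four equally
# likely reward hypotheses.
returns = [0.0, 1.0, 2.0, 3.0]
probs = [0.25, 0.25, 0.25, 0.25]

risk_neutral = broil_objective(returns, probs, alpha=0.95, lam=1.0)
risk_averse = broil_objective(returns, probs, alpha=0.95, lam=0.0)
print(risk_neutral, risk_averse)
```

Under this assumed form, λ = 1 scores a policy by expected return alone, λ = 0 scores it by the α-tail alone, and intermediate values such as the reported λ = 0.8, 0.15, and 0.3 sit between the two.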