Policy Gradient Bayesian Robust Optimization for Imitation Learning
Authors: Zaynah Javed, Daniel S Brown, Satvik Sharma, Jerry Zhu, Ashwin Balakrishna, Marek Petrik, Anca Dragan, Ken Goldberg
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate PG-BROIL, we consider settings where there is uncertainty over the true reward function. We first examine the setting where we have an a priori distribution over reward functions and find that PG-BROIL is able to optimize policies that effectively trade-off between expected and worst-case performance. Then, we leverage recent advances in efficient Bayesian reward inference (Brown et al., 2020a) to infer a posterior over reward functions from preferences over demonstrated trajectories. ... Results suggest that PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse and outperforms state-of-the-art imitation learning algorithms when learning from ambiguous demonstrations by hedging against uncertainty, rather than seeking to uniquely identify the demonstrator's reward function. (Section 5, Experiments; a sketch of this expected-vs-worst-case objective follows the table.) |
| Researcher Affiliation | Academia | 1EECS Department, University of California, Berkeley 2CS Department, University of New Hampshire. Correspondence to: Daniel Brown <dsbrown@berkeley.edu>. |
| Pseudocode | Yes | Algorithm 1 Policy Gradient BROIL |
| Open Source Code | Yes | Code and videos are available at https://sites.google. com/view/pg-broil. |
| Open Datasets | Yes | We study 3 domains: the classical Cart Pole benchmark (Brockman et al., 2016), a pointmass navigation task inspired by (Thananjeyan et al., 2020b) and a robotic reaching task from the DM Control Suite (Tassa et al., 2020). |
| Dataset Splits | No | No explicit mention of specific train/validation/test dataset splits (percentages, counts, or predefined citations) was found. The experimental setup describes policy training via rollouts and subsequent testing. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments are provided. |
| Software Dependencies | No | The paper mentions OpenAI Spinning Up, REINFORCE, and PPO, but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | For PG-BROIL, we set α = 0.95 and report results for the best λ (λ = 0.8). ... For PG-BROIL, we set α = 0.9 and report results for λ = 0.15. ... For PG-BROIL, we set α = 0.9 and report results for the best λ (λ = 0.3). |
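
The "Research Type" and "Experiment Setup" rows quote the trade-off between expected and worst-case performance and the α, λ hyperparameters without showing how they fit together. As a reading aid only, here is a minimal sketch assuming the standard soft-robust BROIL objective λ·E_{R∼P(R|D)}[ρ(π, R)] + (1−λ)·CVaR_α[ρ(π, R)] over a discrete posterior of sampled reward functions; the helper `broil_objective` and the numbers in the example are illustrative assumptions, not the authors' code, and the paper's Algorithm 1 gives the actual policy-gradient procedure.

```python
import numpy as np

def broil_objective(expected_returns, posterior_probs, lam, alpha):
    """Soft-robust BROIL objective over a discrete reward posterior (sketch).

    expected_returns[i]: estimated expected return of the current policy under
                         the i-th posterior reward hypothesis.
    posterior_probs[i]:  posterior probability of that hypothesis.
    Returns lam * E[rho] + (1 - lam) * CVaR_alpha[rho], where CVaR_alpha is the
    expected return over the worst (1 - alpha) fraction of posterior mass.
    """
    rho = np.asarray(expected_returns, dtype=float)
    p = np.asarray(posterior_probs, dtype=float)

    expected = float(np.dot(p, rho))

    # sigma = (1 - alpha)-quantile of the posterior return distribution (the
    # value-at-risk threshold); CVaR then follows the Rockafellar-Uryasev form
    # CVaR = max_sigma { sigma - E[(sigma - rho)_+] / (1 - alpha) }.
    order = np.argsort(rho)
    cum = np.cumsum(p[order])
    idx = min(int(np.searchsorted(cum, 1.0 - alpha)), len(rho) - 1)
    sigma = float(rho[order][idx])
    cvar = sigma - float(np.dot(p, np.maximum(sigma - rho, 0.0))) / (1.0 - alpha)

    return lam * expected + (1.0 - lam) * cvar


# Illustration with the Cart Pole setting quoted above (alpha = 0.95, lambda = 0.8),
# using made-up posterior samples purely to show how alpha and lambda enter.
rho = [10.0, 4.0, 7.5, 1.0]   # hypothetical expected returns under 4 reward samples
p = [0.4, 0.3, 0.2, 0.1]      # hypothetical posterior probabilities (sum to 1)
print(broil_objective(rho, p, lam=0.8, alpha=0.95))
```

Setting λ = 1 recovers the risk-neutral expected return and λ = 0 optimizes the CVaR term alone, which matches the family of risk-neutral to risk-averse behaviors described in the quoted evidence above.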