Policy Gradient Bayesian Robust Optimization for Imitation Learning

Authors: Zaynah Javed, Daniel S Brown, Satvik Sharma, Jerry Zhu, Ashwin Balakrishna, Marek Petrik, Anca Dragan, Ken Goldberg

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "To evaluate PG-BROIL, we consider settings where there is uncertainty over the true reward function. We first examine the setting where we have an a priori distribution over reward functions and find that PG-BROIL is able to optimize policies that effectively trade off between expected and worst-case performance. Then, we leverage recent advances in efficient Bayesian reward inference (Brown et al., 2020a) to infer a posterior over reward functions from preferences over demonstrated trajectories. ... Results suggest that PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse and outperforms state-of-the-art imitation learning algorithms when learning from ambiguous demonstrations by hedging against uncertainty, rather than seeking to uniquely identify the demonstrator's reward function."
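The excerpt above describes PG-BROIL's core idea: optimize a soft-robust objective that blends the posterior-mean return with the conditional value at risk (CVaR) of return over a posterior of reward hypotheses. The sketch below illustrates that objective for a discrete posterior; it is a minimal illustration under assumed inputs (the function name soft_robust_objective, the per-hypothesis returns, and the posterior probabilities are placeholders), not the authors' implementation.

```python
import numpy as np

def soft_robust_objective(returns, probs, alpha, lam):
    """Soft-robust objective over a discrete reward posterior:
        lam * E[rho] + (1 - lam) * CVaR_alpha[rho],
    where returns[i] is the policy's expected return under reward hypothesis i
    and probs[i] is that hypothesis's posterior probability."""
    returns = np.asarray(returns, dtype=float)
    probs = np.asarray(probs, dtype=float)
    expected = float(probs @ returns)
    # CVaR via the Rockafellar-Uryasev variational form:
    #   CVaR_alpha = max_sigma  sigma - E[(sigma - rho)_+] / (1 - alpha).
    # For a discrete posterior the maximizer is one of the support points.
    cvar = max(
        sigma - float(probs @ np.maximum(sigma - returns, 0.0)) / (1.0 - alpha)
        for sigma in returns
    )
    return lam * expected + (1.0 - lam) * cvar
```

Higher α concentrates the CVaR term on a smaller worst-case tail of the reward posterior, while λ interpolates between risk-neutral (λ = 1) and fully risk-averse (λ = 0) behavior.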
Researcher Affiliation | Academia | "1 EECS Department, University of California, Berkeley; 2 CS Department, University of New Hampshire. Correspondence to: Daniel Brown <dsbrown@berkeley.edu>."
Pseudocode | Yes | "Algorithm 1: Policy Gradient BROIL"
Open Source Code | Yes | "Code and videos are available at https://sites.google.com/view/pg-broil."
Open Datasets | Yes | "We study 3 domains: the classical Cart Pole benchmark (Brockman et al., 2016), a pointmass navigation task inspired by (Thananjeyan et al., 2020b), and a robotic reaching task from the DM Control Suite (Tassa et al., 2020)."
Dataset Splits | No | No explicit mention of train/validation/test dataset splits (percentages, counts, or citations to predefined splits) was found. The experimental setup describes policy training via rollouts and subsequent testing.
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used to run the experiments are provided.
Software Dependencies | No | The paper mentions OpenAI Spinning Up, REINFORCE, and PPO, but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | "For PG-BROIL, we set α = 0.95 and report results for the best λ (λ = 0.8)." "For PG-BROIL, we set α = 0.9 and report results for λ = 0.15." "For PG-BROIL, we set α = 0.9 and report results for the best λ (λ = 0.3)."
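For reference, the reported (α, λ) pairs can be dropped into the objective sketch above. The toy posterior values below are hypothetical, and no claim is made about which pair corresponds to which benchmark.

```python
# Hypothetical usage of the soft_robust_objective sketch with the reported settings.
returns = [10.0, 4.0, -2.0]   # assumed expected return under each reward hypothesis
probs = [0.5, 0.3, 0.2]       # assumed posterior probabilities
for alpha, lam in [(0.95, 0.8), (0.9, 0.15), (0.9, 0.3)]:
    print(alpha, lam, soft_robust_objective(returns, probs, alpha, lam))
```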