Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Policy Gradient Bayesian Robust Optimization for Imitation Learning

Authors: Zaynah Javed, Daniel S. Brown, Satvik Sharma, Jerry Zhu, Ashwin Balakrishna, Marek Petrik, Anca Dragan, Ken Goldberg

ICML 2021 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To evaluate PG-BROIL, we consider settings where there is uncertainty over the true reward function. We first examine the setting where we have an a priori distribution over reward functions and find that PG-BROIL is able to optimize policies that effectively trade off between expected and worst-case performance. Then, we leverage recent advances in efficient Bayesian reward inference (Brown et al., 2020a) to infer a posterior over reward functions from preferences over demonstrated trajectories. ... Results suggest that PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse and outperforms state-of-the-art imitation learning algorithms when learning from ambiguous demonstrations by hedging against uncertainty, rather than seeking to uniquely identify the demonstrator's reward function.
Researcher Affiliation | Academia | 1EECS Department, University of California, Berkeley 2CS Department, University of New Hampshire. Correspondence to: Daniel Brown <EMAIL>.
Pseudocode | Yes | Algorithm 1 Policy Gradient BROIL
Open Source Code | Yes | Code and videos are available at https://sites.google.com/view/pg-broil.
Open Datasets | Yes | We study 3 domains: the classical Cart Pole benchmark (Brockman et al., 2016), a pointmass navigation task inspired by (Thananjeyan et al., 2020b), and a robotic reaching task from the DM Control Suite (Tassa et al., 2020).
Dataset Splits | No | No explicit mention of specific train/validation/test dataset splits (percentages, counts, or predefined citations) was found. The experimental setup describes policy training via rollouts and subsequent testing.
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments are provided.
Software Dependencies | No | The paper mentions OpenAI Spinning Up, REINFORCE, and PPO, but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | For PG-BROIL, we set α = 0.95 and report results for the best λ (λ = 0.8). For PG-BROIL, we set α = 0.9 and report results for λ = 0.15. For PG-BROIL, we set α = 0.9 and report results for the best λ (λ = 0.3).
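For context on the α and λ hyperparameters quoted above: BROIL's risk-sensitive objective blends expected return with the conditional value at risk (CVaR) at level α over the reward-function posterior, with λ weighting the expected-return term. A minimal illustrative sketch for a discrete posterior, not the authors' implementation (the function name and discrete setup are ours; CVaR is computed via the standard Rockafellar–Uryasev form):

```python
import numpy as np

def broil_objective(returns, probs, alpha=0.95, lam=0.8):
    """Blend of expected return and CVaR over a discrete posterior
    of reward hypotheses.

    returns: policy return under each sampled reward function
    probs:   posterior probability of each reward function
    alpha:   CVaR confidence level (higher = focus on a smaller worst-case tail)
    lam:     weight on expected return vs. tail (CVaR) performance
    """
    expected = float(np.dot(probs, returns))
    # CVaR_alpha(R) = max_sigma  sigma - E[(sigma - R)_+] / (1 - alpha).
    # For a discrete distribution the maximizer lies at a support point,
    # so it suffices to evaluate the candidates at each return value.
    candidates = [s - np.dot(probs, np.maximum(s - returns, 0.0)) / (1.0 - alpha)
                  for s in returns]
    cvar = float(max(candidates))
    return lam * expected + (1.0 - lam) * cvar
```

Setting λ = 1 recovers the risk-neutral expected return, while λ = 0 optimizes only the worst-case tail; the λ values quoted in the table (0.15–0.8) sit between these extremes.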