Bayesian Robust Optimization for Imitation Learning

Authors: Daniel Brown, Scott Niekum, Marek Petrik

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our empirical results show that BROIL provides a natural way to interpolate between return-maximizing and risk-minimizing behaviors and outperforms existing risk-sensitive and risk-neutral inverse reinforcement learning algorithms. In the next two sections we explore two case studies that highlight the performance and benefits of using BROIL for robust policy optimization. We sampled 2000 reward functions from the prior distributions over costs and computed the CVaR optimal policy with α = 0.99 for different values of λ. Figure 5 shows that both formulations of BROIL significantly outperform MaxEnt IRL and LPAL." (See the CVaR sketch after this table.)
Researcher Affiliation | Academia | Daniel S. Brown, UC Berkeley (dsbrown@berkeley.edu); Scott Niekum, University of Texas at Austin (sniekum@cs.utexas.edu); Marek Petrik, University of New Hampshire (mpetrik@cs.unh.edu)
Pseudocode | No | The paper includes mathematical formulations but does not contain structured pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | Code to reproduce the experiments is available at https://github.com/dsbrown1331/broil
Open Datasets | No | The paper describes generating samples or using a single demonstration, but does not provide concrete access information (specific link, DOI, repository name, or formal citation with authors/year) for a publicly available or open dataset.
Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or a detailed splitting methodology) needed to reproduce the data partitioning.
Hardware Specification | No | The paper mentions running experiments on "a personal laptop" but does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts.
Software Dependencies | No | The paper does not provide specific ancillary software details, such as library or solver names with version numbers.
Experiment Setup | Yes | "We sampled 2000 reward functions from the prior distributions over costs and computed the CVaR optimal policy with α = 0.99 for different values of λ. Given the single demonstration, we generated 2000 samples from the posterior P(R | D) using Bayesian IRL [46]. We used a relatively small inverse temperature parameter (β = 10)." (See the Bayesian IRL likelihood sketch after this table.)
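
The Research Type row quotes the paper's claim that BROIL interpolates between return-maximizing and risk-minimizing behaviors via a λ-weighted combination of expected return and CVaR over sampled reward functions. The sketch below is a minimal illustration of that trade-off, not the paper's implementation: it assumes the convention that CVaR_α averages the worst (1 − α) fraction of sampled returns, and the assignment of λ versus (1 − λ) to the two terms is an assumption to be checked against the paper. The function names and the synthetic return samples are hypothetical.

```python
import numpy as np

def empirical_cvar(returns, alpha=0.99):
    """Average of the worst (1 - alpha) fraction of sampled returns.

    Assumes the lower-tail convention for CVaR used in risk-averse
    policy optimization (higher return is better).
    """
    returns = np.sort(np.asarray(returns, dtype=float))   # ascending: worst outcomes first
    k = max(1, int(np.ceil((1.0 - alpha) * returns.size)))
    return returns[:k].mean()

def broil_style_objective(returns, lam, alpha=0.99):
    """Soft-robust trade-off between expected return and CVaR.

    Here lam = 0 gives the risk-neutral (expected-return) objective and
    lam = 1 gives the purely risk-averse CVaR objective; which endpoint
    the paper's lambda corresponds to is an assumption in this sketch.
    """
    return (1.0 - lam) * np.mean(returns) + lam * empirical_cvar(returns, alpha)

# Hypothetical usage: 2000 returns of a fixed policy evaluated under reward
# functions sampled from a prior/posterior, mirroring the sample count in the
# experiment setup (the synthetic Gaussian samples below are illustrative only).
rng = np.random.default_rng(0)
sampled_returns = rng.normal(loc=1.0, scale=0.5, size=2000)
for lam in (0.0, 0.5, 1.0):
    print(f"lambda={lam:.1f}  objective={broil_style_objective(sampled_returns, lam):.3f}")
```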
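The Experiment Setup row also quotes an inverse temperature β = 10 for the Bayesian IRL posterior sampling. As a rough illustration of what β controls, here is a minimal sketch of the per-state Boltzmann (softmax) demonstration likelihood commonly used in Bayesian IRL implementations; the exact likelihood in the cited Bayesian IRL method and in the paper's code may differ, computing the optimal Q-values is omitted, and `demo_log_likelihood` and `q_values` are hypothetical names.

```python
import numpy as np
from scipy.special import logsumexp  # numerically stable log-sum-exp for beta * Q

def demo_log_likelihood(demos, q_values, beta=10.0):
    """Boltzmann-rational demonstration log-likelihood:

        log P(D | R) = sum over (s, a) in D of
                       beta * Q*(s, a) - logsumexp_b( beta * Q*(s, b) )

    `q_values[s]` is assumed to hold the optimal Q-values Q*(s, .) under the
    candidate reward R (e.g. computed by value iteration, not shown here).
    Larger beta concentrates the likelihood on near-optimal demonstrated actions.
    """
    total = 0.0
    for s, a in demos:
        q = np.asarray(q_values[s], dtype=float)
        total += beta * q[a] - logsumexp(beta * q)
    return total

# Hypothetical usage: a 2-state, 2-action toy problem with one demonstrated pair.
q_values = {0: [1.0, 0.2], 1: [0.5, 0.9]}
demos = [(0, 0)]
print(demo_log_likelihood(demos, q_values, beta=10.0))
```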