Bayesian Robust Optimization for Imitation Learning
Authors: Daniel Brown, Scott Niekum, Marek Petrik
NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical results show that BROIL provides a natural way to interpolate between return-maximizing and risk-minimizing behaviors and outperforms existing risk-sensitive and risk-neutral inverse reinforcement learning algorithms. In the next two sections we explore two case studies that highlight the performance and benefits of using BROIL for robust policy optimization. We sampled 2000 reward functions from the prior distributions over costs and computed the CVaR optimal policy with α = 0.99 for different values of λ. Figure 5 shows that both formulations of BROIL significantly outperform MaxEnt IRL and LPAL. (A sketch of this CVaR policy optimization appears after the table.) |
| Researcher Affiliation | Academia | Daniel S. Brown, UC Berkeley (dsbrown@berkeley.edu); Scott Niekum, University of Texas at Austin (sniekum@cs.utexas.edu); Marek Petrik, University of New Hampshire (mpetrik@cs.unh.edu) |
| Pseudocode | No | The paper includes mathematical formulations but does not contain structured pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | Code to reproduce experiments is available at https://github.com/dsbrown1331/broil |
| Open Datasets | No | The paper describes generating samples or using a single demonstration, but does not provide concrete access information (specific link, DOI, repository name, formal citation with authors/year) for a publicly available or open dataset. |
| Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) needed to reproduce the data partitioning. |
| Hardware Specification | No | The paper mentions running experiments on "a personal laptop" but does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts. |
| Software Dependencies | No | The paper does not provide specific ancillary software details, such as library or solver names with version numbers. |
| Experiment Setup | Yes | We sampled 2000 reward functions from the prior distributions over costs and computed the CVaR optimal policy with α = 0.99 for different values of λ. Given the single demonstration, we generated 2000 samples from the posterior P(R | D) using Bayesian IRL [46]. We used a relatively small inverse temperature parameter (β = 10). (A sketch of this Bayesian IRL sampling step follows the CVaR sketch below.) |
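
To make the CVaR policy optimization quoted in the table concrete, the snippet below is a minimal sketch, not the authors' released code (see the GitHub link above for that). It solves a BROIL-style linear program in its return formulation: maximize λ times the posterior-mean return plus (1 − λ) times the CVaR of return at level α, over discounted state-action occupancy measures. The toy MDP, the posterior reward samples `R`, and all variable names (`n_states`, `P`, `p0`, `lam`) are illustrative assumptions; only α = 0.99 and the 2000-sample count come from the paper.

```python
# Minimal sketch (illustrative MDP and names; not the authors' implementation) of a
# BROIL-style linear program in the return formulation:
#   maximize  lam * E_i[ r_i^T u ] + (1 - lam) * CVaR_alpha_i[ r_i^T u ]
# over discounted state-action occupancy measures u, given posterior reward samples r_i.
import numpy as np
import cvxpy as cp

n_states, n_actions, gamma = 3, 2, 0.95
rng = np.random.default_rng(0)

# Toy transition kernel P[s, a, s'] and uniform initial state distribution p0.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
p0 = np.ones(n_states) / n_states

# Posterior reward samples over state-action pairs (stand-ins for Bayesian IRL output).
n_samples = 2000
R = rng.normal(size=(n_samples, n_states * n_actions))
p = np.ones(n_samples) / n_samples          # posterior sample weights

alpha, lam = 0.99, 0.5                      # CVaR level and return/risk tradeoff

u = cp.Variable(n_states * n_actions, nonneg=True)   # occupancy measure
sigma = cp.Variable()                                 # CVaR auxiliary variable
z = cp.Variable(n_samples, nonneg=True)               # CVaR slack variables

returns = R @ u                                       # return under each posterior sample
objective = cp.Maximize(
    lam * (p @ returns)
    + (1 - lam) * (sigma - (1.0 / (1.0 - alpha)) * (p @ z))
)

constraints = [z >= sigma - returns]
# Bellman flow constraints: sum_a u(s,a) - gamma * sum_{s',a'} P[s',a',s] u(s',a') = p0(s).
for s in range(n_states):
    inflow = sum(P[sp, ap, s] * u[sp * n_actions + ap]
                 for sp in range(n_states) for ap in range(n_actions))
    outflow = sum(u[s * n_actions + a] for a in range(n_actions))
    constraints.append(outflow - gamma * inflow == p0[s])

cp.Problem(objective, constraints).solve()

# Recover a stochastic policy from the optimal occupancy measure; sweeping lam in [0, 1]
# interpolates between CVaR-optimal (lam = 0) and risk-neutral (lam = 1) behavior.
u_opt = u.value.reshape(n_states, n_actions)
policy = u_opt / u_opt.sum(axis=1, keepdims=True)
print(policy)
```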
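
The experiment-setup row also quotes the Bayesian IRL step that produced the 2000 posterior samples with inverse temperature β = 10. Below is a hedged sketch of that step under simplifying assumptions: a toy MDP, a uniform (improper) prior over rewards, a single short demonstration, and a Gaussian random-walk Metropolis-Hastings proposal. None of these choices are from the paper except the sample count and β; the actual implementation is in the linked repository.

```python
# Hedged sketch of Bayesian IRL posterior sampling (toy MDP, uniform prior, illustrative
# demonstration); only the 2000-sample count and beta = 10 are taken from the paper.
import numpy as np
from scipy.special import logsumexp

n_states, n_actions, gamma, beta = 3, 2, 0.95, 10.0
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
demo = [(0, 1), (1, 0), (2, 1)]                                    # (state, action) pairs

def q_values(r, n_iters=200):
    """Optimal Q-values for a state-reward vector r, via value iteration."""
    V = np.zeros(n_states)
    for _ in range(n_iters):
        Q = r[:, None] + gamma * (P @ V)   # Q[s, a] = r[s] + gamma * E[V(s')]
        V = Q.max(axis=1)
    return Q

def log_likelihood(r):
    """Softmax demonstration likelihood with inverse temperature beta."""
    Q = q_values(r)
    logZ = logsumexp(beta * Q, axis=1)
    return sum(beta * Q[s, a] - logZ[s] for s, a in demo)

# Metropolis-Hastings random walk over reward vectors (uniform prior, so the
# acceptance ratio reduces to the likelihood ratio).
samples, r = [], rng.normal(size=n_states)
ll = log_likelihood(r)
for _ in range(2000):
    r_new = r + 0.1 * rng.normal(size=n_states)
    ll_new = log_likelihood(r_new)
    if np.log(rng.uniform()) < ll_new - ll:
        r, ll = r_new, ll_new
    samples.append(r.copy())

posterior = np.array(samples)      # 2000 posterior reward samples (cf. the setup row above)
print(posterior.mean(axis=0))
```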