BC-IRL: Learning Generalizable Reward Functions from Demonstrations

Authors: Andrew Szot, Amy Zhang, Dhruv Batra, Zsolt Kira, Franziska Meier

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We illustrate that state-of-the-art IRL algorithms, which maximize a maximum-entropy objective, learn rewards that overfit to the demonstrations. Such rewards struggle to provide meaningful rewards for states not covered by the demonstrations, a major detriment when using the reward to learn policies in new situations. We introduce BC-IRL, a new inverse reinforcement learning method that learns reward functions that generalize better when compared to maximum-entropy IRL approaches. In contrast to the Max Ent framework, which learns to maximize rewards around demonstrations, BC-IRL updates reward parameters such that the policy trained with the new reward matches the expert demonstrations better. We show that BC-IRL learns rewards that generalize better on an illustrative simple task and two continuous robotic control tasks, achieving over twice the success rate of baselines in challenging generalization settings.
Researcher Affiliation | Collaboration | Andrew Szot (1,2), Amy Zhang (1), Dhruv Batra (1,2), Zsolt Kira (2), Franziska Meier (1); 1: Meta AI, 2: Georgia Tech
Pseudocode | Yes | Algorithm 1 BC-IRL (general framework) ... Algorithm 2 BC-IRL-PPO; an illustrative sketch of this bi-level update appears after the table.
Open Source Code | No | The paper does not contain an explicit statement or link indicating that source code for the described methodology is released.
Open Datasets | No | The paper describes using "expert demonstration trajectories De" and providing "100 demonstrations" or a "single demonstration" for specific tasks, but it does not give concrete access information (link, DOI, repository, or formal citation) for a publicly available dataset of these demonstrations, nor for the environments beyond citing the simulators themselves.
Dataset Splits | No | The paper describes 'Eval (Train)' and 'Eval (Test)' evaluation settings, as well as 'Easy', 'Medium', and 'Hard' test distributions, but it does not define a separate validation split with specific percentages or counts for model tuning during training.
Hardware Specification | Yes | All experiments were run on an Intel(R) Core(TM) i9-9900X CPU @ 3.50GHz.
Software Dependencies | No | The paper mentions that "Adam Kingma & Ba (2014) was used for policy and reward optimization" but does not provide version numbers for Adam or any other software libraries or frameworks used in the implementation.
Experiment Setup | Yes | The hyperparameters for the methods from the 2D point navigation task in Section 4 are detailed in Table 9 for the no-obstacle version and Table 10 for the obstacle version of the task. ... The hyperparameters for all methods for the Reaching task are described in Table 11. ... The hyperparameters for all methods for the Trifinger reaching task are described in Table 12.
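
To make the Pseudocode entry concrete: below is a minimal sketch of the bi-level update behind Algorithm 1 (BC-IRL, general framework), written in PyTorch under several simplifying assumptions: a Gaussian policy, a single REINFORCE-style inner step standing in for the paper's PPO inner loop (Algorithm 2), and the per-step learned reward used in place of return estimates. All names here (RewardNet, PolicyNet, bc_irl_update) are illustrative; since the authors' code is not released, this is not their implementation, only an interpretation of the described update.

```python
import torch
import torch.nn as nn


class RewardNet(nn.Module):
    """Learned reward r_phi(s), evaluated on states visited by the policy."""
    def __init__(self, state_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, 1))

    def forward(self, states):
        return self.net(states).squeeze(-1)


class PolicyNet(nn.Module):
    """Gaussian policy pi_theta(a|s) with a state-independent log-std."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.mean = nn.Linear(state_dim, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def dist(self, states, params=None):
        # Allow evaluation with explicit (possibly updated) parameters so the
        # inner policy update stays differentiable w.r.t. the reward.
        if params is None:
            mean, log_std = self.mean(states), self.log_std
        else:
            mean = states @ params["mean.weight"].t() + params["mean.bias"]
            log_std = params["log_std"]
        return torch.distributions.Normal(mean, log_std.exp())


def bc_irl_update(policy, reward, reward_opt, rollout, expert, inner_lr=0.1):
    """One outer iteration: an inner policy-gradient step on the learned
    reward, then a behavioral-cloning loss on expert data backpropagated
    through that step into the reward parameters."""
    states, actions = rollout          # on-policy samples from the current policy
    exp_states, exp_actions = expert   # expert demonstration transitions

    # Inner step: REINFORCE-style update using the learned reward as the
    # per-step return (a simplification of the paper's PPO inner loop).
    # create_graph=True keeps this step differentiable w.r.t. the reward.
    log_probs = policy.dist(states).log_prob(actions).sum(-1)
    inner_loss = -(log_probs * reward(states)).mean()
    names, params = zip(*policy.named_parameters())
    grads = torch.autograd.grad(inner_loss, params, create_graph=True)
    updated = {n: p - inner_lr * g for n, p, g in zip(names, params, grads)}

    # Outer step: how well does the *updated* policy imitate the expert?
    # The gradient of this BC loss w.r.t. the reward parameters defines the
    # reward update; only the reward optimizer is stepped here.
    bc_loss = -policy.dist(exp_states, params=updated).log_prob(exp_actions).sum(-1).mean()
    reward_opt.zero_grad()
    bc_loss.backward()
    reward_opt.step()
    return bc_loss.item()
```

The point the sketch tries to capture is the contrast stated in the abstract: the inner policy update is kept differentiable, so the behavioral-cloning loss on the demonstrations can be backpropagated through it into the reward parameters, rather than the reward being shaped to assign high value to demonstration states directly. In the paper the inner loop is PPO (BC-IRL-PPO); here it is collapsed to one gradient step for brevity.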