BC-IRL: Learning Generalizable Reward Functions from Demonstrations
Authors: Andrew Szot, Amy Zhang, Dhruv Batra, Zsolt Kira, Franziska Meier
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We illustrate that state-of-the-art IRL algorithms, which maximize a maximum-entropy objective, learn rewards that overfit to the demonstrations. Such rewards struggle to provide meaningful reward signals for states not covered by the demonstrations, a major detriment when using the reward to learn policies in new situations. We introduce BC-IRL, a new inverse reinforcement learning method that learns reward functions that generalize better than those of maximum-entropy IRL approaches. In contrast to the MaxEnt framework, which learns to maximize rewards around demonstrations, BC-IRL updates reward parameters such that the policy trained with the new reward matches the expert demonstrations better. We show that BC-IRL learns rewards that generalize better on an illustrative simple task and two continuous robotic control tasks, achieving over twice the success rate of baselines in challenging generalization settings. |
| Researcher Affiliation | Collaboration | Andrew Szot (1,2), Amy Zhang (1), Dhruv Batra (1,2), Zsolt Kira (2), Franziska Meier (1); 1: Meta AI, 2: Georgia Tech |
| Pseudocode | Yes | Algorithm 1: BC-IRL (general framework) ... Algorithm 2: BC-IRL-PPO (a minimal sketch of the general framework follows this table) |
| Open Source Code | No | The paper does not contain an explicit statement or link indicating the release of source code for the described methodology. |
| Open Datasets | No | The paper describes using "expert demonstration trajectories De" and providing "100 demonstrations" or a "single demonstration" for specific tasks, but it does not provide concrete access information (link, DOI, repository, or formal citation) to a publicly available dataset for these demonstrations or for the environments used beyond citing the simulators themselves. |
| Dataset Splits | No | The paper describes 'Eval (Train)' and 'Eval (Test)' settings for evaluation, and also 'Easy', 'Medium', and 'Hard' test distributions, but it does not explicitly define a separate 'validation' dataset split with specific percentages or counts for model tuning during training. |
| Hardware Specification | Yes | All experiments were run on an Intel(R) Core(TM) i9-9900X CPU @ 3.50GHz. |
| Software Dependencies | No | The paper mentions that "Adam (Kingma & Ba, 2014) was used for policy and reward optimization" but does not provide specific version numbers for Adam or any other software libraries or frameworks used in the implementation. |
| Experiment Setup | Yes | The hyperparameters for the methods from the 2D point navigation task in Section 4 are detailed in Table 9 for the no obstacle version and Table 10 for the obstacle version of the task. ... The hyperparameters for all methods from the Reaching task are described in Table 11. ... The hyperparameters for all methods for the Trifinger reaching task are described in Table 12. |
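
The pseudocode row above references Algorithm 1 (the general BC-IRL framework) and Algorithm 2 (BC-IRL-PPO). Below is a minimal, hedged sketch of the bi-level update the abstract describes: an inner step improves the policy on the current learned reward, and an outer step updates the reward parameters so that the *updated* policy imitates the expert better, differentiating through the inner step. Everything here is illustrative rather than the paper's implementation: the linear policy and reward, the toy one-step dynamics, the single gradient-step "policy improvement" (the paper's Algorithm 2 uses PPO), and the stand-in expert data are all assumptions made to keep the example self-contained.

```python
# Illustrative BC-IRL-style bi-level update. Hypothetical names and toy
# dynamics throughout; the paper's inner loop is PPO, not this single step.
import torch

torch.manual_seed(0)
obs_dim, act_dim = 2, 2

# Reward parameters phi and policy parameters theta (linear models for the sketch).
phi = torch.randn(obs_dim, 1, requires_grad=True)           # reward r_phi(s) = s @ phi
theta = torch.randn(obs_dim, act_dim, requires_grad=True)   # policy mean a = s @ theta

# Stand-in expert demonstrations: states and the actions the expert took there.
expert_obs = torch.randn(64, obs_dim)
expert_act = -expert_obs  # assumed expert behavior: move toward the origin

reward_opt = torch.optim.Adam([phi], lr=1e-2)
inner_lr = 0.1

for outer_step in range(200):
    # Inner step: one differentiable policy-improvement step on the current reward.
    obs = torch.randn(64, obs_dim)            # sampled start states
    next_obs = obs + obs @ theta              # toy one-step dynamics
    policy_return = (next_obs @ phi).mean()   # learned reward of the reached states
    grad_theta = torch.autograd.grad(policy_return, theta, create_graph=True)[0]
    theta_new = theta + inner_lr * grad_theta  # keeps the graph back to phi

    # Outer step: update phi so the updated policy imitates the expert (BC loss).
    bc_loss = ((expert_obs @ theta_new - expert_act) ** 2).mean()
    reward_opt.zero_grad()
    bc_loss.backward()   # gradient flows through the inner step into phi
    reward_opt.step()

    # Commit the improved policy and detach it from this iteration's graph.
    theta = theta_new.detach().requires_grad_(True)
```

Note the contrast with maximum-entropy IRL that the abstract draws: rather than pushing reward up on demonstration states relative to policy states, the reward here is updated only through its effect on how well the re-trained policy matches the demonstrations, which is the property the paper credits for better generalization.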