Lifelong Inverse Reinforcement Learning
Authors: Jorge Mendez, Shashank Shivkumar, Eric Eaton
NeurIPS 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluated ELIRL on two environments, chosen to allow us to create arbitrarily many tasks with distinct reward functions. This also gives us known rewards as ground truth. No previous multi-task IRL method was tested on such a large task set, nor on tasks with varying state spaces as we do. |
| Researcher Affiliation | Academia | Jorge A. Mendez, Shashank Shivkumar, and Eric Eaton Department of Computer and Information Science University of Pennsylvania {mendezme,shashs,eeaton}@seas.upenn.edu |
| Pseudocode | Yes | Algorithm 1 ELIRL (k, λ, µ) (a hedged sketch of this style of update appears after the table) |
| Open Source Code | No | The paper mentions 'BURLAP Java library, version 3.0' [24] as a third-party tool but does not provide any statement or link indicating that the source code for the proposed ELIRL method itself is open-source or publicly available. |
| Open Datasets | No | The paper describes how it generated data using 'Objectworld' and 'Highway' simulations: 'We solved the MDP for the true optimal policy, and generated simulated user trajectories following this policy.' It does not refer to using a publicly available, pre-existing dataset with access information. |
| Dataset Splits | No | The paper specifies 'All learners were given nt = 32 trajectories for Objectworld and nt = 256 trajectories for Highway, all of length H = 16.' This specifies the amount of demonstration data provided for learning each task but does not describe conventional train/validation/test splits of a fixed dataset. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory specifications) used for running the experiments. |
| Software Dependencies | Yes | James MacGlashan. Brown-UMBC reinforcement learning and planning (BURLAP) Java library, version 3.0. Available online at http://burlap.cs.brown.edu, 2016. |
| Experiment Setup | Yes | All learners were given nt = 32 trajectories for Objectworld and nt = 256 trajectories for Highway, all of length H = 16. ... The agent's chosen action has a 70% probability of success and a 30% probability of a random outcome. The reward is discounted with each time step by a factor of γ = 0.9. (These settings are collected in the configuration sketch after the table.) |
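The paper's Algorithm 1 is only named here, not reproduced. The block below is a minimal sketch of the kind of ELLA-style factored update that ELIRL (k, λ, µ) describes: each task's reward weights are modeled as a sparse combination of a shared latent basis, θ_t ≈ L s_t. It assumes an external MaxEnt IRL subroutine that returns a per-task weight estimate `alpha_t` and the Hessian `H_t` of the IRL objective at that estimate; the class and method names are illustrative, not the authors' code.

```python
import numpy as np
from sklearn.linear_model import Lasso  # used to solve the L1-regularized code step


class ELIRLSketch:
    """Hedged sketch of an ELLA-style lifelong IRL update (not the authors' code).

    d: reward-feature dimension, k: number of latent components,
    lam: ridge penalty on the shared basis L, mu: sparsity penalty on s_t.
    """

    def __init__(self, d, k, lam, mu, seed=0):
        rng = np.random.default_rng(seed)
        self.d, self.k, self.lam, self.mu = d, k, lam, mu
        self.L = 0.01 * rng.standard_normal((d, k))  # shared latent reward basis
        self.A = np.zeros((d * k, d * k))            # running statistics for the L update
        self.b = np.zeros(d * k)
        self.T = 0                                   # number of tasks seen so far

    def update_task(self, alpha_t, H_t):
        """alpha_t: per-task reward-weight estimate (d,), H_t: Hessian at alpha_t (d, d)."""
        # 1) Sparse code: minimize (alpha - L s)' H (alpha - L s) + mu * ||s||_1.
        #    Re-weighting by a square root of H_t turns this into a standard lasso.
        sqrt_H = np.linalg.cholesky(H_t + 1e-6 * np.eye(self.d)).T
        lasso = Lasso(alpha=self.mu / (2 * self.d), fit_intercept=False)
        lasso.fit(sqrt_H @ self.L, sqrt_H @ alpha_t)
        s_t = lasso.coef_

        # 2) Accumulate sufficient statistics and refit the shared basis L
        #    (ridge-regularized least squares over all tasks seen so far).
        self.A += np.kron(np.outer(s_t, s_t), H_t)
        self.b += np.kron(s_t, H_t @ alpha_t)
        self.T += 1
        L_vec = np.linalg.solve(self.A / self.T + self.lam * np.eye(self.d * self.k),
                                self.b / self.T)
        self.L = L_vec.reshape(self.d, self.k, order="F")  # column-major vec of L

        # 3) Reconstructed reward weights for this task.
        return self.L @ s_t
```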
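For reference, the experiment settings reported above can be collected into a small configuration sketch. The field names are assumptions for illustration, as is the reading of the "30% probability of a random outcome" as the effect of a uniformly random action.

```python
import random
from dataclasses import dataclass


@dataclass
class ExperimentConfig:
    """Experiment settings reported in the paper; field names are illustrative."""
    n_trajectories_objectworld: int = 32   # demonstrations per Objectworld task
    n_trajectories_highway: int = 256      # demonstrations per Highway task
    horizon: int = 16                      # trajectory length H
    action_success_prob: float = 0.7       # chosen action succeeds with 70% probability
    gamma: float = 0.9                     # per-step reward discount factor


def step_with_noise(state, action, actions, transition, success_prob=0.7, rng=random):
    """Apply the reported stochastic dynamics: the chosen action succeeds with
    probability `success_prob`; otherwise a random outcome occurs (modeled here,
    as an assumption, by executing a uniformly random action).
    `transition(state, action)` is an assumed deterministic environment step."""
    if rng.random() < success_prob:
        return transition(state, action)
    return transition(state, rng.choice(actions))
```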