Lifelong Inverse Reinforcement Learning

Authors: Jorge Mendez, Shashank Shivkumar, Eric Eaton

NeurIPS 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluated ELIRL on two environments, chosen to allow us to create arbitrarily many tasks with distinct reward functions. This also gives us known rewards as ground truth. No previous multi-task IRL method was tested on such a large task set, nor on tasks with varying state spaces as we do.
Researcher Affiliation | Academia | Jorge A. Mendez, Shashank Shivkumar, and Eric Eaton, Department of Computer and Information Science, University of Pennsylvania, {mendezme,shashs,eeaton}@seas.upenn.edu
Pseudocode | Yes | Algorithm 1 ELIRL (k, λ, µ) (a hedged sketch of the update this algorithm implies appears after the table)
Open Source Code | No | The paper mentions 'BURLAP Java library, version 3.0' [24] as a third-party tool but does not provide any statement or link indicating that the source code for the proposed ELIRL method itself is open-source or publicly available.
Open Datasets | No | The paper describes how it generated data using 'Objectworld' and 'Highway' simulations: 'We solved the MDP for the true optimal policy, and generated simulated user trajectories following this policy.' It does not refer to using a publicly available, pre-existing dataset with access information.
Dataset Splits | No | The paper specifies 'All learners were given nt = 32 trajectories for Objectworld and nt = 256 trajectories for Highway, all of length H = 16.' This specifies the amount of demonstration data provided for learning each task but does not describe conventional train/validation/test splits of a fixed dataset.
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory specifications) used for running the experiments.
Software Dependencies | Yes | James MacGlashan. Brown-UMBC reinforcement learning and planning (BURLAP) Java library, version 3.0. Available online at http://burlap.cs.brown.edu, 2016.
Experiment Setup | Yes | All learners were given nt = 32 trajectories for Objectworld and nt = 256 trajectories for Highway, all of length H = 16. ... The agent's chosen action has a 70% probability of success and a 30% probability of a random outcome. The reward is discounted with each time step by a factor of γ = 0.9. (a hedged trajectory-generation sketch with these settings appears after the table)
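
The paper presents Algorithm 1 (ELIRL) only as pseudocode and releases no implementation. The following is a minimal Python sketch of the ELLA-style lifelong update that such an algorithm implies: each new task's single-task IRL solution (reward weights and the Hessian of the IRL objective at those weights) is encoded as a sparse combination of a shared latent basis L, and L is then refreshed in closed form. The class name, the use of scikit-learn's Lasso for the sparse-coding step, and the Cholesky trick for the Hessian weighting are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso


class LifelongIRLSketch:
    """Hedged sketch of an ELLA-style lifelong IRL update: per-task reward
    weights are approximated as theta_t ~= L @ s_t with a shared basis L
    and sparse task codes s_t. Not the authors' released code."""

    def __init__(self, d, k, lam=1e-3, mu=1e-3):
        self.d, self.k = d, k                   # feature dim d, basis size k
        self.lam, self.mu = lam, mu             # L2 penalty on L, L1 penalty on s_t
        self.L = 0.01 * np.random.randn(d, k)   # shared latent reward basis
        self.A = np.zeros((d * k, d * k))       # accumulated quadratic term
        self.b = np.zeros(d * k)                # accumulated linear term
        self.T = 0                              # tasks observed so far

    def observe_task(self, alpha, hessian):
        """alpha: reward weights from a single-task MaxEnt IRL fit (shape d);
        hessian: curvature of that task's IRL objective at alpha (d x d)."""
        # 1) Sparse code for the new task against the current basis.
        #    The Hessian weighting is folded in via a Cholesky factor H ~= R^T R.
        R = np.linalg.cholesky(hessian + 1e-8 * np.eye(self.d)).T
        s = Lasso(alpha=self.mu, fit_intercept=False).fit(R @ self.L, R @ alpha).coef_
        # 2) Accumulate the quadratic system defining the shared basis.
        self.A += np.kron(np.outer(s, s), hessian)
        self.b += np.kron(s, hessian @ alpha)
        self.T += 1
        # 3) Closed-form refresh of L (vec(L) taken in column-major order).
        reg = self.lam * np.eye(self.d * self.k)
        L_vec = np.linalg.solve(self.A / self.T + reg, self.b / self.T)
        self.L = L_vec.reshape(self.d, self.k, order="F")
        return s    # task-specific code; the reward estimate is self.L @ s
```

Note that scikit-learn's Lasso scales its squared loss by 1/(2·n_samples), so the µ used here is not on exactly the same scale as the paper's sparsity parameter.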
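
The experiment-setup row reports the simulation parameters: nt = 32 (Objectworld) or nt = 256 (Highway) demonstration trajectories of length H = 16, a 70% action-success probability with a 30% random outcome, and discount γ = 0.9. The sketch below shows how trajectories with those settings could be rolled out from a given policy; the toy grid domain, policy interface, and reward hook are placeholders, not the BURLAP Objectworld/Highway domains used in the paper.

```python
import numpy as np

GAMMA = 0.9         # discount factor reported in the paper
HORIZON = 16        # trajectory length H
SUCCESS_PROB = 0.7  # chosen action succeeds 70% of the time
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right


def step(state, action_idx, grid_size, rng):
    """Noisy dynamics: the chosen action succeeds with probability 0.7,
    otherwise a uniformly random action is applied instead."""
    if rng.random() > SUCCESS_PROB:
        action_idx = int(rng.integers(len(ACTIONS)))
    dr, dc = ACTIONS[action_idx]
    r = min(max(state[0] + dr, 0), grid_size - 1)
    c = min(max(state[1] + dc, 0), grid_size - 1)
    return (r, c)


def simulate(policy, reward_fn, grid_size=16, n_traj=32, seed=0):
    """Roll out n_traj length-H trajectories under a fixed policy and report
    the average discounted return (n_traj = 32 matches the Objectworld
    setting; 256 was used for Highway)."""
    rng = np.random.default_rng(seed)
    trajectories, returns = [], []
    for _ in range(n_traj):
        state = (int(rng.integers(grid_size)), int(rng.integers(grid_size)))
        traj, ret = [], 0.0
        for t in range(HORIZON):
            a = policy(state)
            traj.append((state, a))
            state = step(state, a, grid_size, rng)
            ret += (GAMMA ** t) * reward_fn(state)
        trajectories.append(traj)
        returns.append(ret)
    return trajectories, float(np.mean(returns))
```

For example, `simulate(policy=lambda s: 0, reward_fn=lambda s: 1.0)` rolls out 32 trajectories under a trivial always-up policy with a constant reward, which is enough to exercise the noisy dynamics and discounting.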