Deep Inverse Q-learning with Constraints

Authors: Gabriel Kalweit, Maria Huegle, Moritz Werling, Joschka Boedecker

NeurIPS 2020

Reproducibility assessment (Variable, Result, LLM Response):

Research Type: Experimental. "We evaluate the resulting algorithms, called Inverse Action-value Iteration, Inverse Q-learning and Deep Inverse Q-learning, on the Objectworld benchmark, showing a speedup of up to several orders of magnitude compared to (Deep) Max-Entropy algorithms. We further apply Deep Constrained Inverse Q-learning on the task of learning autonomous lane-changes in the open-source simulator SUMO, achieving competent driving after training on data corresponding to 30 minutes of demonstrations."

Researcher Affiliation: Collaboration. Gabriel Kalweit, Neurorobotics Lab, University of Freiburg (kalweitg@cs.uni-freiburg.de); Maria Huegle, Neurorobotics Lab, University of Freiburg (hueglem@cs.uni-freiburg.de); Moritz Werling, BMW Group, Germany (Moritz.Werling@bmw.de); Joschka Boedecker, Neurorobotics Lab and BrainLinks-BrainTools, University of Freiburg (jboedeck@cs.uni-freiburg.de).

Pseudocode: Yes. Algorithm 1: Tabular Inverse Q-learning; Algorithm 2: Fixed Batch Deep Inverse Q-learning.

Open Source Code: No. The paper does not provide concrete access to its own source code for the described methodology, e.g. a specific repository link or an explicit code-release statement.

Open Datasets: Yes. "We evaluate the performance of IAVI, IQL and DIQL on the common IRL Objectworld benchmark (Figure 2a) and compare to Max-Ent IRL [26]... The Objectworld environment [16] is an N × N map, where an agent chooses between going up, down, left or right, or staying in place, per time step."

Dataset Splits: No. The paper mentions training runs and evaluation scenarios, but gives no specific details on how the data was split into training, validation and test sets (e.g. percentages or exact counts), so the data partitioning cannot be reproduced.

Hardware Specification: Yes. "Resulting expected value difference and time needed until convergence, mean and standard deviation over 5 training runs on a 3.00 GHz CPU (bottom)."

Software Dependencies: No. The paper mentions using the "open-source traffic simulator SUMO [14]" but does not provide version numbers for SUMO or for any other key software libraries or dependencies used in the implementation.

Experiment Setup: Yes. "Architectures and hyperparameters are shown in the appendix. We train on highway scenarios with a 1000 m three-lane highway and random numbers of vehicles and driver types. We trained DCIQL on 5 * 10^4 samples of the expert for 10^5 iterations."
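The inverse Q-learning algorithms named above build on a Boltzmann (softmax) model of the demonstrator's action distribution, under which the probability of an action grows exponentially with its Q-value. The following is an illustrative sketch of that action model only, not the authors' inverse algorithm; the function name and example values are assumptions for illustration.

```python
import numpy as np

def boltzmann_policy(q_values: np.ndarray) -> np.ndarray:
    """Softmax (Boltzmann) distribution over actions given Q-values."""
    z = q_values - q_values.max()  # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

# Hypothetical Q-values for five actions; the highest-valued action
# receives the largest (but not all) probability mass.
q = np.array([1.0, 0.0, 0.0, 0.0, -1.0])
pi = boltzmann_policy(q)
```

Inverse methods of this kind run the mapping in the other direction: from observed action frequencies back to the Q-values (and rewards) that would make the demonstrations Boltzmann-consistent.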
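The quoted Objectworld description specifies an N × N grid with five actions (up, down, left, right, stay). A minimal sketch of such an environment's transition function is below; the deterministic dynamics and boundary handling are assumptions for illustration, since the quoted text does not specify them.

```python
# Illustrative N x N gridworld in the spirit of Objectworld: five actions,
# moves that would leave the grid keep the agent in place (assumed behavior).
ACTIONS = {0: (-1, 0),  # up
           1: (1, 0),   # down
           2: (0, -1),  # left
           3: (0, 1),   # right
           4: (0, 0)}   # stay

def step(state, action, n):
    """Deterministic transition on an n x n grid; state is (row, col)."""
    dr, dc = ACTIONS[action]
    row = min(max(state[0] + dr, 0), n - 1)
    col = min(max(state[1] + dc, 0), n - 1)
    return (row, col)
```

Rolling out a demonstrator policy in such an environment yields the state-action trajectories that IAVI, IQL and DIQL consume as input.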