Inverse Reinforcement Learning with Explicit Policy Estimates

Authors: Navyata Sanghvi, Shinnosuke Usami, Mohit Sharma, Joachim Groeger, Kris Kitani

AAAI 2021, pp. 9472-9480 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Using the theory developed in Section 4, in this section we develop hypotheses about the hierarchy of methods in various types of problem situations, and investigate each hypothesis using an example. We quantitatively compare the methods using the following metrics (lower values are better; a minimal computation sketch follows the table): Negative Log Likelihood (NLL) evaluates the likelihood of the expert path under the predicted policy πθ, and is directly related to our objective (5). Expected Value Difference (EVD) is the value difference, under the true reward, between (1) the optimal policy under the true reward and (2) the optimal policy under the output reward θ. Stochastic EVD is the value difference, under the true reward, between (1) the optimal policy under the true reward and (2) the output policy πθ. While a low Stochastic EVD may indicate a better output policy πθ, a low EVD may indicate a better output reward θ. Equivalent-Policy Invariant Comparison (EPIC) (Gleave et al. 2020) is a recently developed metric that measures the distance between two reward functions without training a policy; EPIC is shown to be invariant on an equivalence class of reward functions that always induce the same optimal policy. The EPIC metric lies in [0, 1], with lower values indicating more similar reward functions. EVD and EPIC evaluate the inferred reward θ, while NLL and Stochastic EVD evaluate the inferred policy πθ. In addition
Researcher Affiliation | Collaboration | Navyata Sanghvi (1), Shinnosuke Usami (1,2), Mohit Sharma (1), Joachim Groeger (1), Kris Kitani (1); (1) Carnegie Mellon University, (2) Sony Corporation
Pseudocode | Yes | Algorithm 1: Optimization-based Method
Open Source Code | No | The paper does not provide a specific repository link, explicit code release statement, or code in supplementary materials for the methodology described.
Open Datasets | No | The paper mentions environments such as Obstacleworld, Mountain Car, and Objectworld, but it does not provide concrete access information (specific link, DOI, repository name, or formal citation with authors/year) for publicly available or open datasets.
Dataset Splits | No | The paper discusses "low data regimes" and "high data regimes" referring to the amount of expert data, but it does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) needed to reproduce the data partitioning.
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers) needed to replicate the experiments.
Experiment Setup | No | The paper does not contain specific experimental setup details (concrete hyperparameter values, training configurations, or system-level settings) in the main text.
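
The metrics quoted in the Research Type row can be made concrete with a small amount of code. The following is a minimal sketch, not taken from the paper, of how NLL, EVD, and Stochastic EVD could be computed for a tabular MDP with a state-based reward. The transition tensor P[a, s, s'], the start-state distribution start_dist, and all function names here are assumptions made for illustration only; EPIC is omitted because its reward canonicalization procedure is specified in Gleave et al. (2020).

import numpy as np

# Hypothetical helper: optimal values and Q for transitions P[a, s, s'] and state reward r[s].
def value_iteration(P, r, gamma=0.95, tol=1e-8):
    V = np.zeros(r.shape[0])
    while True:
        Q = r[None, :] + gamma * P @ V            # Q[a, s]
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q
        V = V_new

# Hypothetical helper: values of a stochastic policy[s, a] under reward r[s].
def policy_evaluation(P, r, policy, gamma=0.95, tol=1e-8):
    V = np.zeros(r.shape[0])
    while True:
        Q = r[None, :] + gamma * P @ V            # Q[a, s]
        V_new = np.einsum('sa,as->s', policy, Q)  # average Q over the policy's action distribution
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# NLL: negative log likelihood of an expert path [(s, a), ...] under the predicted policy.
def negative_log_likelihood(policy, expert_path):
    return -sum(np.log(policy[s, a]) for s, a in expert_path)

# EVD: value gap, under the true reward, between the optimal policy for the true
# reward and the greedy optimal policy for the learned reward.
def expected_value_difference(P, r_true, r_learned, start_dist, gamma=0.95):
    V_true, _ = value_iteration(P, r_true, gamma)
    _, Q_learned = value_iteration(P, r_learned, gamma)
    greedy = np.eye(P.shape[0])[Q_learned.argmax(axis=0)]   # one-hot policy[s, a]
    V_greedy_under_true = policy_evaluation(P, r_true, greedy, gamma)
    return start_dist @ (V_true - V_greedy_under_true)

# Stochastic EVD: value gap, under the true reward, between the optimal policy for
# the true reward and the output policy pi_theta itself.
def stochastic_evd(P, r_true, policy_theta, start_dist, gamma=0.95):
    V_true, _ = value_iteration(P, r_true, gamma)
    V_theta = policy_evaluation(P, r_true, policy_theta, gamma)
    return start_dist @ (V_true - V_theta)

The sketch mirrors the distinction drawn in the quoted passage: EVD re-plans with the learned reward and scores the resulting greedy policy under the true reward, whereas Stochastic EVD scores the output policy πθ directly.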