Inverse Reinforcement Learning through Policy Gradient Minimization

Authors: Matteo Pirotta, Marcello Restelli

AAAI 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present an empirical evaluation of the proposed approach on a multidimensional version of the Linear-Quadratic Regulator (LQR), both in the case where the parameters of the expert's policy are known and in the (more realistic) case where the parameters of the expert's policy need to be inferred from the expert's demonstrations. Finally, the algorithm is compared against the state of the art on the mountain car domain, where the expert's policy is unknown.
Researcher Affiliation | Academia | Matteo Pirotta and Marcello Restelli, Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Piazza Leonardo da Vinci 32, I-20133 Milan, Italy, {matteo.pirotta, marcello.restelli}@polimi.it
Pseudocode | No | The paper describes its algorithms mathematically and in prose but does not include a distinct pseudocode block or a section explicitly labeled 'Algorithm'. (A hedged sketch of the gradient-minimization idea is given after the table.)
Open Source Code | No | The paper does not contain any statement about making its source code publicly available, nor does it provide any links to a code repository.
Open Datasets | Yes | This section is devoted to the empirical analysis of the proposed algorithms. The first domain, a linear quadratic regulator, is used to illustrate the main characteristics of the proposed approach, while the mountain car domain is used to compare it against the most related approaches. The environments are cited to (Sutton et al. 1999) and (Pirotta, Parisi, and Restelli 2015).
Dataset Splits | No | The paper mentions generating "5 different datasets" and varying the number of samples (e.g., "10, 100, and 1,000 trajectories"), but it does not provide specific train/validation/test splits (e.g., percentages, per-split sample counts, or named predefined splits) that would allow the data partitioning to be reproduced.
Hardware Specification | No | The paper does not provide any specific hardware details, such as GPU or CPU models, processor types, or memory specifications, used for running its experiments.
Software Dependencies | No | The paper mentions the NLopt library (http://ab-initio.mit.edu/nlopt) and algorithms such as REINFORCE, GPOMDP, and LSPI, but it does not provide version numbers for any of these software dependencies.
Experiment Setup | Yes | We consider three different mean parametrizations: linear in the state (i.e., the optimal one), with radial basis functions, and polynomial of degree 3. and We have imposed a maximum number of function evaluations of 500 for the convex optimization algorithm. and She selects a random action with probability 0.1. and We define the expert's policy as a Gibbs policy with linear approximation of the Q-function; a first-degree polynomial over the state space is replicated for each action. and evenly-spaced, hand-tuned 7×7 RBFs. (See the policy and feature sketch after the table.)
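
Since the paper gives no pseudocode, the following is a minimal sketch of the core idea named in the title: inverse reinforcement learning by policy-gradient minimization. The reward is assumed linear in features, the expert's REINFORCE-style policy gradient is estimated from demonstrations, and the reward weights are chosen to make that gradient vanish. The function names, the simplex constraint on the weights, and the use of scipy.optimize are illustrative assumptions, not the authors' implementation.

```python
# Sketch only: IRL by minimizing the norm of the expert's policy gradient,
# assuming a reward linear in features, r(s, a) = w . phi(s, a).
import numpy as np
from scipy.optimize import minimize


def reinforce_feature_gradient(trajectories, score_fn, reward_features, gamma=0.99):
    """Estimate the policy gradient once per reward feature.

    trajectories: list of expert episodes, each a list of (state, action) pairs.
    score_fn(s, a): gradient of log pi_theta(a|s) w.r.t. the policy parameters.
    reward_features(s, a): feature vector phi(s, a) of the linear reward.
    Returns a matrix G (n_policy_params x n_reward_features) such that the
    policy gradient under reward weights w is G @ w.
    """
    grads = []
    for episode in trajectories:
        score_sum = sum(score_fn(s, a) for s, a in episode)            # sum of log-policy gradients
        feat_return = sum((gamma ** t) * reward_features(s, a)
                          for t, (s, a) in enumerate(episode))          # discounted feature "return"
        grads.append(np.outer(score_sum, feat_return))
    return np.mean(grads, axis=0)


def recover_reward_weights(trajectories, score_fn, reward_features, n_features):
    """Find reward weights w minimizing ||G w||^2 (the expert should be a stationary point)."""
    G = reinforce_feature_gradient(trajectories, score_fn, reward_features)
    objective = lambda w: float(np.sum((G @ w) ** 2))
    # Constrain the weights to sum to 1 to rule out the trivial solution w = 0;
    # this normalization is an assumption, the paper's exact constraint may differ.
    cons = {"type": "eq", "fun": lambda w: np.sum(w) - 1.0}
    res = minimize(objective, np.full(n_features, 1.0 / n_features), constraints=[cons])
    return res.x
```

The gradient estimator above is the plain (baseline-free) REINFORCE form; the paper also mentions GPOMDP-style estimation, which would change only how G is computed, not the outer minimization.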
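
The experiment-setup row mentions a Gibbs policy whose Q-function is linear in first-degree polynomial state features replicated per action, and an evenly-spaced 7×7 grid of RBFs over the mountain car state space. The sketch below shows one plausible reading of those two feature constructions; the exact centers, bandwidth, temperature, and how each feature map is used in the paper are assumptions.

```python
# Sketch only: the two feature maps mentioned in the setup, under assumed
# hyperparameters (mountain car state = (position, velocity)).
import numpy as np


def poly_features(state):
    """First-degree polynomial of the 2-D state: [1, position, velocity]."""
    return np.array([1.0, state[0], state[1]])


def rbf_features(state, n=7, bandwidth=0.1,
                 lows=(-1.2, -0.07), highs=(0.6, 0.07)):
    """Evenly-spaced n x n Gaussian RBFs over the normalized 2-D state space."""
    s = (np.asarray(state) - np.asarray(lows)) / (np.asarray(highs) - np.asarray(lows))
    centers = np.stack(np.meshgrid(np.linspace(0, 1, n),
                                   np.linspace(0, 1, n)), axis=-1).reshape(-1, 2)
    return np.exp(-np.sum((s - centers) ** 2, axis=1) / (2 * bandwidth ** 2))


def gibbs_policy(theta, state, n_actions=3, temperature=1.0):
    """Gibbs policy: pi(a|s) proportional to exp(theta_a . phi(s) / temperature),
    with the state features replicated (separate weights) for each action."""
    phi = poly_features(state)                     # length-3 feature vector
    prefs = theta.reshape(n_actions, -1) @ phi / temperature
    prefs -= prefs.max()                           # numerical stability before exp
    p = np.exp(prefs)
    return p / p.sum()
```

For example, with three mountain car actions and polynomial features, theta has 3 × 3 = 9 entries, and gibbs_policy returns the action probabilities from which an epsilon-greedy-like random action (probability 0.1, per the quote) could additionally be mixed in.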