Balancing Sample Efficiency and Suboptimality in Inverse Reinforcement Learning
Authors: Angelo Damiani, Giorgio Manganini, Alberto Maria Metelli, Marcello Restelli
ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide a numerical scheme for the optimization, and we show its effectiveness on illustrative numerical cases. ... We run a set of experiments in Linear Quadratic Gaussian (LQG) control problem (Dorato et al., 1994) and in the Mountain Car domain (Moore, 1990). ... In Figure 1, we show the values of the maximum Wasserstein distance f(η) in (12) related to the change of the discount factor γ and the weights θ of the reward r_θ... In Figure 2, we plot the learned parameter (top row) and the average discounted return (bottom row)... The effect of using the optimized IRL reward on the sample complexity of the forward learning problem is also depicted in Figure 3. ... Figure 4 shows the forward learning results obtained by REINFORCE on the two rewards and confirms the properties discussed above. |
| Researcher Affiliation | Academia | ¹Department of Computer Science, Gran Sasso Science Institute, L'Aquila, Italy. ²Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milano, Italy. |
| Pseudocode | No | The paper describes an "iterative procedure" with equations (14a, 14b, 14c) in Section 5 titled "Optimization Algorithm", but it is not presented in a formally labeled pseudocode block or algorithm environment. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing code or a link to a code repository for the described methodology. |
| Open Datasets | Yes | We run a set of experiments in Linear Quadratic Gaussian (LQG) control problem (Dorato et al., 1994) and in the Mountain Car domain (Moore, 1990). |
| Dataset Splits | No | The paper mentions generating "N samples" and "M samples" for different phases of its approach (IRL task vs. forward RL) and refers to the Mountain Car domain, but it does not specify explicit train/validation/test dataset splits with percentages, absolute counts, or citations to predefined splits. |
| Hardware Specification | No | The paper does not specify any particular hardware used for running the experiments (e.g., GPU models, CPU models, memory specifications). |
| Software Dependencies | No | The paper mentions using REINFORCE (Williams, 1992) as a reinforcement learning algorithm, but it does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, specific library versions). |
| Experiment Setup | Yes | The Q-function feature vector is ψ(s, a) = [s², a², sa]... the reward features are set to φ(s, a) = [−s², −a², Q^{π_E}(s, a)]... the policy is parametrized linearly in the state as π_η(s) = ηs, and the reward weights θ are normalized to sum to 1. The dataset D_LSTD... has been generated starting from 40 uniformly sampled states in the interval [−1, 1] and following for H = 5 steps the expert policy, whose actions were corrupted by a white noise with standard deviation of 0.05. The dataset D_IRL... has been set to 200 randomly sampled states... N = 200. Finally, we assumed to have an infinite number of samples to solve the forward learning problem, and set M = ∞. ... we select 20 uniformly random initial states and then estimate the gradient direction in the REINFORCE (Williams, 1992) algorithm by a Monte Carlo evaluation of the reward along trajectories of different lengths (we used H = 1 with the IRL reward and H ∈ {2, 6, 10} with the real one). (A hedged code sketch of this setup follows the table.) |
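
The Experiment Setup row pins down only a few quantities: 40 start states drawn uniformly from [−1, 1], expert rollouts of H = 5 steps with action noise of standard deviation 0.05, 20 uniformly random initial states for REINFORCE, and trajectory lengths H = 1 (IRL reward) versus H ∈ {2, 6, 10} (real reward). The sketch below is a minimal Python reconstruction of that protocol, not the authors' implementation: the 1-D dynamics (`A`, `B`, `SIGMA`), the expert gain `ETA_EXPERT`, the reward weights `THETA`, the stand-in `irl_reward`, and the exploration noise `policy_std` are hypothetical placeholders that the quoted text does not specify.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Assumed 1-D LQG-like dynamics (not given in the quoted setup) ---
A, B, SIGMA = 1.0, 1.0, 0.1          # hypothetical transition parameters

def step(s, a):
    """s' = A*s + B*a + Gaussian noise (assumed dynamics)."""
    return A * s + B * a + SIGMA * rng.standard_normal()

# --- Expert policy: linear in the state, pi_eta(s) = eta * s ---
ETA_EXPERT = -0.6                    # hypothetical expert gain

# --- D_LSTD: 40 uniform states in [-1, 1], expert rollouts of H = 5 steps,
# --- actions corrupted by white noise with std 0.05 (as quoted above) ---
def generate_d_lstd(n_states=40, horizon=5, action_noise=0.05):
    data = []
    for s in rng.uniform(-1.0, 1.0, size=n_states):
        for _ in range(horizon):
            a = ETA_EXPERT * s + action_noise * rng.standard_normal()
            s_next = step(s, a)
            data.append((s, a, s_next))
            s = s_next
    return data

# --- Rewards: the "real" quadratic LQG cost and a placeholder IRL reward ---
THETA = np.array([0.5, 0.5])         # hypothetical weights, normalized to sum to 1

def real_reward(s, a):
    return -(THETA[0] * s**2 + THETA[1] * a**2)

def irl_reward(s, a):
    # Stand-in for a learned reward r_theta; the fitted feature weights
    # are not reported in the quoted text.
    return -(0.3 * s**2 + 0.2 * a**2)

# --- REINFORCE gradient estimate: 20 uniform initial states, Monte Carlo
# --- return over trajectories of horizon H (H = 1 for the IRL reward,
# --- H in {2, 6, 10} for the real one, per the quoted setup) ---
def reinforce_gradient(eta, reward_fn, horizon, n_starts=20, policy_std=0.05):
    grad = 0.0
    for s0 in rng.uniform(-1.0, 1.0, size=n_starts):
        s, ret, score = s0, 0.0, 0.0
        for _ in range(horizon):
            a = eta * s + policy_std * rng.standard_normal()
            # d/d_eta log N(a; eta*s, policy_std^2) = (a - eta*s) * s / policy_std^2
            score += (a - eta * s) * s / policy_std**2
            ret += reward_fn(s, a)
            s = step(s, a)
        grad += score * ret                 # vanilla score-function estimator
    return grad / n_starts

if __name__ == "__main__":
    d_lstd = generate_d_lstd()
    print(len(d_lstd), "transitions in D_LSTD")
    print("grad (IRL reward,  H=1):  %.3f" % reinforce_gradient(-0.3, irl_reward, horizon=1))
    print("grad (real reward, H=10): %.3f" % reinforce_gradient(-0.3, real_reward, horizon=10))
```

The contrast between the two final calls mirrors the point the quoted setup makes: with the optimized IRL reward the Monte Carlo evaluation uses a single-step horizon, whereas the real reward needs longer trajectories (H ∈ {2, 6, 10}), which is how the paper illustrates the sample-complexity benefit in its forward-learning experiments (Figures 3 and 4).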