Balancing Sample Efficiency and Suboptimality in Inverse Reinforcement Learning
Authors: Angelo Damiani, Giorgio Manganini, Alberto Maria Metelli, Marcello Restelli
ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide a numerical scheme for the optimization, and we show its effectiveness on illustrative numerical cases. ... We run a set of experiments in Linear Quadratic Gaussian (LQG) control problem (Dorato et al., 1994) and in the Mountain Car domain (Moore, 1990). ... In Figure 1, we show the values of the maximum Wasserstein distance f(η) in (12) related to the change of the discount factor γ and the weights θ of the reward r_θ... In Figure 2, we plot the learned parameter (top row) and the average discounted return (bottom row)... The effect of using the optimized IRL reward on the sample complexity of the forward learning problem is also depicted in Figure 3. ... Figure 4 shows the forward learning results obtained by REINFORCE on the two rewards and confirms the properties discussed above. |
| Researcher Affiliation | Academia | ¹Department of Computer Science, Gran Sasso Science Institute, L'Aquila, Italy. ²Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milano, Italy. |
| Pseudocode | No | The paper describes an "iterative procedure" with equations (14a, 14b, 14c) in Section 5 titled "Optimization Algorithm", but it is not presented in a formally labeled pseudocode block or algorithm environment. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing code or a link to a code repository for the described methodology. |
| Open Datasets | Yes | We run a set of experiments in Linear Quadratic Gaussian (LQG) control problem (Dorato et al., 1994) and in the Mountain Car domain (Moore, 1990). |
| Dataset Splits | No | The paper mentions generating "N samples" and "M samples" for different phases of its approach (IRL task vs. forward RL) and refers to the Mountain Car domain, but it does not specify explicit train/validation/test dataset splits with percentages, absolute counts, or citations to predefined splits. |
| Hardware Specification | No | The paper does not specify any particular hardware used for running the experiments (e.g., GPU models, CPU models, memory specifications). |
| Software Dependencies | No | The paper mentions using REINFORCE (Williams, 1992) as a reinforcement learning algorithm, but it does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, specific library versions). |
| Experiment Setup | Yes | The Q-function feature vector is ψ(s, a) = [s², a², sa]... the reward features are set to φ(s, a) = [−s², −a², Q^{π_E}(s, a)]... the policy is parametrized linearly in the state as π_η(s) = ηs, and the reward weights θ are normalized to sum to 1. The dataset D_LSTD... has been generated starting from 40 uniformly sampled states in the interval [−1, 1] and following for H = 5 steps the expert policy, whose actions were corrupted by a white noise with standard deviation of 0.05. The dataset D_IRL... has been set to 200 randomly sampled states... N = 200. Finally, we assumed to have an infinite number of samples to solve the forward learning problem, and set M = ∞. ... we select 20 uniformly random initial states and then estimate the gradient direction in the REINFORCE (Williams, 1992) algorithm by a Monte Carlo evaluation of the reward along trajectories of different lengths (we used H = 1 with the IRL reward and H ∈ {2, 6, 10} with the real one). (A hedged code sketch of this setup follows the table.) |
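
The Experiment Setup row pins down only a few quantities: 40 start states drawn uniformly from [−1, 1], expert rollouts of H = 5 steps with action noise of standard deviation 0.05, 20 uniformly random initial states for REINFORCE, and trajectory lengths H = 1 (IRL reward) versus H ∈ {2, 6, 10} (real reward). The sketch below is a minimal Python reconstruction of that protocol, not the authors' implementation: the 1-D dynamics (`A`, `B`, `SIGMA`), the expert gain `ETA_EXPERT`, the reward weights `THETA`, the stand-in `irl_reward`, and the exploration noise `policy_std` are hypothetical placeholders that the quoted text does not specify.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Assumed 1-D LQG-like dynamics (not given in the quoted setup) ---
A, B, SIGMA = 1.0, 1.0, 0.1          # hypothetical transition parameters

def step(s, a):
    """s' = A*s + B*a + Gaussian noise (assumed dynamics)."""
    return A * s + B * a + SIGMA * rng.standard_normal()

# --- Expert policy: linear in the state, pi_eta(s) = eta * s ---
ETA_EXPERT = -0.6                    # hypothetical expert gain

# --- D_LSTD: 40 uniform states in [-1, 1], expert rollouts of H = 5 steps,
# --- actions corrupted by white noise with std 0.05 (as quoted above) ---
def generate_d_lstd(n_states=40, horizon=5, action_noise=0.05):
    data = []
    for s in rng.uniform(-1.0, 1.0, size=n_states):
        for _ in range(horizon):
            a = ETA_EXPERT * s + action_noise * rng.standard_normal()
            s_next = step(s, a)
            data.append((s, a, s_next))
            s = s_next
    return data

# --- Rewards: the "real" quadratic LQG cost and a placeholder IRL reward ---
THETA = np.array([0.5, 0.5])         # hypothetical weights, normalized to sum to 1

def real_reward(s, a):
    return -(THETA[0] * s**2 + THETA[1] * a**2)

def irl_reward(s, a):
    # Stand-in for a learned reward r_theta; the fitted feature weights
    # are not reported in the quoted text.
    return -(0.3 * s**2 + 0.2 * a**2)

# --- REINFORCE gradient estimate: 20 uniform initial states, Monte Carlo
# --- return over trajectories of horizon H (H = 1 for the IRL reward,
# --- H in {2, 6, 10} for the real one, per the quoted setup) ---
def reinforce_gradient(eta, reward_fn, horizon, n_starts=20, policy_std=0.05):
    grad = 0.0
    for s0 in rng.uniform(-1.0, 1.0, size=n_starts):
        s, ret, score = s0, 0.0, 0.0
        for _ in range(horizon):
            a = eta * s + policy_std * rng.standard_normal()
            # d/d_eta log N(a; eta*s, policy_std^2) = (a - eta*s) * s / policy_std^2
            score += (a - eta * s) * s / policy_std**2
            ret += reward_fn(s, a)
            s = step(s, a)
        grad += score * ret                 # vanilla score-function estimator
    return grad / n_starts

if __name__ == "__main__":
    d_lstd = generate_d_lstd()
    print(len(d_lstd), "transitions in D_LSTD")
    print("grad (IRL reward,  H=1):  %.3f" % reinforce_gradient(-0.3, irl_reward, horizon=1))
    print("grad (real reward, H=10): %.3f" % reinforce_gradient(-0.3, real_reward, horizon=10))
```

The contrast between the two final calls mirrors the point the quoted setup makes: with the optimized IRL reward the Monte Carlo evaluation uses a single-step horizon, whereas the real reward needs longer trajectories (H ∈ {2, 6, 10}), which is how the paper illustrates the sample-complexity benefit in its forward-learning experiments (Figures 3 and 4).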