Imitation Learning in Discounted Linear MDPs without exploration assumptions

Authors: Luca Viano, Stratis Skoulakis, Volkan Cevher

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Numerical experiments with linear function approximation show that ILARL outperforms other commonly used algorithms. "6. Empirical evaluation: We numerically verify the main theoretical insights derived in the previous sections. (i) We aim to verify that, for a general stochastic expert, the efficiency in terms of expert trajectories improves upon behavioural cloning. (ii) ILARL is more efficient in terms of MDP trajectories than PPIL (Viano et al., 2022), which has worse theoretical guarantees, and than popular algorithms that are widely used in practice but do not enjoy theoretical guarantees: GAIL (Ho et al., 2016), AIRL (Fu et al., 2018), REIRL (Boularias et al., 2011), and IQLearn (Garg et al., 2021). The experiments are run in a continuous-state MDP explained in Appendix G. Expert trajectory efficiency with stochastic expert: For the first claim, we use a stochastic expert that follows, with equal probability, either the action taken by a deterministic expert previously trained with LSVI-UCB or an action sampled uniformly at random. We collect τE trajectories with this policy. From Figure 1, we observe that all imitation learning algorithms we tried have a final performance improving over behavioural cloning for the case τE = 1, while only REIRL and ILARL do so for τE = 2. In both cases, ILARL achieves the highest return, which even matches the expert performance. MDP trajectory efficiency: For the second claim, we can see in Figure 1 that ILARL is the most efficient algorithm in terms of MDP trajectories for both values of τE." (A hedged sketch of this stochastic-expert data collection appears after the table.)
Researcher Affiliation | Academia | Luca Viano¹, Stratis Skoulakis¹, Volkan Cevher¹. ¹LIONS, EPFL, Lausanne, Switzerland. Correspondence to: Luca Viano <luca.viano@epfl.ch>.
Pseudocode | Yes | Algorithm 1: On-policy MDP-E with unknown transitions and adversarial costs. Algorithm 2: Infinite Horizon Linear MDP with adversarial losses. Algorithm 3: Imitation Learning via Adversarial Reinforcement Learning (ILARL) for Infinite Horizon Linear MDPs. Algorithm 4: Best Response Imitation learninG (BRIG). (A generic adversarial imitation-learning loop, distinct from the paper's Algorithm 3, is sketched after the table.)
Open Source Code | No | The paper does not provide any explicit statement about open-sourcing the code for its methodology, nor does it include a link to a code repository.
Open Datasets | No | The paper describes custom environments used for experiments ('continuous gridworld', 'linear bandits problem') in Appendix G but does not provide access information (link, citation, or repository) for a publicly available or open dataset.
Dataset Splits | No | The paper does not specify exact train/validation/test split percentages or sample counts, nor does it refer to predefined splits with citations for reproducibility.
Hardware Specification | No | The paper does not specify any hardware details, such as GPU/CPU models, memory, or cloud resources, used for running the experiments.
Software Dependencies | No | "For IQLearn, we also collect 5 trajectories to perform each update on the Q-function, and we again use η = 1 and 0.005 as the stepsize for the Q-function weights. For PPIL, we use batches of 5 trajectories, 20 gradient updates between each batch collection, η = 1, and 0.005 as the stepsize for the Q-function weights. For GAIL and AIRL, we use the default hyperparameters in https://github.com/Khrylx/PyTorch-RL, but we obtained better performance with a larger batch size of 6144 states, and we use linear models rather than neural networks. For REIRL, we used the implementation in (Viano et al., 2021), but again we increased the batch size to 6144 states to achieve better performance." No specific version numbers for software libraries (e.g., a PyTorch version) are provided.
Experiment Setup | Yes | "For the experiments in Figures 1 and 2, we used η = 1, τ = 5, and β = 8. For IQLearn, we also collect 5 trajectories to perform each update on the Q-function, and we again use η = 1 and 0.005 as the stepsize for the Q-function weights. For PPIL, we use batches of 5 trajectories, 20 gradient updates between each batch collection, η = 1, and 0.005 as the stepsize for the Q-function weights. For GAIL and AIRL, we use the default hyperparameters in https://github.com/Khrylx/PyTorch-RL, but we obtained better performance with a larger batch size of 6144 states, and we use linear models rather than neural networks. For REIRL, we used the implementation in (Viano et al., 2021), but again we increased the batch size to 6144 states to achieve better performance." (These reported values are gathered into a hedged configuration sketch after the table.)
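
The evaluation's stochastic expert mixes a previously trained deterministic expert (from LSVI-UCB) with uniform random actions, each chosen with probability 1/2, and collects τE trajectories. The following is a minimal sketch of that data-collection step, assuming hypothetical placeholders `env` (a gym-like environment), `deterministic_expert`, and `n_actions`; it is not the authors' code.

```python
# Minimal sketch of the stochastic expert used in the experiments (assumptions:
# `env`, `deterministic_expert`, `n_actions`, and `horizon` are hypothetical).
import numpy as np

def stochastic_expert_action(state, deterministic_expert, n_actions, rng):
    """With probability 1/2 return the trained expert's action, else a uniform random action."""
    if rng.random() < 0.5:
        return deterministic_expert(state)
    return int(rng.integers(n_actions))

def collect_expert_trajectories(env, deterministic_expert, n_actions, tau_E, horizon, rng):
    """Collect tau_E expert trajectories (the paper uses tau_E = 1 or 2)."""
    trajectories = []
    for _ in range(tau_E):
        state = env.reset()            # assumes a gym-like API returning the initial state
        traj = []
        for _ in range(horizon):
            action = stochastic_expert_action(state, deterministic_expert, n_actions, rng)
            next_state, reward, done = env.step(action)
            traj.append((state, action))
            state = next_state
            if done:
                break
        trajectories.append(traj)
    return trajectories
```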
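Algorithm 3 (ILARL) casts imitation learning as adversarial reinforcement learning with linear function approximation. The sketch below shows only the generic cost-player / policy-player alternation that such methods share, using linear cost weights over features φ(s, a); it is not the paper's ILARL update, and `features`, `rollout`, and `rl_policy_update` are hypothetical placeholders.

```python
# Generic adversarial imitation-learning loop with linear costs (an illustrative
# sketch, NOT the paper's Algorithm 3; all function arguments are hypothetical).
import numpy as np

def feature_expectation(trajectories, features, d):
    """Empirical feature expectation of phi(s, a) in R^d over a set of trajectories."""
    mu, n = np.zeros(d), 0
    for traj in trajectories:
        for s, a in traj:
            mu += features(s, a)
            n += 1
    return mu / max(n, 1)

def adversarial_il(expert_trajs, features, d, rollout, rl_policy_update,
                   n_iters=100, eta=1.0):
    """Alternate a linear cost update with an RL policy update against the current cost."""
    w = np.zeros(d)        # linear cost weights
    policy = None          # opaque policy object handled by the RL subroutine
    mu_E = feature_expectation(expert_trajs, features, d)
    for _ in range(n_iters):
        learner_trajs = rollout(policy)                    # sample MDP trajectories with the current policy
        mu_pi = feature_expectation(learner_trajs, features, d)
        # Cost player: ascend <w, mu_pi - mu_E>, i.e. raise costs where the learner visits more than the expert.
        w += eta * (mu_pi - mu_E)
        w /= max(np.linalg.norm(w), 1.0)                   # keep cost weights bounded
        # Policy player: one RL update minimizing the current linear cost.
        policy = rl_policy_update(policy, cost=lambda s, a: w @ features(s, a))
    return policy
```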
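For reference, the hyperparameters quoted in the Software Dependencies and Experiment Setup rows can be summarized in a plain configuration dictionary. The numeric values are copied from the quoted text; the key names and structure are illustrative assumptions, not the authors' configuration format.

```python
# Hedged summary of the reported hyperparameters; values come from the quoted
# setup, while the dictionary layout and key names are assumptions.
EXPERIMENT_CONFIG = {
    "ilarl":    {"eta": 1.0, "tau": 5, "beta": 8},
    "iqlearn":  {"trajectories_per_update": 5, "eta": 1.0, "q_weights_stepsize": 0.005},
    "ppil":     {"batch_trajectories": 5, "gradient_updates_per_batch": 20,
                 "eta": 1.0, "q_weights_stepsize": 0.005},
    "gail_airl": {"base": "https://github.com/Khrylx/PyTorch-RL defaults",
                  "batch_size_states": 6144, "model": "linear (instead of neural networks)"},
    "reirl":    {"base": "implementation from Viano et al. (2021)", "batch_size_states": 6144},
}
```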