Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Inverse Q-Learning Done Right: Offline Imitation Learning in $Q^\pi$-Realizable MDPs
Authors: Antoine Moulin, Gergely Neu, Luca Viano
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical evaluations on standard benchmarks demonstrate that the neural net implementation of SPOIL is superior to behavior cloning and competitive with state-of-the-art algorithms. |
| Researcher Affiliation | Academia | Antoine Moulin (Universitat Pompeu Fabra), Gergely Neu (Universitat Pompeu Fabra), Luca Viano (EPFL) |
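| Pseudocode | Yes | Algorithm 1: SPOIL with linear FA. Input: number of expert trajectories $\tau_E$, learning rate $\eta$, number of iterations $K$. Initialize: $\theta_0 = 0$, uniform policy $\pi_0$. For $k = 1, 2, \dots, K$: (1) $\pi_k(a \mid x) \propto \pi_{k-1}(a \mid x)\, e^{\eta \langle \phi(x,a), \theta_{k-1} \rangle}$; (2) $\hat{g}_k = \tau_E^{-1} \sum_{i=1}^{\tau_E} \big( \phi(X_E^i, A_E^i) - \phi(X_E^i, \pi_k) \big)$; (3) $\theta_k = \arg\max_{\theta : \lVert \theta \rVert \le B_\theta} \langle \theta, \hat{g}_k \rangle = B_\theta\, \hat{g}_k / \lVert \hat{g}_k \rVert$. Output: $\pi_{\text{out}} = \pi_I$, where $I \sim \mathcal{U}([K])$. Algorithm 2: SPOIL with general FA. Input: number of expert trajectories $\tau_E$, learning rate $\eta$, number of iterations $K$. Initialize: $Q_0 = 0$, uniform policy $\pi_0$. For $k = 1, 2, \dots, K$: (1) $\pi_k(a \mid x) \propto \pi_{k-1}(a \mid x)\, e^{\eta Q_{k-1}(x,a)}$; (2) $Q_k \in \arg\max_{Q \in \mathcal{Q}} \hat{L}(\pi_k, Q)$. Output: $\pi_{\text{out}} = \pi_I$, where $I \sim \mathcal{U}([K])$. |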
| Open Source Code | Yes | Code is available at: https://github.com/antoine-moulin/spoil. |
| Open Datasets | Yes | We run the general function approximation version of our algorithm in continuous-states environments from the gym library (Towers et al., 2025). In particular, we consider the environments CartPole-v1, Acrobot-v1 and LunarLander-v2 where the expert is trained via Soft DQN. We use the expert data provided in the code base of Garg et al. (2021). |
| Dataset Splits | No | The paper mentions subsampling trajectories to make the task more challenging ("the trajectories are subsampled each 20 steps in CartPole-v1, Acrobot-v1 and each 5 in LunarLander-v2"), but does not specify explicit train/test/validation splits for the data. It refers to using 'expert data' without detailing how that data itself is partitioned for evaluation. |
| Hardware Specification | No | The paper does not provide specific hardware details (like GPU/CPU models or memory). It only states in the NeurIPS Paper Checklist justification that "Our experiments are small scale and can be run on a laptop in within 1/2 days.", which is too general. |
| Software Dependencies | No | The implementations are built using PyTorch (Paszke et al., 2019). |
| Experiment Setup | Yes | For the experiments in Figure 2, algorithms are implemented using a shared neural network architecture consisting of 3 layers with 64 neurons per layer. This architecture matches the one used for experiments in the same environments by Garg et al. (2021). For behavioral cloning, we employ a separate three-layer multilayer perceptron with 128 neurons per layer. Implementations of IQ-Learn and P2IL utilize their original hyperparameter configurations as reported in their respective publications. All networks are optimized using the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 5e-3 and default momentum parameters (β1 = 0.9, β2 = 0.999). The implementations are built using PyTorch (Paszke et al., 2019). For algorithms with a primal-dual structure (i.e., IQ-Learn, P2IL, and SPOIL), the policy update is performed using a Soft DQN-style update (cf. Haarnoja et al., 2017) with a fixed temperature parameter. |
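
To make the loop in Algorithm 1 (the Pseudocode row above) concrete, here is a minimal pure-Python sketch. It is not the authors' implementation, which is in the linked repository. It assumes a finite action set, a user-supplied feature map `phi(x, a)`, expert data given as a flat list of (state, action) pairs, and a Euclidean norm ball in step 3. The names `spoil_linear` and `policy_probs` are hypothetical. Because the update in step 1 is multiplicative and $\pi_0$ is uniform, the sketch represents $\pi_k$ through the running sum of past $\theta$ iterates instead of storing a policy table.

```python
import math
import random


def policy_probs(phi, x, actions, theta_sum, eta):
    """pi(a | x) proportional to exp(eta * <phi(x, a), theta_sum>)."""
    logits = [eta * sum(f * t for f, t in zip(phi(x, a), theta_sum))
              for a in actions]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]


def spoil_linear(expert_pairs, phi, actions, dim, eta, num_iters, b_theta, seed=0):
    """Sketch of Algorithm 1; returns a function mapping a state to action probabilities."""
    rng = random.Random(seed)
    theta = [0.0] * dim       # theta_0 = 0
    theta_sum = [0.0] * dim   # sum of theta_0 .. theta_{k-1}; all zeros means uniform policy
    iterates = []
    for _ in range(num_iters):
        # Step 1: pi_k is proportional to pi_{k-1} * exp(eta * <phi, theta_{k-1}>);
        # unrolled, pi_k depends only on the sum of the past theta iterates.
        theta_sum = [s + t for s, t in zip(theta_sum, theta)]
        iterates.append(list(theta_sum))
        # Step 2: average of expert features minus expected features under pi_k.
        g = [0.0] * dim
        for x, a_exp in expert_pairs:
            probs = policy_probs(phi, x, actions, theta_sum, eta)
            expected = [0.0] * dim
            for p, a in zip(probs, actions):
                for d, f in enumerate(phi(x, a)):
                    expected[d] += p * f
            for d, f in enumerate(phi(x, a_exp)):
                g[d] += (f - expected[d]) / len(expert_pairs)
        # Step 3: maximizer of <theta, g> over a Euclidean ball of radius b_theta
        # (assumption: the norm constraint is the 2-norm).
        norm = math.sqrt(sum(v * v for v in g))
        theta = [b_theta * v / norm for v in g] if norm > 0 else [0.0] * dim
    # Output: the policy at an index drawn uniformly from the K iterates.
    chosen = rng.choice(iterates)
    return lambda x: policy_probs(phi, x, actions, chosen, eta)


# Toy usage (made-up data): one state, two actions, the expert always picks action 1.
def one_hot_phi(x, a):
    return [1.0 if a == i else 0.0 for i in range(2)]


pi = spoil_linear([(0, 1)] * 5, one_hot_phi, [0, 1], dim=2,
                  eta=0.5, num_iters=20, b_theta=1.0)
probs = pi(0)
```

In the toy usage, every iterate after the first puts more probability on the expert's action, so the sampled output policy is never worse than uniform on this example. Algorithm 2 keeps the same outer loop but replaces steps 2 and 3 with a maximization of $\hat{L}(\pi_k, Q)$ over a function class, which this sketch does not cover.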