Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
DPAIL: Training Diffusion Policy for Adversarial Imitation Learning without Policy Optimization
Authors: Yunseon Choi, Minchan Jeong, Soobin Um, Kee-Eung Kim
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through comprehensive quantitative and qualitative evaluations against various baselines, we demonstrate that our method not only captures diverse behaviors but also remains robust as the number of behavior modes increases. 5 Experiments: In this section, we evaluate our method across navigation and control tasks, including Maze2d and MuJoCo environments. We begin with quantitative results that demonstrate its effectiveness at modeling multi-modal expert demonstrations. Next, we provide qualitative trajectory visualization to illustrate its ability to reproduce diverse behaviors. We then investigate how performance varies with the size of the expert dataset and the number of behavior modes. Finally, we wrap up with the analysis on the effects of the trajectory horizon H and the number of diffusion sampling steps N. Table 1: Normalized score (Score) and entropy (Entropy) for MuJoCo and Maze2d tasks. Each experiment is conducted using 5 different random seeds, and we collect 50 episodes for each seed. We report the scores as mean ± standard error. |
| Researcher Affiliation | Academia | Yunseon Choi1 Minchan Jeong1 Soobin Um1,2 Kee-Eung Kim1 1Kim Jaechul Graduate School of AI, KAIST 2Department of AI, Kookmin University EMAIL |
| Pseudocode | Yes | Algorithm 1 DPAIL. Input: expert trajectories $D_E = \{\tau_n\}_{n=1}^{N_E}$. Randomly initialize $p_{\theta_0}$ and set $p_{\theta_{\text{old}}} \leftarrow p_{\theta_0}$; for $k \in [0, \dots, K]$ do ... Algorithm 2: Action execution. Algorithm 3: Sampling. Algorithm 4: Adversarial Soft Advantage Fitting (ASAF). |
| Open Source Code | Yes | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: Appendix D |
| Open Datasets | Yes | For Maze2d environments, we utilize the D4RL [11] dataset to collect demonstrations consisting of trajectories from initial positions to goal positions. Specifically, we use 15 episodes for maze2d-medium-v1 and 30 episodes for maze2d-large-v1. Footnote 1: https://github.com/Farama-Foundation/D4RL |
| Dataset Splits | Yes | For MuJoCo environments, we pre-train M expert policies using SAC, where each policy corresponds to one of M behavior modes. We then sample K sets of expert demonstrations using these pre-trained policies, with each set consisting of 10 trajectories in MuJoCo. For Maze2d environments, we utilize the D4RL [11] dataset to collect demonstrations consisting of trajectories from initial positions to goal positions. Specifically, we use 15 episodes for maze2d-medium-v1 and 30 episodes for maze2d-large-v1. Each experiment is conducted using 5 different random seeds, and we collect 50 episodes for each seed. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types with speeds, memory amounts) used for running its experiments in the main text or appendix. |
| Software Dependencies | No | We use PPO [27] to train policies and GAE(λ) to compute advantage in GAIL, DiffAIL, DRAIL and InfoGAIL. The corresponding hyperparameters for PPO are provided in Table 3. At each k-th iteration, we perform m-steps rollout in the environment. Both Diffusion and DPAIL utilize the same U-Net architecture with residual blocks consisting of temporal convolution and group normalization, following [17] (footnote 2). We use N = 50 diffusion steps in both Diffusion and DPAIL for all tasks. Additionally, we normalize the state values before feeding them into the network. For GAIL, DiffAIL, DRAIL, and ASAF, we use a multi-layer perceptron (MLP) with two hidden layers of size [64, 64] for the Gaussian policy. We also normalize the state values before feeding them into the policy network. The discriminator in GAIL is an MLP with two hidden layers of size [100, 100]. The discriminator architectures of both DiffAIL and DRAIL are based on an MLP U-Net structure based on the official repository (footnote 3), and N = 50 diffusion steps. |
| Experiment Setup | Yes | D Implementation Details. Policy gradient method: We use PPO [27] to train policies and GAE(λ) to compute advantage in GAIL, DiffAIL, DRAIL and InfoGAIL. The corresponding hyperparameters for PPO are provided in Table 3. At each k-th iteration, we perform m-steps rollout in the environment. The corresponding hyperparameter settings for each algorithm are provided in Table 2. Table 2: Hyperparameters used for baselines across various environments. Table 3: PPO training hyperparameters used for each task. |
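
The Research Type excerpt above notes that scores come from 5 random seeds with 50 episodes each and are reported as a mean with a standard error. The sketch below shows one common way such a summary could be computed. It is an illustration only: the excerpt does not say whether the standard error is taken across seeds or across all episodes, so averaging within each seed first is an assumption, and the function names `mean_and_standard_error` and `aggregate` are hypothetical, not taken from the paper or its code.

```python
import math
import statistics


def mean_and_standard_error(per_seed_scores):
    """Return (mean, standard error of the mean) over per-seed scores."""
    n = len(per_seed_scores)
    if n < 2:
        raise ValueError("need at least two seeds to estimate a standard error")
    mean = statistics.fmean(per_seed_scores)
    # Sample standard deviation (n - 1 denominator) divided by sqrt(n).
    sem = statistics.stdev(per_seed_scores) / math.sqrt(n)
    return mean, sem


def aggregate(episode_scores_by_seed):
    """Average the episodes within each seed, then summarize across seeds.

    ASSUMPTION: the seed is treated as the unit of replication; the source
    does not state which aggregation the authors used.
    """
    per_seed = [statistics.fmean(episodes) for episodes in episode_scores_by_seed]
    return mean_and_standard_error(per_seed)


if __name__ == "__main__":
    # Toy data: 3 seeds with 2 episodes each (the paper uses 5 seeds x 50 episodes).
    toy_scores = [[1.0, 3.0], [2.0, 4.0], [3.0, 5.0]]
    mean, sem = aggregate(toy_scores)
    print(f"{mean:.3f} +/- {sem:.3f}")
```

Treating each seed as one sample gives a more conservative error bar than pooling all episodes, because episodes from the same seed share a trained policy and are not independent.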