Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Imitation Beyond Expectation Using Pluralistic Stochastic Dominance

Authors: Ali Farajzadeh, Danyal Saeed, Syed M Abbas, Rushit Shah, Aadirupa Saha, Brian D. Ziebart

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type Experimental We demonstrate the benefits of pluralistic stochastic dominance (PSD) for imitation in both theory and practice. ... The main contributions of this paper are three-fold: ... First, PSD improves upon imitation learning methods ... Second, PSD-based imitation provides an alternative justification ... Finally, we introduce novel evaluation metrics ... 4 Experiments
Researcher Affiliation Academia Ali Farajzadeh, Danyal Saeed, Syed M. Abbas, Rushit Shah, Aadirupa Saha, Brian D. Ziebart Department of Computer Science University of Illinois Chicago Chicago, IL 60607 EMAIL
Pseudocode Yes Algorithm 1 Policy model update. Input: $M$ imitator samples $\{\xi_i\}$, $N$ demonstrations $\{\tilde{\xi}_j\}$, policy/parameters $\pi_\phi$, and learning rate $\eta$. Output: Updated policy/parameters $\pi_\phi$. 1: Set $P_\pi(\xi_i) = \frac{1}{M}$. 2: Solve $\mathrm{OT}_{\mathrm{subdom}}$ given $P_\pi(\xi_i)$ and $P_{\tilde{\pi}}(\tilde{\xi}_j)$ (Def. 3.2). 3: Construct training signals $\{a_i\}$ from OT solution. 4: Update model parameters $\phi$ using variables $a$ from (10) or (11): $\phi \leftarrow \phi + \eta \sum_{i=1}^{M} a_i \nabla_\phi \log P_\pi(\xi_i)$
Open Source Code Yes Code available at https://github.com/Ali199776/PSD.
Open Datasets No Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines? Answer: [Yes] Justification: The work does not utilize human subjects and uses simulated environments. Public code was used that is referenced. Public datasets were not used. Our code will be released along with the demonstrations it was tested on.
Dataset Splits Yes For Lava world experiments, for reporting the results comparing different approaches with different amounts of training data and frequency of imitator Pareto dominance, we have randomly split the whole set containing 24 demonstrations into two equal splits of training and testing 100 times, and the represented results are averaged.
Hardware Specification Yes For PointBot, baseline experiments (GAIL, RAIL, InfoGAIL, BC) were run on several different personal computers and the slowest one took less than 5 hours to converge (e.g. on a laptop with 2.6GHz 10-core CPU, 32GB RAM). ... Experiments for PSD were run on an in-house server with GPU acceleration (equipped with two Nvidia GTX 1080 Ti GPUs), taking close to 1 hour and 1.5 hours each for convergence with PointBot and Reacher, respectively.
Software Dependencies No Our implementation builds upon OpenAI Spinning Up, PG-BROIL (Javed et al., 2021), and BROIL (Brown et al., 2020a) repositories. ... For solving the Quadratic Program (QP), we have used MOSEK optimizer...
Experiment Setup Yes For our policy network, we have used a Gaussian Multi-Layer Perceptron (MLP) with 4 hidden layers each having 64 neurons with the Tanh activation function. The network receives the agent's observations and produces a mean and standard deviation for each action dimension, and the agent takes actions by sampling from this Gaussian distribution. For optimization, we used the Adam optimizer with a learning rate of 3e-5, and for the subdominance calculation, we set the β parameter to 0.001. We solve the QP using MOSEK optimizer. Training goes on for 2000 iterations, and the best model is saved according to the lowest QP objective value. We use 10 demonstrations for training; they come from two main modes, with each mode having 5 demonstrations. During each iteration, we rollout 30 trajectories. For PSD-α, we have used Adam optimizer with a learning rate of 5e-4 for learning alpha values. Alpha values are initialized uniformly, sum to 1, and are always at least equal to 0.1. Training goes on for 4000 iterations and similarly the best model is saved.
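
The update in step 4 of Algorithm 1 (Pseudocode row) is a weighted log-likelihood gradient step. The Python sketch below applies it to a toy softmax policy in which each "trajectory" is a single discrete action. The toy policy, the function names, and the hand-picked training signals are assumptions made for illustration only; the optimal-transport solve of steps 2 and 3 is not reproduced, and this is not the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_prob(logits, action):
    """Gradient of log softmax(logits)[action] with respect to the logits."""
    probs = softmax(logits)
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]


def weighted_update(logits, actions, signals, lr):
    """One step of phi <- phi + lr * sum_i a_i * grad_phi log P_pi(xi_i).

    `actions` stands in for the sampled trajectories xi_i and `signals`
    for the training signals a_i (here supplied by hand, not by an OT solver).
    """
    new_logits = list(logits)
    for action, a_i in zip(actions, signals):
        g = grad_log_prob(logits, action)
        for k in range(len(new_logits)):
            new_logits[k] += lr * a_i * g[k]
    return new_logits


# Hypothetical usage: reinforce sample 0, discourage sample 1.
updated = weighted_update([0.0, 0.0, 0.0], actions=[0, 1], signals=[1.0, -1.0], lr=0.1)
```

A positive signal raises the log-probability of its sample and a negative one lowers it, which is the only behavior this sketch is meant to show.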
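
The evaluation protocol quoted in the Dataset Splits row (24 demonstrations, split into two equal halves at random, repeated 100 times, results averaged) can be sketched as follows. The function names, the fixed seed, and the `evaluate` callback are hypothetical; the source does not say how the random splits were generated.

```python
import random


def repeated_half_splits(items, n_repeats=100, seed=0):
    """Yield (train, test) pairs, each a random equal split of `items`."""
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    half = len(items) // 2
    for _ in range(n_repeats):
        shuffled = list(items)
        rng.shuffle(shuffled)
        yield shuffled[:half], shuffled[half:]


def average_over_splits(items, evaluate, n_repeats=100, seed=0):
    """Average a user-supplied evaluate(train, test) score over all splits."""
    scores = [
        evaluate(train, test)
        for train, test in repeated_half_splits(items, n_repeats, seed)
    ]
    return sum(scores) / len(scores)


# Hypothetical usage with demonstration indices 0..23.
splits = list(repeated_half_splits(range(24)))
```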
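
For readers who want the quoted Experiment Setup values in one place, the sketch below collects them into a small Python configuration. The class and field names are hypothetical, and it assumes that PSD-α reuses the policy settings not restated for it in the quote; the source does not confirm this. The helper only checks the stated constraints on the alpha values (uniform initialization, sum of 1, minimum of 0.1); how the authors enforce the lower bound during training is not described.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PSDConfig:
    hidden_sizes: tuple = (64, 64, 64, 64)  # 4 hidden layers of 64 units
    activation: str = "tanh"
    policy_lr: float = 3e-5                 # Adam learning rate
    beta: float = 0.001                     # subdominance parameter
    iterations: int = 2000
    n_demonstrations: int = 10              # two modes, 5 demonstrations each
    rollouts_per_iteration: int = 30


@dataclass(frozen=True)
class PSDAlphaConfig(PSDConfig):
    # Assumption: settings not restated for PSD-alpha carry over from PSDConfig.
    alpha_lr: float = 5e-4                  # Adam learning rate for alpha values
    alpha_min: float = 0.1
    iterations: int = 4000


def init_alphas(n, alpha_min=0.1):
    """Uniform alpha values that sum to 1; fails if 1/n violates the minimum."""
    if n <= 0 or 1.0 / n < alpha_min:
        raise ValueError("uniform initialization would violate alpha_min")
    return [1.0 / n] * n
```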