Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

MisoDICE: Multi-Agent Imitation from Mixed-Quality Demonstrations

Authors: The Viet Bui, Tien Mai, Thanh Hong Nguyen

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, we conduct extensive experiments with comprehensive ablation studies on challenging multiagent environments, i.e., SMACv1 [46], and SMACv2 [12]. The results show that our MisoDICE outperforms all baselines, highlighting the benefit of its integrated approach, combining occupancy matching, value decomposition, and effective use of the multi-step labeling process.
Researcher Affiliation | Academia | The Viet Bui (Singapore Management University, Singapore); Tien Mai (Singapore Management University, Singapore); Thanh Hong Nguyen (University of Oregon, Eugene, Oregon, United States)
Pseudocode | Yes | Algorithm 1: MisoDICE Multi-Agent Imitation Policy Learning
Open Source Code | Yes | The data we used, along with our source code, has been uploaded with the main paper. We have also provided sufficient instructions for their use.
Open Datasets | Yes | We run various experiments on challenging MARL benchmarks, including the StarCraft Multi-Agent Challenge version 1 (SMACv1) [46], and its successor, SMACv2 [13]. In the appendix, we also include experiments on MAMuJoCo [11]. Our approach leverages the offline datasets generated by O-MAPL [6]. The O-MAPL framework contributes a valuable resource by providing distinct datasets categorized by quality (expert, medium, and poor) for a range of MARL tasks.
Dataset Splits | Yes | To form this dataset, we sample 200 expert trajectories and 1000 poor trajectories from the O-MAPL datasets for each considered task. These selected trajectories are then combined and shuffled thoroughly to create a single, unlabeled suboptimal dataset. We evaluate MisoDICE's performance (in terms of returns and win rates) across a range of K_expert values, specifically K_expert ∈ {50, 200, 400, 800, 1200}, using the LLM-based trajectory ranking method.
Hardware Specification | Yes | All experiments are implemented using PyTorch and run in parallel on a single NVIDIA H100 NVL Tensor Core GPU.
Software Dependencies | No | All experiments are implemented using PyTorch and run in parallel on a single NVIDIA H100 NVL Tensor Core GPU. Given the large size of the offline datasets for each instance, we compress all datasets into the H5 format using the h5py library.
Experiment Setup | Yes | The hyperparameters used in our experiments are detailed in Table 3. Table 3: Hyperparameters used in MisoDICE experiments. Optimizer: Adam; Learning rate (actor): 3e-4 (for SMAC), 1e-5 (for MaMujoco); Learning rate (critic): 3e-4; Tau (τ, soft update target rate): 0.005; Alpha (α): 0.05; Gamma (γ, discount factor): 0.99; Number of minibatch: 512; Agent hidden dimension: 256; Mixer hidden dimension: 64; Number of seeds: 4; Number of episodes for each evaluation step: 32; Number of epochs: 100.
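
For readers who want to reuse the reported setup, the sketch below collects the Table 3 values in one place and illustrates two steps quoted in the table above: the construction of the unlabeled suboptimal dataset (200 expert plus 1000 poor trajectories, shuffled) and a soft target update with rate τ. It is not the authors' code. The names `MisoDICEConfig`, `soft_update`, and `build_unlabeled_dataset` are hypothetical, the Polyak-averaging form of the soft update is an assumed reading of "soft update target rate", and plain Python lists stand in for network parameters and trajectories.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class MisoDICEConfig:
    """Values transcribed from Table 3; the field names are illustrative."""
    optimizer: str = "Adam"
    actor_lr_smac: float = 3e-4
    actor_lr_mamujoco: float = 1e-5
    critic_lr: float = 3e-4
    tau: float = 0.005            # soft update target rate
    alpha: float = 0.05
    gamma: float = 0.99           # discount factor
    minibatch: int = 512          # "Number of minibatch" in Table 3
    agent_hidden_dim: int = 256
    mixer_hidden_dim: int = 64
    num_seeds: int = 4
    eval_episodes: int = 32       # episodes per evaluation step
    num_epochs: int = 100


def soft_update(target, online, tau):
    """Assumed Polyak form: target <- tau * online + (1 - tau) * target."""
    return [tau * w + (1.0 - tau) * t for t, w in zip(target, online)]


def build_unlabeled_dataset(expert_pool, poor_pool,
                            n_expert=200, n_poor=1000, seed=0):
    """Sample from each pool without replacement, merge, and shuffle."""
    rng = random.Random(seed)
    mixed = rng.sample(expert_pool, n_expert) + rng.sample(poor_pool, n_poor)
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    cfg = MisoDICEConfig()
    # Placeholder trajectories; real ones would come from the O-MAPL datasets.
    expert_pool = [("expert", i) for i in range(300)]
    poor_pool = [("poor", i) for i in range(1500)]
    dataset = build_unlabeled_dataset(expert_pool, poor_pool)
    print(len(dataset), cfg.tau)
```

The quality tags in the placeholder trajectories exist only so the sample counts can be checked; in the setting described above, the merged dataset is unlabeled.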