Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

Wonder Wins Ways: Curiosity-Driven Exploration through Multi-Agent Contextual Calibration

Authors: Yiyuan Pan, Zhe Liu, Hesheng Wang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate CERMIC on benchmark suites including VMAS, Meltingpot, and SMACv2. Empirical results demonstrate that exploration with CERMIC significantly outperforms SoTA algorithms in sparse-reward environments.
Researcher Affiliation | Academia | Yiyuan Pan, Zhe Liu, Hesheng Wang. Shanghai Jiao Tong University. Corresponding authors. The authors are with the School of Automation and Intelligent Sensing, Shanghai Jiao Tong University, the Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240. Zhe Liu is also with the National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University.
Pseudocode | Yes | Algorithm 1 CERMIC
Open Source Code | Yes | Code: https://github.com/PyyWill/CERMIC
Open Datasets | Yes | We evaluate our approach on a diverse set of MARL benchmarks: VMAS (9 tasks) [3], Melting Pot (4 tasks) [1], and SMACv2 (2 tasks) [9]. All environments were adapted to sparse-reward configurations to rigorously test exploration capabilities; specific modifications and implementation details are provided in Appendix E.
Dataset Splits | No | The paper mentions that environments were adapted to sparse-reward configurations (Appendix E) and discusses training and testing under different reward densities (Section 5.5) but does not provide specific details on training, validation, and test dataset splits (e.g., percentages, sample counts, or explicit splitting methodology) for the underlying datasets.
Hardware Specification | Yes | Training/evaluation was conducted via BenchMARL suite on NVIDIA Quadro RTX 8000 GPUs.
Software Dependencies | No | The paper does not provide specific version numbers for software dependencies such as programming languages, libraries, or frameworks (e.g., Python 3.x, PyTorch 1.x, CUDA 11.x). While it mentions 'BenchMARL suite', it lacks concrete versioning information for these tools.
Experiment Setup | Yes | Table 6: Key training hyperparameters (Parameter Category: Value / Setting). General Optimization: Discount Factor (γ) 0.99; Learning Rate (Adam) 1e-4; Adam Optimizer ϵ 1e-6. Target Network Updates: Update Type Soft (Polyak averaging); Polyak Tau (τ) 0.005. Exploration Strategy: Initial Epsilon (ϵ_init) 0.8; Final Epsilon (ϵ_end) 0.01. Training Duration: Max Frames 3000000. On-Policy Collection & Training: Collected Frames per Batch 36000; Environments per Worker 60; Minibatch Iterations 30; Minibatch Size 2400. Off-Policy Collection & Training: Optimizer Steps per Collection 1800; Train Batch Size 1024; Replay Buffer Size 1500000.
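
For readers who want to sanity-check the Table 6 values, the sketch below transcribes them into a plain Python dictionary and works out a few derived quantities. The key names and the dictionary layout are hypothetical, not the authors' configuration format. The derivations also rest on assumptions the excerpt does not confirm: that a collected batch is split evenly across the 60 environments, that "Minibatch Iterations" means full passes over the collected batch, and that the number of collection rounds is the frame budget divided by the batch size, rounded up.

```python
import math

# Values transcribed from Table 6. Key names are illustrative assumptions,
# not the authors' (or any library's) actual configuration keys.
HPARAMS = {
    "gamma": 0.99,
    "lr": 1e-4,
    "adam_eps": 1e-6,
    "polyak_tau": 0.005,
    "eps_init": 0.8,
    "eps_end": 0.01,
    "max_frames": 3_000_000,
    "on_policy": {
        "frames_per_batch": 36_000,
        "envs_per_worker": 60,
        "minibatch_iters": 30,
        "minibatch_size": 2_400,
    },
    "off_policy": {
        "optimizer_steps_per_collection": 1_800,
        "train_batch_size": 1_024,
        "replay_buffer_size": 1_500_000,
    },
}


def derived_on_policy(hp):
    """Derive per-collection quantities under the assumptions stated above."""
    on = hp["on_policy"]
    # Frames contributed by each environment to one collected batch.
    steps_per_env = on["frames_per_batch"] // on["envs_per_worker"]
    # Minibatches needed to cover the collected batch once.
    minibatches_per_pass = on["frames_per_batch"] // on["minibatch_size"]
    # Gradient updates per collection if every pass visits every minibatch.
    updates_per_collection = minibatches_per_pass * on["minibatch_iters"]
    # Collection rounds needed to reach the frame budget (rounded up).
    collections = math.ceil(hp["max_frames"] / on["frames_per_batch"])
    return {
        "steps_per_env": steps_per_env,
        "minibatches_per_pass": minibatches_per_pass,
        "updates_per_collection": updates_per_collection,
        "collections": collections,
    }


if __name__ == "__main__":
    for name, value in derived_on_policy(HPARAMS).items():
        print(f"{name}: {value}")
```

Under these assumptions the on-policy setting works out to 600 frames per environment per batch, 15 minibatches per pass, 450 updates per collection, and 84 collection rounds.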
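
Table 6 names two mechanisms that a short sketch may make concrete: soft (Polyak) target-network updates with τ = 0.005, and an exploration epsilon that moves from 0.8 to 0.01. The code below is a minimal illustration in plain Python, not the authors' implementation. The function names are hypothetical, parameters are represented as flat lists of floats, and the linear decay shape and the `anneal_frames` horizon are assumptions, since the excerpt gives only the two endpoint values.

```python
def polyak_update(target, online, tau=0.005):
    """Soft target update: target <- (1 - tau) * target + tau * online."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target, online)]


def linear_epsilon(frame, anneal_frames, eps_init=0.8, eps_end=0.01):
    """Epsilon annealed linearly from eps_init to eps_end (assumed schedule).

    `anneal_frames` is a hypothetical horizon; the source does not state one.
    """
    # Clamp progress to [0, 1] so epsilon stays at eps_end after the horizon.
    frac = min(max(frame / anneal_frames, 0.0), 1.0)
    return eps_init + frac * (eps_end - eps_init)


if __name__ == "__main__":
    target = [0.0, 1.0]
    online = [1.0, 1.0]
    # With tau = 0.005 the target moves 0.5% of the way toward the online value.
    print(polyak_update(target, online))
    print(linear_epsilon(0, 1_000_000), linear_epsilon(1_000_000, 1_000_000))
```

With a small τ such as 0.005, the target parameters track the online parameters slowly, which is the usual motivation for soft updates over periodic hard copies.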