Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
In-Context Fully Decentralized Cooperative Multi-Agent Reinforcement Learning
Authors: Chao Li, Bingkun BAO, Yang Gao
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimentally, we evaluate RAC on various cooperative tasks (including matrix game, predator and prey, and SMAC), and its significant performance validates its effectiveness. Empirically, we evaluate RAC against multiple baselines across various cooperative tasks, including matrix game, predator and prey, and StarCraft Multi-Agent Challenge (SMAC) [24]. |
| Researcher Affiliation | Academia | 1 School of Computer Science, Nanjing University of Posts and Telecommunications; 2 State Key Laboratory of Tibetan Intelligence; 3 Peng Cheng Laboratory; 4 School of Intelligent Science and Technology, Nanjing University; 5 State Key Laboratory for Novel Software Technology, Nanjing University |
| Pseudocode | Yes | Algorithm 1: Return-Aware Context (RAC) |
| Open Source Code | Yes | We provide the source code of our algorithm in the supplementary materials. |
| Open Datasets | Yes | Empirically, we evaluate RAC against multiple baselines across various cooperative tasks, including matrix game, predator and prey, and StarCraft Multi-Agent Challenge (SMAC) [24]. |
| Dataset Splits | No | The paper does not explicitly mention training/test/validation dataset splits with percentages, sample counts, or specific predefined split methodologies. It discusses sampling batches of episodes from a replay buffer for training and reports test returns/win rates, but without detailing the structure of these test sets or how they are split from generated data in a reproducible manner. |
| Hardware Specification | Yes | The experiments are carried out on a server, which comprises an AMD EPYC 7542 32-Core Processor CPU, 504 GB RAM, and 8 NVIDIA GeForce RTX 4090 D GPUs. |
| Software Dependencies | No | We implement all algorithms based on the PyMARL framework. However, specific version numbers for PyMARL itself, programming languages, or other core libraries (e.g., Python, PyTorch/TensorFlow) are not provided. |
| Experiment Setup | Yes | The hyper-parameters of RAC across all tasks are provided in Tab. 4. In particular, $m$ denotes the number of episodic return intervals, while $\beta$ is a scaling factor balancing the TD loss and the supervision loss of $Q^i_S$. $T_{\epsilon}$ denotes the anneal time steps of $\epsilon$ when an $\epsilon$-greedy policy is used for exploration. $T_{\max}$ represents the total number of training time steps, and $N_{\max}$ is the size of the replay buffer. $N_{\text{batch}}$ represents the size of sampled batches per training. $\alpha$ is the learning rate and $\gamma$ is the discount factor. We decrease $\epsilon$ from $\epsilon_{\max}$ to $\epsilon_{\min}$ within $T_{\epsilon}$ time steps. In addition, we use the RMSprop technique to update all networks of RAC using gradient descent. For the matrix game and predator and prey tasks, we instantiate $Q^i(s_t, c_\kappa, a^i_t)$ using the Hyper-Net architecture. For the SMAC maps, we use the Normal-Net to implement $Q^i(s_t, c_\kappa, a^i_t)$. |
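
The Experiment Setup entry states that ϵ is decreased from ϵmax to ϵmin within Tepsilon time steps but does not give the shape of the schedule. The sketch below assumes a linear anneal followed by a constant value, which is a common choice in ϵ-greedy exploration; the function name `linear_epsilon` and its parameter names are hypothetical and are not taken from the paper or its code. The numeric values in the usage lines are placeholders, not the settings from Tab. 4.

```python
def linear_epsilon(t: int, eps_max: float, eps_min: float, t_epsilon: int) -> float:
    """Anneal epsilon from eps_max to eps_min over t_epsilon steps, then hold.

    Assumption: the decay is linear. The source only says that epsilon is
    decreased from eps_max to eps_min within t_epsilon time steps.
    """
    if t_epsilon <= 0:
        # Degenerate schedule: no annealing window, use the final value.
        return eps_min
    # Fraction of the annealing window elapsed, clamped to [0, 1].
    frac = min(max(t, 0), t_epsilon) / t_epsilon
    return eps_max + frac * (eps_min - eps_max)


# Placeholder values for illustration only (not the paper's hyper-parameters).
for step in (0, 25_000, 50_000, 100_000):
    print(step, linear_epsilon(step, eps_max=1.0, eps_min=0.05, t_epsilon=50_000))
```

After `t_epsilon` steps the function keeps returning `eps_min`, so exploration stays at its final level for the remainder of training.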