Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Multi-Agent Reinforcement Learning with Communication-Constrained Priors
Authors: Guang Yang, Tianpei Yang, Jingwen Qiao, Yanqing Wu, Jing Huo, Xingguo Chen, Yang Gao
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 5 Experiments In this section, we evaluate the algorithm's effectiveness from three aspects: overall performance, the impact of communication-constrained priors (CCPs), and the role of Du-MIE for messages. The specific evaluation contents are as follows: (1) Overall Performance: This aims to validate the proposed algorithm's performance under different communication-constrained scenarios. (2) Impact of Communication Priors: This focuses on verifying the performance and properties of the proposed method when using different communication priors. (3) Role of Du-MIE for Messages: Through ablation experiments, this evaluation seeks to determine how this module impacts the learning of multi-agent policies under communication constraints. 5.1 Experimental Setup In the experimental setup, we integrate the proposed algorithm framework with MADDPG [17] to form Communication-Constrained MADDPG (CC-MADDPG) as the primary validation target. It is then compared with four baselines: MAIC [26], Full-Communication MADDPG (FC-MADDPG), Dropout-MADDPG, and the standard MADDPG, operating without inter-agent communication. We adopt the Multi-Agent Particle Environments (MPEs) [17] as benchmarks. |
| Researcher Affiliation | Academia | Guang Yang¹, Tianpei Yang¹˒², Jingwen Qiao², Yanqing Wu², Jing Huo¹, Xingguo Chen⁴, Yang Gao¹˒²˒³. ¹State Key Laboratory for Novel Software Technology, Nanjing University; ²School of Intelligence Science and Technology, Nanjing University; ³School of Network Security and Information Technology, Yi Li Normal University; ⁴Nanjing University of Posts and Telecommunications. EMAIL EMAIL EMAIL |
| Pseudocode | Yes | Algorithm 1 Communication-Constrained MARL 1: Input: maximum episode length $T$, hyperparameters $\alpha$ and $\beta$ to balance the effects of MI, update frequency $k$ for Du-MIE, communication-constrained priors $f_{\theta_e}$. 2: Initialize: main network parameters in MARL $\theta_Q$, $\theta_\pi$, corresponding target networks $\theta'_Q$ and $\theta'_\pi$, JSD parameters $\theta_1$, CLUB parameters $\theta_2$. 3: Initialize: experience replay buffer $D$. 4: repeat 5: for $t = 1$ to $T$ do 6: Get partial observation $o_t = \{o_t^i\}_{i=1}^N$. 7: Get message $M_t = \{M_t^i\}_{i=1}^N$. 8: Predict communication link status $I_t = \{\iota_t^i\}_{i=1}^N$, $\iota_t^i = \{\iota^{ji}\}_{j \neq i}$. 9: Execute joint actions $a_t = \{a_t^i\}_{i=1}^N$ via sampling $a_t^i \sim \pi^i(\cdot \mid o_t^i, M_t^i)$. 10: Receive $o_{t+1} = \{o_{t+1}^i\}_{i=1}^N$, $M_{t+1} = \{M_{t+1}^i\}_{i=1}^N$ and team reward $r_t$. 11: Calculate the shaping reward $r_t$ according to Equation (6). 12: end for 13: Store $v = \{o_t, M_t, I_t, o_{t+1}, M_{t+1}, a_t, r_t\}_{t=1}^T$ to $D$. 14: Update Du-MIE with replay buffer $D$ every $k$ steps, according to Equation (5). 15: Update network parameters in MARL, $\theta_Q$, $\theta_\pi$, according to Equations (7) and (8). 16: until reaching maximum training steps |
| Open Source Code | No | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: Our paper offers thorough methodological and parameter details, enabling reproduction of the main experimental results without the need for code. Comprehensive experimental setup and execution instructions are included to facilitate this process. |
| Open Datasets | Yes | We adopt the Multi-Agent Particle Environments (MPEs) [17] as benchmarks. |
| Dataset Splits | No | During the evaluation, the trained model is loaded and run for 100 episodes in various test environments. The training process of each episode is fixed to 25 time steps. The average episode cumulative reward and standard deviation of each algorithm are mainly recorded and compared. |
| Hardware Specification | Yes | In this experiment, an NVIDIA RTX A5000 24GB GPU was used. |
| Software Dependencies | No | The actor network of each agent is a neural network with two hidden layers, each with 64 neurons, activated with ReLU, and the output layer with tanh activation function to output actions. All agents share a centralized critic network, whose hidden layer structure is similar to the actor network. In the JSD network, communication messages and actions are passed through a single-layer encoder with 32 neurons, respectively, and the mutual information lower bound is estimated using Jensen-Shannon divergence. In the CLUB network, the middle layer has 32 neurons, activated with ReLU, and the output layer uses tanh activation to model the conditional distribution of lossy messages and actions. Adam optimizer is used for all networks. |
| Experiment Setup | Yes | A.1.2 Parameters Setting In this experiment, an NVIDIA RTX A5000 24GB GPU was used. The actor network of each agent is a neural network with two hidden layers, each with 64 neurons, activated with ReLU, and the output layer with tanh activation function to output actions. All agents share a centralized critic network, whose hidden layer structure is similar to the actor network. In the JSD network, communication messages and actions are passed through a single-layer encoder with 32 neurons, respectively, and the mutual information lower bound is estimated using Jensen-Shannon divergence. In the CLUB network, the middle layer has 32 neurons, activated with ReLU, and the output layer uses tanh activation to model the conditional distribution of lossy messages and actions. Adam optimizer is used for all networks. The learning rate of actor network, JSD network and CLUB network is 1 × 10^−4, the learning rate of critic network is 1 × 10^−3, the discount factor is set to 0.95, and the target network update rate is set to 0.01. The replay buffer size is 1 × 10^5, the message buffer size is 1 × 10^3, and the batch size is usually 1024 (in the Simple_Tag task, when the number of agents is 6 and 9, the batch size is adjusted to 512 to avoid the problem of CC-MADDPG training process exceeding GPU memory). The random seed is set to 1. The time step of each round is fixed to 25 steps. The total time step of training for all models is 4.0 × 10^6. When the total time steps exceed 1024, the model parameters are updated every 100 total time steps. For more effective action exploration, Ornstein-Uhlenbeck noise is added to the output actions of the actor network at the beginning of training with parameters θ = 0.15 and σ = 0.2. At the beginning of training, the noise scale decays linearly with the number of training rounds. |
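
Because the paper releases no code, the hyperparameters quoted in the Experiment Setup row can be easier to check when gathered in one place. The sketch below is only an illustration: the class name, field names, and helper functions are hypothetical and not taken from the paper, and `should_update` encodes one possible reading of "updated every 100 total time steps" once training passes 1024 steps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentConfig:
    """Values copied from the Experiment Setup row; field names are illustrative."""
    actor_lr: float = 1e-4
    jsd_lr: float = 1e-4
    club_lr: float = 1e-4
    critic_lr: float = 1e-3
    discount: float = 0.95
    target_update_rate: float = 0.01
    replay_buffer_size: int = 100_000
    message_buffer_size: int = 1_000
    batch_size: int = 1024
    seed: int = 1
    episode_length: int = 25
    total_steps: int = 4_000_000
    warmup_steps: int = 1024   # no updates until total steps exceed this
    update_every: int = 100    # update interval in total time steps


def batch_size_for(task: str, num_agents: int, cfg: ExperimentConfig) -> int:
    """Simple_Tag with 6 or 9 agents uses 512 to stay within GPU memory."""
    if task == "Simple_Tag" and num_agents in (6, 9):
        return 512
    return cfg.batch_size


def should_update(total_step: int, cfg: ExperimentConfig) -> bool:
    """Assumed schedule: update on every 100th step after the first 1024 steps."""
    return total_step > cfg.warmup_steps and total_step % cfg.update_every == 0


cfg = ExperimentConfig()
# Number of training episodes implied by the step budget and episode length.
num_episodes = cfg.total_steps // cfg.episode_length
```

Under these values the step budget corresponds to 160,000 episodes of 25 steps each. Note that a single random seed is reported, which limits what can be said about run-to-run variance.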
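
The exploration scheme quoted in the Experiment Setup row (Ornstein-Uhlenbeck noise with θ = 0.15 and σ = 0.2, scaled down linearly over training rounds) can be sketched as follows. This is a minimal sketch and not the authors' implementation: the zero mean, unit time step, decay horizon, start and end scales, and clipping to the tanh range [-1, 1] are all assumptions that the quoted text does not specify.

```python
import math
import random


class OUNoise:
    """Discretized Ornstein-Uhlenbeck process (mu=0 and dt=1 are assumptions)."""

    def __init__(self, dim, theta=0.15, sigma=0.2, mu=0.0, dt=1.0, seed=1):
        self.dim = dim
        self.theta = theta
        self.sigma = sigma
        self.mu = mu
        self.dt = dt
        self.rng = random.Random(seed)
        self.state = [mu] * dim

    def reset(self):
        """Restart the process at its mean, e.g. at the start of an episode."""
        self.state = [self.mu] * self.dim

    def sample(self):
        """Advance one step: mean reversion plus Gaussian diffusion."""
        self.state = [
            x
            + self.theta * (self.mu - x) * self.dt
            + self.sigma * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
            for x in self.state
        ]
        return list(self.state)


def noise_scale(episode, decay_episodes, start=1.0, end=0.0):
    """Linear decay from `start` to `end` over `decay_episodes` (hypothetical horizon)."""
    frac = min(max(episode / decay_episodes, 0.0), 1.0)
    return start + (end - start) * frac


def explore(action, noise, scale):
    """Add scaled noise to an action and clip to the tanh output range [-1, 1]."""
    return [max(-1.0, min(1.0, a + scale * n)) for a, n in zip(action, noise)]
```

A typical use would draw `noise = ou.sample()` at each time step and pass `explore(actor_output, noise, noise_scale(episode, decay_episodes))` to the environment, calling `ou.reset()` between episodes.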