Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Bayesian Ego-graph Inference for Networked Multi-Agent Reinforcement Learning
Authors: Wei Duan, Jie Lu, Junyu Xuan
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on both synthetic and real-world traffic control benchmarks show that BayesG outperforms state-of-the-art MARL baselines in both performance and interpretability. Our main contributions are: (1) We propose a stochastic graph-based policy for networked MARL, where each agent conditions decisions on a sampled subgraph over its physical neighbourhood. (2) We formulate latent graph learning as Bayesian variational inference, treating edge masks as posterior distributions constrained by the environment topology and agent-local data. (3) We develop an end-to-end training algorithm that integrates variational graph inference with actor-critic learning via an ELBO objective. |
| Researcher Affiliation | Academia | Wei Duan Australian Artificial Intelligence Institute University of Technology Sydney Sydney, Australia EMAIL Jie Lu Australian Artificial Intelligence Institute University of Technology Sydney Sydney, Australia EMAIL Junyu Xuan Australian Artificial Intelligence Institute University of Technology Sydney Sydney, Australia EMAIL |
| Pseudocode | Yes | Algorithm 1: BayesG: Multi-agent A2C Training with Variational Graph Inference; Algorithm 2: BayesG: Multi-agent Execution with Latent Graph Sampling |
| Open Source Code | Yes | Code and data are available at https://github.com/Wei9711/BayesG. |
| Open Datasets | Yes | We evaluate BayesG on five benchmark scenarios for adaptive traffic signal control (ATSC), implemented using the SUMO microscopic traffic simulator [38]. Each environment simulates peak-hour traffic... These networks are derived from real-world Manhattan layouts (see Appendix C for more details). Code and data are available at https://github.com/Wei9711/BayesG. |
| Dataset Splits | No | The paper conducts experiments in simulated traffic control environments (ATSC_Grid, Monaco, NewYork33, NewYork51, and NewYork167). Instead of traditional dataset splits, the paper describes simulation setups and averaging over 5 random seeds for statistical significance. No explicit training/validation/test splits of a fixed dataset are mentioned, because the environments are dynamic simulations. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | For all environments, we use a fixed control interval of 5 seconds. Policy and critic networks share similar architectures across all baselines. Each experiment is averaged over 5 random seeds. While any GNN can be employed, we use graph convolutional networks (GCNs) [42] for efficiency in our implementation. |
| Experiment Setup | Yes | For all environments, we use a fixed control interval of 5 seconds. Policy and critic networks share similar architectures across all baselines. Each experiment is averaged over 5 random seeds. We evaluate BayesG on five benchmark scenarios for adaptive traffic signal control (ATSC), implemented using the SUMO microscopic traffic simulator [38]. Each environment simulates peak-hour traffic, with one MDP step corresponding to a fixed control interval. Agents observe traffic conditions via induction-loop detectors (ILDs), including vehicle density, queue lengths, and waiting times on incoming lanes, and control local traffic signals. The reward is the negative number of halted vehicles, normalized by a fixed scale. The $\hat{A}^{\pi}_{i,\tau} = \hat{R}^{\pi}_{i,\tau} - v_{i,\tau}$ is the advantage estimate, where the reward is $\hat{R}^{\pi}_{i,\tau} = \sum_{\kappa=0}^{K-1} \gamma^{\kappa} \sum_{j \in \mathcal{V}_i} \alpha^{d_{ij}} r_{j,\tau+\kappa} + \gamma^{K} v_{i,\tau+K}$, and $K$ denotes the rollout horizon. The $v_{i,\tau} = V_{\omega_i}(s_{i,\tau}, u_{\mathcal{N}_i,\tau})$ is the target critic output. The $\alpha \in (0, 1]$ adjusts influence from distant neighbors, and $\beta$ controls the entropy regularization. Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [Yes] Justification: The paper provides comprehensive implementation details in Section 5.1.2, including the number of training timesteps, control intervals, input feature construction, network architectures, and the use of consistent settings across baselines. Hyperparameters such as learning rates, rollout horizon, optimizer type (RMSprop), and entropy coefficients are documented. Additionally, environment-specific simulation details such as episode lengths, control intervals, and traffic statistics are described in the Experimental Setup section. Further details are included in the supplementary material to ensure transparency and reproducibility. |
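
The advantage estimate quoted in the Experiment Setup row can be illustrated with a small sketch. This is not the authors' implementation: the function names, argument layout, and the example numbers are hypothetical, and the set of agents summed over is simply whatever the caller passes in. It only shows how a K-step return that discounts rewards in time by γ and in hop distance by α, bootstrapped with a critic value, yields an advantage once the current critic value is subtracted.

```python
def spatial_nstep_return(rewards, hop_dist, v_boot, gamma, alpha):
    """K-step return with temporal (gamma) and spatial (alpha) discounting.

    rewards  : list of K lists; rewards[kappa][j] is agent j's reward at step tau+kappa
    hop_dist : list of hop distances d_ij from the focal agent i to each agent j
    v_boot   : critic value used to bootstrap at step tau+K
    All names here are illustrative assumptions, not taken from the paper's code.
    """
    K = len(rewards)
    # alpha^{d_ij}: rewards from more distant agents count less
    weights = [alpha ** d for d in hop_dist]
    ret = 0.0
    for kappa, step_rewards in enumerate(rewards):
        mixed = sum(w * r for w, r in zip(weights, step_rewards))
        ret += (gamma ** kappa) * mixed
    # bootstrap with the critic value K steps ahead
    return ret + (gamma ** K) * v_boot


def advantage(rewards, hop_dist, v_now, v_boot, gamma, alpha):
    """Advantage = spatially discounted K-step return minus current critic value."""
    return spatial_nstep_return(rewards, hop_dist, v_boot, gamma, alpha) - v_now


if __name__ == "__main__":
    # Two agents (the focal agent at distance 0, one neighbor at distance 1),
    # rollout horizon K = 2; all numbers are made up for illustration.
    rewards = [[-1.0, -2.0], [-3.0, -4.0]]
    print(advantage(rewards, [0, 1], v_now=-1.0, v_boot=2.0, gamma=0.5, alpha=0.5))
```

With α = 1 every agent's reward counts equally regardless of distance, while smaller α values make the focal agent's objective increasingly local.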