Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

GUARDIAN: Safeguarding LLM Multi-Agent Collaborations with Temporal Graph Modeling

Authors: Jialong Zhou, Lichao Wang, Xiao Yang

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments demonstrate GUARDIAN's effectiveness in safeguarding LLM multi-agent collaborations against diverse safety vulnerabilities, achieving state-of-the-art accuracy with efficient resource utilization. The code is available at https://github.com/JialongZhou666/GUARDIAN. ... 5 Experiments. In this section, we present experimental settings and results for hallucination amplification and error injection and propagation scenarios. Our temporal graph model naturally aligns with multi-agent collaboration under the A2A protocol [12], with nodes representing agents and edges representing standardized communications.
Researcher Affiliation	Academia	Jialong Zhou, King's College London, London, UK; Lichao Wang, Beijing Institute of Technology, Beijing, China; Xiao Yang, Tsinghua University, Beijing, China
Pseudocode	No	The paper describes the GUARDIAN framework with a 'Framework overview' in Figure 3, illustrating components like 'Graph Preprocessing', 'Attributed Graph Encoder', 'Time Information Encoder', 'Attribute Reconstruction Decoder', and 'Structure Reconstruction Decoder'. While these describe procedural steps, they are presented as a block diagram and textual descriptions rather than a formal pseudocode or algorithm block.
Open Source Code	Yes	Extensive experiments demonstrate GUARDIAN's effectiveness in safeguarding LLM multi-agent collaborations against diverse safety vulnerabilities, achieving state-of-the-art accuracy with efficient resource utilization. The code is available at https://github.com/JialongZhou666/GUARDIAN.
Open Datasets	Yes	Datasets. Our evaluation employs four benchmark datasets that span diverse domains and cognitive requirements. The benchmark datasets include MMLU [40], MATH [41], FEVER [42], and Biographies [1].
Dataset Splits	No	Following [11, 16], we randomly sample 100 questions from each dataset, with three independent testing iterations to ensure statistical robustness. We train a separate model for each dataset and apply the Incremental Training Paradigm within each dataset, focusing on in-distribution anomaly detection where the data distribution remains consistent across episodes.
Hardware Specification	No	The paper mentions using LLM models like "GPT-3.5-turbo [45], GPT-4o [46], Claude-3.5-sonnet [47], Llama3.1-8B [48]" but does not provide specific hardware details (e.g., GPU models, CPU types, or memory) for running the experiments or training the GUARDIAN model.
Software Dependencies	No	We use BERT [37] to transform agent responses r_{t,i} into embeddings x_{t,i} as node features in the graph structure. ... Attributed Graph Encoder that extends GCN capabilities [38]... Time Information Encoder that adapts Transformer mechanisms [39]... We evaluate models in a zero-shot CoT [44] setting using both closed-sourced and open-sourced models (GPT-3.5-turbo [45], GPT-4o [46], Claude-3.5-sonnet [47], Llama3.1-8B [48]). The paper mentions various tools and models but does not specify exact version numbers for programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup	Yes	Implementation details. We evaluate models in a zero-shot CoT [44] setting using both closed-sourced and open-sourced models (GPT-3.5-turbo [45], GPT-4o [46], Claude-3.5-sonnet [47], Llama3.1-8B [48]), with all agents treated without role differentiation. We primarily test with 4 agents, conducting additional experiments with 3-7 agents. Runtime efficiency is evaluated under communication-targeted attacks using consistent experimental settings. Detailed prompts are provided in Appendix A.4. ... Ablation Studies. We investigate two crucial parameters as shown in Figure 6: α controls the balance between structural and attribute reconstruction, while γ regulates the compression-relevance trade-off in the information bottleneck (More details are presented in Appendix A.10). Using GPT-3.5-turbo with 4 agents, optimal performance is achieved when α ∈ [0.3, 0.5] and γ ∈ [0.001, 0.01].
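
Illustrative note on the Dataset Splits entry: the quoted protocol (100 randomly sampled questions per dataset, three independent testing iterations) does not state seeds or whether each iteration draws a fresh subset, which is part of why the variable is scored "No". The sketch below shows one way such a protocol could be made reproducible. It is not the authors' code: the function name `sample_eval_subsets`, the seeding scheme, the choice to resample per iteration, and the placeholder question list are all assumptions made for illustration.

```python
import random


def sample_eval_subsets(dataset, n_questions=100, n_iterations=3, base_seed=0):
    """Draw one random subset of questions per testing iteration.

    Assumption: each iteration resamples with its own fixed seed, so the
    three subsets differ from each other but can be regenerated exactly.
    """
    subsets = []
    for i in range(n_iterations):
        rng = random.Random(base_seed + i)  # independent, reproducible stream
        subsets.append(rng.sample(dataset, n_questions))  # without replacement
    return subsets


# Hypothetical stand-in for a benchmark such as MMLU: 1,000 question IDs.
questions = [f"q{idx}" for idx in range(1000)]
subsets = sample_eval_subsets(questions)
```

Reporting the seeds (here `base_seed` plus the iteration index) alongside the sampled question IDs would let a reader rebuild the exact evaluation subsets.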