Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

G-Memory: Tracing Hierarchical Memory for Multi-Agent Systems

Authors: Guibin Zhang, Muxin Fu, Kun Wang, Frank Wan, Miao Yu, Shuicheng Yan

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments across five benchmarks, three LLM backbones, and three popular MAS frameworks demonstrate that G-Memory improves success rates in embodied action and accuracy in knowledge QA by up to 20.89% and 10.12%, respectively, without any modifications to the original frameworks.
Researcher Affiliation	Academia	Guibin Zhang1, Muxin Fu2, Kun Wang3, Guancheng Wan4, Miao Yu5, Shuicheng Yan1. Affiliations: 1NUS, 2Tongji University, 3NTU, 4WHU, 5A*STAR. Equal Contribution, Corresponding author.
Pseudocode	No	The paper describes the G-Memory workflow in Section 4 using prose and mathematical formalizations. Appendix C provides a 'Prompt Set' for LLM interaction, but the paper contains no general pseudocode or algorithm blocks for the overall methodology.
Open Source Code	Yes	Our codes are available at https://github.com/bingreeky/GMemory.
Open Datasets	Yes	ALFWorld [77] (available at https://alfworld.github.io/, MIT license) is a text-based embodied environment featuring household tasks, where agents navigate and interact with objects via natural language commands. ScienceWorld [78] (available at https://github.com/allenai/ScienceWorld, Apache-2.0 license) is another text-based embodied environment designed for interactive science tasks. PDDL is a game dataset from AgentBoard [79] (available at https://github.com/hkust-nlp/AgentBoard, Custom properties), comprising a variety of strategic games where agents use PDDL expressions to complete complex tasks. HotpotQA [75] (available at https://hotpotqa.github.io/, CC BY-SA 4.0 License) is a multi-hop question answering dataset with strong supervision on supporting facts. FEVER [76] (available at https://fever.ai/dataset/fever.html, Creative Commons Attribution-Share Alike License) is a knowledge-intensive dataset focused on fact verification.
Dataset Splits	No	There are no explicit dataset splits.
Hardware Specification	No	For instantiating these MAS frameworks, we adopt two open-source LLMs, Qwen-2.5-7b and Qwen-2.5-14b, as well as one proprietary LLM, gpt-4o-mini. The deployment of Qwen series is via local instantiation using Ollama, and GPT models are accessed via OpenAI APIs.
Software Dependencies	No	We implement the embedding function v(·) in Equation (4) with all-MiniLM-L6-v2 [80]. The deployment of Qwen series is via local instantiation using Ollama, and GPT models are accessed via OpenAI APIs.
Experiment Setup	Yes	The number of the most relevant interaction graphs M in Equation (7) is set among {2, 3, 4, 5}, and the number of relevant queries k in Equation (4) is set among {1, 2}. The detailed ablation study on hyper-parameters is placed in Section 5.4.
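
The Software Dependencies and Experiment Setup rows mention an embedding function v(·) and a number k of relevant queries, but this record does not reproduce Equation (4) itself. The following is a minimal sketch of what a top-k lookup over embedded queries could look like, assuming ranking by cosine similarity. The function names, the toy three-dimensional vectors, and the stored query strings are all hypothetical and are not taken from the paper or its repository; in practice the vectors would come from an embedding model such as all-MiniLM-L6-v2.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_queries(query_vec, stored, k):
    """Return the k stored query strings most similar to query_vec.

    `stored` maps a query string to its (hypothetical) embedding vector.
    """
    ranked = sorted(
        stored.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy embeddings standing in for the output of an embedding model (assumption).
stored = {
    "heat water": [1.0, 0.0, 0.0],
    "cool apple": [0.0, 1.0, 0.0],
    "boil water": [0.9, 0.1, 0.0],
}

# k = 2 is one of the values listed in the Experiment Setup row.
result = top_k_queries([1.0, 0.05, 0.0], stored, k=2)
print(result)
```

With k drawn from {1, 2} as reported in the table, such a lookup would return only the one or two nearest stored queries. How those queries are then used by G-Memory is described in the paper, not here.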