Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Thought Communication in Multiagent Collaboration

Authors: Yujia Zheng, Zhuokai Zhao, Zijian Li, Yaqi Xie, Mingze Gao, Lizhu Zhang, Kun Zhang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments on both synthetic and real-world benchmarks validate the theory and demonstrate the collaborative advantages of thought communication.
Researcher Affiliation	Collaboration	1 CMU, 2 Meta AI, 3 MBZUAI
Pseudocode	No	No explicit pseudocode or algorithm block is provided in the paper. The methodology is described in narrative form and supported by diagrams such as Figure 2: Overview of THOUGHTCOMM.
Open Source Code	No	The paper does not explicitly state that its own implementation code is open-source or provide a link to a code repository for the methodology described. It only mentions utilizing code from a baseline for comparison: "For baseline comparisons, we utilize the original code released by the authors" [Subramaniam et al., 2025].
Open Datasets	Yes	Therefore, we evaluate THOUGHTCOMM on two widely used math reasoning benchmarks, MATH [Hendrycks et al., 2021] and GSM8K [Cobbe et al., 2021] to assess its real-world effectiveness.
Dataset Splits	Yes	Following Subramaniam et al. [2025], we randomly sample 500 examples for fine-tuning the latent communication module, which includes both an autoencoder and an adapter, while reserving another 500 examples for evaluation.
Hardware Specification	Yes	All experiments are conducted on a single compute node with 8 NVIDIA H100 GPUs.
Software Dependencies	No	The paper does not provide specific version numbers for key software components, libraries, or frameworks used for implementation (e.g., PyTorch, Python, CUDA versions).
Experiment Setup	No	The paper mentions setting the prefix token count for the method to 1 in Implementation Details, and discusses varying prefix lengths from 1 to 16 in Section 5.4. However, it does not provide specific hyperparameters like learning rate, batch size, number of epochs, or optimizer settings for training the autoencoder and adapter.
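
The Dataset Splits row describes sampling 500 examples for fine-tuning and reserving another 500 for evaluation. A minimal sketch of one way to draw two such disjoint random subsets is shown here. The function name, the seed, and the pool size of 5000 are illustrative assumptions; the paper does not report a seed or its sampling code, so this is not the authors' procedure.

```python
import random


def sample_disjoint_splits(num_examples, train_size=500, eval_size=500, seed=0):
    """Return two non-overlapping lists of example indices.

    The seed is a hypothetical choice; the paper does not report one.
    """
    if train_size + eval_size > num_examples:
        raise ValueError("not enough examples for two disjoint splits")
    rng = random.Random(seed)
    # Sampling without replacement guarantees the two slices cannot overlap.
    picked = rng.sample(range(num_examples), train_size + eval_size)
    return picked[:train_size], picked[train_size:]


# Hypothetical pool of 5000 examples.
train_idx, eval_idx = sample_disjoint_splits(num_examples=5000)
```

Fixing and reporting the seed is what would make such a split repeatable by other researchers.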