Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

MetaMind: Modeling Human Social Thoughts with Metacognitive Multi-Agent Systems

Authors: Xuanming Zhang, Yuxuan Chen, Samuel (Min-Hsuan) Yeh, Sharon Li

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We conduct a comprehensive empirical evaluation of MetaMind across a suite of challenging social intelligence benchmarks, including ToM reasoning [19], social cognition, and social simulation [20] tasks. Our study spans over 16 contemporary LLMs, assessing both general social reasoning ability and performance in real-world, context-sensitive scenarios. Empirical results show that MetaMind achieves a 35.7% average improvement on real social scenario tasks and a 9.0% average gain in overall social cognition ability, substantially enhancing the social competence of underlying LLMs. Notably, our framework enables representative LLMs to match average human performance on key benchmarks. We also perform detailed ablation studies to isolate the contribution of each agent in the system, revealing that all three stages are critical to the framework's success.
Researcher Affiliation	Academia	Xuanming Zhang1, Yuxuan Chen2, Samuel Yeh1, Sharon Li1 1University of Wisconsin-Madison 2Tsinghua University EMAIL, EMAIL
Pseudocode	Yes	Algorithm 1: Hypothesis Selection
Open Source Code	Yes	Code is available at https://github.com/XMZhangAI/MetaMind.
Open Datasets	Yes	We evaluate MetaMind on four benchmarks: ToMBench, STSS, SocialIQA, and SOTOPIA. ToMBench (https://github.com/zhchen18/ToMBench) offers the most comprehensive multiple-choice evaluation of Theory-of-Mind... STSS (https://github.com/wcx21/Social-Tasks-in-Sandbox-Simulation) is an action-level benchmark... SocialIQA (https://huggingface.co/datasets/allenai/social_i_qa) probes models' ability to infer motivations... SOTOPIA (https://huggingface.co/datasets/cmu-lti/sotopia) is an open-ended role-play environment...
Dataset Splits	Yes	Following the original protocol, we evaluate performance on the full test set. We evaluate MetaMind on the full test set, including the conversation-focused split, comprising 30 episodes (5 per category), and report the normalized success score. Following standard protocol, we evaluate MetaMind on the full test set and report multiple-choice accuracy, using leaderboard-reported LLM performance as the baseline for comparison.
Hardware Specification	Yes	All numbers are tested on a single A100 80GB for 166.8 hours; batch size = 1.
Software Dependencies	Yes	We port the author-released JAX implementation to Python 3.11 and limit graph depth to second-order beliefs.
Experiment Setup	Yes	We conduct a comprehensive grid search to optimize MetaMind's key parameters. Specifically, we sweep over the hypothesis size k ∈ {0, 1, ..., 10}, the coefficient λ ∈ [0, 1] (in steps of 0.01), and the balance factor β ∈ [0, 1] (in steps of 0.01). We use GPT-4 as the underlying model and report overall accuracy on ToMBench as the evaluation metric. Final Configuration. To reduce inference overhead, we select the smaller window size k=6 for all experiments while fixing (λ, β) = (0.60, 0.80), which lies on the high-accuracy ridge close to the global optimum.
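
To give a sense of the size of the sweep quoted in the Experiment Setup row, the following is a minimal sketch of that grid in Python. It is not the authors' code: the function names (`build_grid`, `grid_search`, `placeholder_score`) are hypothetical, and the scoring function is an arbitrary stand-in for "overall accuracy on ToMBench with GPT-4", which is not reproduced here. Only the ranges and step sizes come from the quoted text.

```python
from itertools import product


def build_grid(k_max=10, step=0.01):
    """Return the k values and the shared [0, 1] grid used for lambda and beta."""
    n_steps = round(1 / step)
    ks = range(0, k_max + 1)  # k in {0, 1, ..., 10}
    # Build from integers to avoid floating-point drift when stepping by 0.01.
    coeffs = [round(i * step, 2) for i in range(n_steps + 1)]
    return ks, coeffs


def grid_search(evaluate, ks, lambdas, betas):
    """Exhaustively score every (k, lambda, beta) triple; return the best one."""
    best = None
    for k, lam, beta in product(ks, lambdas, betas):
        score = evaluate(k, lam, beta)
        if best is None or score > best[0]:
            best = (score, k, lam, beta)
    return best


def placeholder_score(k, lam, beta):
    # ASSUMPTION: arbitrary stand-in with a single peak at (3, 0.25, 0.50).
    # It does not model MetaMind or any reported result.
    return -((k - 3) ** 2) * 0.01 - (lam - 0.25) ** 2 - (beta - 0.50) ** 2


if __name__ == "__main__":
    ks, coeffs = build_grid()
    print(len(ks) * len(coeffs) * len(coeffs))  # 11 * 101 * 101 = 112211
    print(grid_search(placeholder_score, ks, coeffs, coeffs))
```

Under these ranges the full grid has 11 × 101 × 101 = 112,211 configurations, each of which would require a complete benchmark evaluation if searched exhaustively.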