Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

AutoToM: Scaling Model-based Mental Inference via Automated Agent Modeling

Authors: Zhining Zhang, Chuanyang Jin, Mung Yao Jia, Shunchi Zhang, Tianmin Shu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluated our method on multiple Theory of Mind benchmarks, including ToMi [26], BigToM [11], MMToM-QA [20], MuMA-ToM [39], and Hi-ToM [15]. The diversity and complexity of these benchmarks pose significant reasoning challenges. For instance, MMToM-QA and MuMA-ToM incorporate both vision and language inputs, while MuMA-ToM and Hi-ToM require higher-order inference.
Researcher Affiliation	Academia	Zhining Zhang Peking University EMAIL Chuanyang Jin Johns Hopkins University EMAIL Mung Yao Jia Johns Hopkins University EMAIL Shunchi Zhang Johns Hopkins University EMAIL Tianmin Shu Johns Hopkins University EMAIL
Pseudocode	Yes	Algorithm 1 AutoToM. Require: Question Q, terminate threshold Umin. 1: Automated Bayesian inverse planning 2: function BIP(M = (V^{ts:t}, X^{ts:t}), q) 3: Sample hypotheses for latent variables V^{ts:t} 4: Conduct Bayesian inference via LLMs to compute P(q | X^{ts:t}) ▷ Based on Eqn. (3) or Eqn. (4) 5: return P(q | X^{ts:t}) 6: end function 7: Automated Model Discovery 8: Extract query q from Q 9: Extract observable variables X^{1:t} from Q 10: ts ← t 11: while ts ≥ 1 do 12: Propose initial V^{ts} 13: M ← (V^{ts:t}, X^{ts:t}) 14: P(q | X^{ts:t}) ← BIP(M, q) 15: Compute the model utility U(M, q) 16: while V^{ts} does not contain all mental variables do 17: v^{ts}_new = arg max_{v ∉ V^{ts}} U(M + v, q) ▷ Based on results from BIP(M + v, q) 18: if U(M + v^{ts}_new, q) > U(M, q) then 19: M ← M + v^{ts}_new 20: P(q | X^{ts:t}) ← BIP(M, q) 21: else 22: Exit loop 23: end if 24: end while 25: if U(M, q) ≥ Umin then 26: Exit loop 27: else 28: ts ← ts − 1 29: end if 30: end while 31: Return the answer A ← arg max_q P(q | X^{ts:t})
Open Source Code	Yes	Links: Project Page | Code
Open Datasets	Yes	We evaluated our method on multiple Theory of Mind benchmarks, including ToMi [26], BigToM [11], MMToM-QA [20], MuMA-ToM [39], and Hi-ToM [15].
Dataset Splits	Yes	We evaluated each method across 20 episodes, with 5 episodes in each task category. To reduce variance, the results are reported as the average over 3 runs per episode. [...] For Hi-ToM, we choose the length 1 subset consisting of 200 questions across all orders (0-4) due to the high cost of testing the full dataset.
Hardware Specification	No	The paper does not explicitly provide specific hardware details (exact GPU/CPU models, processor types, memory amounts, or detailed computer specifications) used for running its experiments.
Software Dependencies	No	The paper mentions using different LLM backends such as GPT-4o, Qwen3-235b, DeepSeek-chat-v3, and Gemini-2.5-Flash, but does not provide specific version numbers for any ancillary software libraries or programming languages used in their implementation.
Experiment Setup	Yes	In Algorithm 1, we configure the hyperparameters as follows: α = 0.02, Umin = 0.693. [...] During online mental inference, we maintained K = 5 particles for hypotheses and set a weight threshold of τ = 0.1 for particle filtering.
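
The control flow of the model discovery loop quoted in the Pseudocode row can be sketched in Python. This is a minimal illustration, not the authors' implementation. The names discover_model, toy_bip, and toy_utility are hypothetical. The BIP step and the utility U(M, q) are passed in as callables, because this page does not reproduce how the paper computes them. As a further simplification, the sketch tracks a single set of latent variables per time window instead of one set per timestep.

```python
from typing import Callable, Dict, FrozenSet, Tuple

Posterior = Dict[str, float]  # candidate answer -> probability


def discover_model(
    t: int,
    all_vars: FrozenSet[str],
    initial_vars: FrozenSet[str],
    bip: Callable[[FrozenSet[str], int], Posterior],
    utility: Callable[[Posterior, FrozenSet[str], int], float],
    u_min: float,
) -> Tuple[str, FrozenSet[str], int]:
    """Greedy variable addition, then window extension (hypothetical sketch)."""
    if t < 1:
        raise ValueError("t must be at least 1")
    ts = t
    while True:
        # Propose an initial model for the window ts..t and score it.
        variables = initial_vars
        posterior = bip(variables, ts)
        best_u = utility(posterior, variables, ts)
        # Add the single most useful variable while utility keeps improving.
        while variables != all_vars:
            trials = []
            for v in sorted(all_vars - variables):
                trial_vars = variables | {v}
                trial_post = bip(trial_vars, ts)
                trials.append((utility(trial_post, trial_vars, ts), v, trial_post))
            u_new, v_new, post_new = max(trials, key=lambda item: item[0])
            if u_new <= best_u:
                break
            variables, posterior, best_u = variables | {v_new}, post_new, u_new
        # Stop once the model is good enough or no earlier timestep remains.
        if best_u >= u_min or ts == 1:
            break
        ts -= 1
    answer = max(posterior, key=posterior.get)
    return answer, variables, ts


# Toy stand-ins (assumptions for illustration only, not from the paper).
ALL_VARS = frozenset({"belief", "desire", "goal"})


def toy_bip(variables: FrozenSet[str], ts: int) -> Posterior:
    # Confidence grows with more variables and with a longer window (t = 3).
    conf = min(0.5 + 0.1 * len(variables) + 0.1 * (3 - ts), 0.99)
    return {"a": conf, "b": 1.0 - conf}


def toy_utility(posterior: Posterior, variables: FrozenSet[str], ts: int) -> float:
    # Reward a confident answer; charge 0.02 per variable in the model.
    return max(posterior.values()) - 0.02 * len(variables)
```

Calling discover_model(3, ALL_VARS, frozenset({"belief"}), toy_bip, toy_utility, 0.8) with these toy stand-ins first exhausts the variable set at ts = 3 without reaching the threshold, then extends the window to ts = 2, where the full model clears it. Passing bip and utility as parameters keeps the sketch independent of any particular LLM backend or utility definition.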