Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

AutoToM: Scaling Model-based Mental Inference via Automated Agent Modeling

Authors: Zhining Zhang, Chuanyang Jin, Mung Yao Jia, Shunchi Zhang, Tianmin Shu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluated our method on multiple Theory of Mind benchmarks, including ToMi [26], BigToM [11], MMToM-QA [20], MuMA-ToM [39], and Hi-ToM [15]. The diversity and complexity of these benchmarks pose significant reasoning challenges. For instance, MMToM-QA and MuMA-ToM incorporate both vision and language inputs, while MuMA-ToM and Hi-ToM require higher-order inference.
Researcher Affiliation Academia Zhining Zhang Peking University EMAIL Chuanyang Jin Johns Hopkins University EMAIL Mung Yao Jia Johns Hopkins University EMAIL Shunchi Zhang Johns Hopkins University EMAIL Tianmin Shu Johns Hopkins University EMAIL
Pseudocode Yes Algorithm 1 AutoToM
Require: Question Q, terminate threshold Umin
▷ Automated Bayesian inverse planning
function BIP(M = (V^{t_s:t}, X^{t_s:t}), q)
    Sample hypotheses for latent variables V^{t_s:t}
    Conduct Bayesian inference via LLMs to compute P(q | X^{t_s:t})    ▷ Based on Eqn. (3) or Eqn. (4)
    return P(q | X^{t_s:t})
end function
▷ Automated Model Discovery
Extract query q from Q
Extract observable variables X^{1:t} from Q
t_s ← t
while t_s ≥ 1 do
    Propose initial V^{t_s}
    M ← (V^{t_s:t}, X^{t_s:t})
    P(q | X^{t_s:t}) ← BIP(M, q)
    Compute the model utility U(M, q)
    while V^{t_s} does not contain all mental variables do
        v_new^{t_s} = arg max_{v ∉ V^{t_s}} U(M + v, q)    ▷ Based on results from BIP(M + v, q)
        if U(M + v_new^{t_s}, q) > U(M, q) then
            M ← M + v_new^{t_s}
            P(q | X^{t_s:t}) ← BIP(M, q)
        else
            Exit loop
        end if
    end while
    if U(M, q) ≥ Umin then
        Exit loop
    else
        t_s ← t_s − 1
    end if
end while
Return the answer A ← arg max_q P(q | X^{t_s:t})
Open Source Code Yes Links: Project Page | Code
Open Datasets Yes We evaluated our method on multiple Theory of Mind benchmarks, including ToMi [26], BigToM [11], MMToM-QA [20], MuMA-ToM [39], and Hi-ToM [15].
Dataset Splits Yes We evaluated each method across 20 episodes, with 5 episodes in each task category. To reduce variance, the results are reported as the average over 3 runs per episode. [...] For Hi-ToM, we choose the length 1 subset consisting of 200 questions across all orders (0-4) due to the high cost of testing the full dataset.
Hardware Specification No The paper does not explicitly provide specific hardware details (exact GPU/CPU models, processor types, memory amounts, or detailed computer specifications) used for running its experiments.
Software Dependencies No The paper mentions using different LLM backends such as GPT-4o, Qwen3-235b, DeepSeek-chat-v3, and Gemini-2.5-Flash, but does not provide specific version numbers for any ancillary software libraries or programming languages used in its implementation.
Experiment Setup Yes In Algorithm 1, we configure the hyperparameters as follows: α = 0.02, Umin = 0.693. [...] During online mental inference, we maintained K = 5 particles for hypotheses and set a weight threshold of τ = 0.1 for particle filtering.
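
Illustrative sketch (not taken from the paper or its released code): the control flow of the model-discovery part of Algorithm 1, as quoted in the Pseudocode row, can be written as a short Python function. The names `discover_model`, `bip`, and `utility` are hypothetical. `bip` stands in for the LLM-based Bayesian inverse planning step and `utility` for the model utility U(M, q); their internals are not specified in this report, so both are passed in as caller-supplied functions. The model M is simplified here to a set of latent variable names plus a start timestep.

```python
from typing import Callable, Dict, FrozenSet, Hashable, Tuple

Posterior = Dict[Hashable, float]  # maps each candidate answer to P(q | X)


def discover_model(
    t: int,
    initial_vars: FrozenSet[str],
    all_mental_vars: FrozenSet[str],
    bip: Callable[[FrozenSet[str], int], Posterior],             # hypothetical stand-in
    utility: Callable[[FrozenSet[str], int, Posterior], float],  # hypothetical stand-in
    u_min: float,
) -> Tuple[Hashable, FrozenSet[str], int]:
    """Greedy variable addition, then backward extension of the time window."""
    if t < 1:
        raise ValueError("t must be at least 1")
    t_s = t
    while True:
        # Start from the proposed initial variables at the current start step.
        variables = frozenset(initial_vars)
        posterior = bip(variables, t_s)
        u = utility(variables, t_s, posterior)

        # Greedily add the single variable that improves utility the most.
        while all_mental_vars - variables:
            scored = []
            for v in sorted(all_mental_vars - variables):
                trial = variables | {v}
                p = bip(trial, t_s)
                scored.append((utility(trial, t_s, p), v, p))
            best_u, best_v, best_p = max(scored, key=lambda s: s[0])
            if best_u > u:
                variables, posterior, u = variables | {best_v}, best_p, best_u
            else:
                break  # no candidate variable improves the utility

        # Stop once the model is good enough or no earlier timestep remains.
        if u >= u_min or t_s == 1:
            break
        t_s -= 1  # otherwise include one more earlier timestep

    answer = max(posterior, key=posterior.get)
    return answer, variables, t_s
```

The function returns the most probable answer together with the selected variables and the start timestep, so a caller can inspect which model was chosen. Re-running `bip` for every candidate variable mirrors the pseudocode; a real implementation would likely need to cache those calls, since each one involves LLM queries.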