AGENT: A Benchmark for Core Psychological Reasoning

Authors: Tianmin Shu, Abhishek Bhandwaldar, Chuang Gan, Kevin Smith, Shari Liu, Dan Gutfreund, Elizabeth Spelke, Joshua Tenenbaum, Tomer Ullman

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate AGENT with human ratings, propose an evaluation protocol emphasizing generalization, and compare two strong baselines built on Bayesian inverse planning and a Theory of Mind neural network. Our results suggest that to pass the designed tests of core intuitive psychology at human levels, a model must acquire or have built-in representations of how agents plan, combining utility computations and core knowledge of objects and physics.
Researcher Affiliation | Collaboration | 1 Massachusetts Institute of Technology, 2 MIT-IBM Watson AI Lab, 3 Harvard University.
Pseudocode | No | The paper describes its methods using text and mathematical formulations, and includes architectural diagrams (Figure 4, Figure 5), but does not provide pseudocode or a clearly labeled algorithm block.
Open Source Code | No | The paper states, 'The dataset and the supplementary material are available at https://www.tshu.io/AGENT.' and 'We plan to release the dataset and the code for data generation.' However, it does not explicitly state that source code for the described models (BIPaCK, ToMnet-G) is currently available or provide a direct link to it; code for data generation is not the same as the methodology code.
Open Datasets | Yes | We present AGENT (Action, Goal, Efficiency, coNstraint, uTility), a benchmark for core psychology reasoning... The dataset and the supplementary material are available at https://www.tshu.io/AGENT.
Dataset Splits | Yes | There are 8400 videos in AGENT. Each video lasts from 5.6 s to 25.2 s, with a frame rate of 35 fps. With these videos, we constructed 3360 trials in total, divided into 1920 training trials, 480 validation trials, and 960 testing trials (or 480 pairs of expected and surprising testing trials, where each pair shares the same familiarization video(s)).
Hardware Specification | No | The paper does not explicitly describe the hardware (e.g., GPU models, CPU types, or memory) used to run the experiments.
Software Dependencies | No | The paper mentions TDW (Gan et al., 2020) and PyBullet (Coumans & Bai, 2016–2019) as the simulation environment and physics engine, but does not provide specific version numbers for these or for any other software dependencies, such as programming languages or deep learning frameworks.
Experiment Setup | No | The paper mentions the loss function used for ToMnet-G and a parameter β = 0.2 for BIPaCK, but it does not provide comprehensive experimental-setup details such as learning rates, batch sizes, optimizer settings, or number of training epochs.
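
The 'Research Type' and 'Experiment Setup' rows above refer to Bayesian inverse planning over utilities with a parameter β = 0.2. For orientation only, the following is a minimal, generic sketch of the Boltzmann-rational action model that Bayesian inverse planning typically assumes; the goal and action names and utility values are hypothetical placeholders, not the authors' BIPaCK implementation.

```python
import math
from typing import Dict

def action_likelihoods(utilities: Dict[str, float], beta: float = 0.2) -> Dict[str, float]:
    """P(action | goal) ∝ exp(beta * U(action, goal)) -- softmax over utilities."""
    weights = {a: math.exp(beta * u) for a, u in utilities.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

def goal_posterior(
    observed_action: str,
    utilities_per_goal: Dict[str, Dict[str, float]],  # goal -> {action: utility}
    prior: Dict[str, float],                          # goal -> prior probability
    beta: float = 0.2,
) -> Dict[str, float]:
    """Bayesian inverse planning: P(goal | action) ∝ P(action | goal) * P(goal)."""
    unnormalized = {
        goal: action_likelihoods(utils, beta)[observed_action] * prior[goal]
        for goal, utils in utilities_per_goal.items()
    }
    z = sum(unnormalized.values())
    return {goal: v / z for goal, v in unnormalized.items()}

# Toy usage: two candidate goals, with utilities standing in for path costs.
print(goal_posterior(
    "move_left",
    {"red_object": {"move_left": 3.0, "move_right": 0.0},
     "blue_object": {"move_left": 0.0, "move_right": 3.0}},
    prior={"red_object": 0.5, "blue_object": 0.5},
))
```

Under a model of this kind, an observed action that is unlikely under every goal hypothesis is the sort of event that would receive a high surprise rating.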
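The 'Dataset Splits' row notes that the 960 testing trials form 480 expected/surprising pairs sharing the same familiarization video(s). A natural way to score models on such pairs is relative (pairwise) accuracy; the sketch below assumes a hypothetical surprise_score function standing in for whatever rating a model such as BIPaCK or ToMnet-G assigns to a test video, and is not the paper's evaluation code.

```python
from typing import Callable, Sequence, Tuple

def pairwise_accuracy(
    pairs: Sequence[Tuple[object, object, object]],      # (familiarization, expected, surprising)
    surprise_score: Callable[[object, object], float],   # hypothetical model rating
) -> float:
    """Fraction of pairs whose surprising video receives the higher surprise rating."""
    correct = sum(
        1
        for familiarization, expected, surprising in pairs
        if surprise_score(familiarization, surprising)
        > surprise_score(familiarization, expected)
    )
    return correct / len(pairs)
```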