AGENT: A Benchmark for Core Psychological Reasoning
Authors: Tianmin Shu, Abhishek Bhandwaldar, Chuang Gan, Kevin Smith, Shari Liu, Dan Gutfreund, Elizabeth Spelke, Joshua Tenenbaum, Tomer Ullman
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate AGENT with human ratings, propose an evaluation protocol emphasizing generalization, and compare two strong baselines built on Bayesian inverse planning and a Theory of Mind neural network. Our results suggest that to pass the designed tests of core intuitive psychology at human levels, a model must acquire or have built-in representations of how agents plan, combining utility computations and core knowledge of objects and physics. |
| Researcher Affiliation | Collaboration | Massachusetts Institute of Technology, MIT-IBM Watson AI Lab, Harvard University. |
| Pseudocode | No | The paper describes methods using text and mathematical formulations, and includes architectural diagrams (Figure 4, Figure 5), but does not provide pseudocode or a clearly labeled algorithm block. |
| Open Source Code | No | The paper states, 'The dataset and the supplementary material are available at https://www.tshu.io/AGENT.' and 'We plan to release the dataset and the code for data generation.' However, it does not explicitly state that the source code for the described methodologies (BIPaCK, ToMnet-G) is currently available or provide a direct link to it. 'Code for data generation' is not the same as the methodology code. |
| Open Datasets | Yes | We present AGENT (Action, Goal, Efficiency, coNstraint, uTility), a benchmark for core psychology reasoning... The dataset and the supplementary material are available at https://www.tshu.io/AGENT. |
| Dataset Splits | Yes | There are 8400 videos in AGENT. Each video lasts from 5.6 s to 25.2 s, with a frame rate of 35 fps. With these videos, we constructed 3360 trials in total, divided into 1920 training trials, 480 validation trials, and 960 testing trials (or 480 pairs of expected and surprising testing trials, where each pair shares the same familiarization video(s)). |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types, or memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions TDW (Gan et al., 2020) and PyBullet (Coumans & Bai, 2016–2019) as simulation environments/physics engines, but does not provide specific version numbers for these or for any other software dependencies, such as programming languages or deep learning frameworks. |
| Experiment Setup | No | The paper mentions the loss function used for ToMnet-G and a parameter 'β = 0.2' for BIPaCK. However, it does not provide specific details of the experimental setup, such as learning rates, batch sizes, optimizer settings, or the number of training epochs. |