Baby Intuitions Benchmark (BIB): Discerning the goals, preferences, and actions of others

Authors: Kanishk Gandhi, Gala Stojnic, Brenden M. Lake, Moira R. Dillon

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The Baby Intuitions Benchmark (BIB) [1] challenges machines to predict the plausibility of an agent's behavior based on the underlying causes of its actions. Because BIB's content and paradigm are adopted from developmental cognitive science, BIB allows for direct comparison between human and machine performance. Nevertheless, recently proposed, deep-learning-based agency-reasoning models fail to show infant-like reasoning, leaving BIB an open challenge. The models were trained on 80% of the background training episodes (training set), and the remaining episodes were used for validation (validation set). A comparison of the MSE loss (on pixels for the video model, and in the action space for the BC and RL models) on the training and validation sets indicated that the models had learned the training tasks successfully (see Appendix C). The results of our baselines are presented in Table 1.
Researcher Affiliation | Academia | Kanishk Gandhi (New York University); Gala Stojnic (New York University); Brenden M. Lake (New York University); Moira R. Dillon (New York University)
Pseudocode | No | The paper does not contain explicit pseudocode or algorithm blocks.
Open Source Code | Yes | [1] The dataset and code are available here: https://kanishkgandhi.com/bib
Open Datasets | Yes | [1] The dataset and code are available here: https://kanishkgandhi.com/bib
Dataset Splits | Yes | The models were trained on 80% of the background training episodes (training set), and the remaining episodes were used for validation (validation set).
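The split reported above (80% of the background training episodes for training, the rest for validation) can be sketched as follows. This is a minimal illustration, not the authors' code: the `split_episodes` helper, the fixed seed, and the episode count are all assumptions made for the example.

```python
import random

def split_episodes(episodes, train_frac=0.8, seed=0):
    """Shuffle episodes and split them into train/validation subsets.

    episodes: list of episode identifiers (or episode objects)
    train_frac: fraction assigned to the training set (0.8 per the paper)
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = episodes[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

# Stand-ins for background training episodes; the real count differs.
episodes = list(range(1000))
train_set, val_set = split_episodes(episodes)
print(len(train_set), len(val_set))  # 800 200
```

Comparing the training-set and validation-set MSE of a model fit on `train_set` (as the quoted passage describes) is then a standard check that the models learned the background tasks rather than memorizing them.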
Hardware Specification | No | The paper does not provide specific hardware details for running the experiments.
Software Dependencies | No | The paper mentions software components such as 'Python', 'PyTorch', 'U-Net', 'LSTM', and 'MLP' but does not specify their version numbers.
Experiment Setup | No | The paper describes model architectures and the general training approach (e.g., 'trained on 80% of the background training episodes'; 'encode the familiarization trials as context using either a bidirectional LSTM or an MLP'), but it does not provide concrete hyperparameter values or detailed system-level training settings in the main text; it refers readers to Appendix C for 'full model specifications'.