Baby Intuitions Benchmark (BIB): Discerning the goals, preferences, and actions of others

Authors: Kanishk Gandhi, Gala Stojnic, Brenden M. Lake, Moira R. Dillon

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The Baby Intuitions Benchmark (BIB) [1] challenges machines to predict the plausibility of an agent's behavior based on the underlying causes of its actions. Because BIB's content and paradigm are adopted from developmental cognitive science, BIB allows for direct comparison between human and machine performance. Nevertheless, recently proposed, deep-learning-based agency-reasoning models fail to show infant-like reasoning, leaving BIB an open challenge. The models were trained on 80% of the background training episodes (training set), and the remaining episodes were used for validation (validation set). A comparison of the MSE loss (on pixels for the video model, and in the action space for the BC and RL models) on the training and validation sets indicated that the models had learned the training tasks successfully (see Appendix C). The results of our baselines are presented in Table 1.
Researcher Affiliation | Academia | Kanishk Gandhi (New York University); Gala Stojnic (New York University); Brenden M. Lake (New York University); Moira R. Dillon (New York University)
Pseudocode | No | The paper does not contain explicit pseudocode or algorithm blocks.
Open Source Code | Yes | [1] The dataset and code are available here: https://kanishkgandhi.com/bib
Open Datasets | Yes | [1] The dataset and code are available here: https://kanishkgandhi.com/bib
Dataset Splits | Yes | The models were trained on 80% of the background training episodes (training set), and the remaining episodes were used for validation (validation set).
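The split reported above (80% of the background training episodes for training, the rest for validation) can be sketched as follows. This is a minimal illustration, not the authors' code: the `split_episodes` helper, the fixed seed, and the episode count are all assumptions made for the example.

```python
import random

def split_episodes(episodes, train_frac=0.8, seed=0):
    """Shuffle episodes and split them into train/validation subsets.

    episodes: list of episode identifiers (or episode objects)
    train_frac: fraction assigned to the training set (0.8 per the paper)
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = episodes[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

# Stand-ins for background training episodes; the real count differs.
episodes = list(range(1000))
train_set, val_set = split_episodes(episodes)
print(len(train_set), len(val_set))  # 800 200
```

Comparing the training-set and validation-set MSE of a model fit on `train_set` (as the quoted passage describes) is then a standard check that the models learned the background tasks rather than memorizing them.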
Hardware Specification | No | The paper does not provide specific hardware details for running the experiments.
Software Dependencies | No | The paper mentions software components such as 'Python', 'PyTorch', 'U-Net', 'LSTM', and 'MLP' but does not specify their version numbers.
Experiment Setup | No | The paper describes model architectures and the general training approach (e.g., 'trained on 80% of the background training episodes'; 'encode the familiarization trials as context using either a bidirectional LSTM or an MLP'), but it does not provide concrete hyperparameter values or detailed system-level training settings in the main text; it refers readers to Appendix C for 'full model specifications'.