Baby Intuitions Benchmark (BIB): Discerning the goals, preferences, and actions of others
Authors: Kanishk Gandhi, Gala Stojnic, Brenden M. Lake, Moira R. Dillon
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The Baby Intuitions Benchmark (BIB) challenges machines to predict the plausibility of an agent's behavior based on the underlying causes of its actions. Because BIB's content and paradigm are adopted from developmental cognitive science, BIB allows for direct comparison between human and machine performance. Nevertheless, recently proposed deep-learning-based agency reasoning models fail to show infant-like reasoning, leaving BIB an open challenge. The models were trained on 80% of the background training episodes (training set), and the remaining episodes were used for validation (validation set). A comparison of the MSE loss (on pixels for the video model and in the action space for the BC and RL models) on the training and validation sets indicated that the models had learned the training tasks successfully (see Appendix C). The results of our baselines are presented in Table 1. |
| Researcher Affiliation | Academia | Kanishk Gandhi (New York University), Gala Stojnic (New York University), Brenden M. Lake (New York University), Moira R. Dillon (New York University) |
| Pseudocode | No | The paper does not contain explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | The dataset and code are available here: https://kanishkgandhi.com/bib |
| Open Datasets | Yes | The dataset and code are available here: https://kanishkgandhi.com/bib |
| Dataset Splits | Yes | The models were trained on 80% of the background training episodes (training set), and the rest of the episodes were used for validation (validation set). |
| Hardware Specification | No | The paper does not provide specific hardware details used for running experiments. |
| Software Dependencies | No | The paper mentions software components like 'Python', 'PyTorch', 'U-Net', 'LSTM', and 'MLP' but does not specify their version numbers. |
| Experiment Setup | No | The paper describes model architectures and general training approaches (e.g., 'trained on 80% of the background training episodes', 'encode the familiarization trials as context using either a bidirectional LSTM or an MLP'), but does not provide concrete hyperparameter values or detailed system-level training settings in the main text. It refers to Appendix C for 'full model specifications'. |
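The 80%/20% train/validation split of the background training episodes noted in the table can be sketched as below. This is a minimal illustration, not the paper's actual code; the function name, seed, and episode representation are assumptions for the example.

```python
import random

def split_episodes(episodes, train_frac=0.8, seed=0):
    """Shuffle episodes and split them into a training set and a
    validation set (the remaining episodes), as described in the paper."""
    rng = random.Random(seed)
    shuffled = list(episodes)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

# With 100 hypothetical episode IDs, 80 go to training and 20 to validation.
train_set, val_set = split_episodes(range(100))
print(len(train_set), len(val_set))  # 80 20
```

Comparing a loss such as MSE on the two resulting sets (as the authors do for the video, BC, and RL baselines) then indicates whether the models have fit the background tasks without overfitting.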