Neural Reasoning about Agents’ Goals, Preferences, and Actions

Authors: Matteo Bortoletto, Lei Shi, Andreas Bulling

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | When evaluated on the challenging Baby Intuitions Benchmark, IRENE achieves new state-of-the-art performance on three out of its five tasks with up to 48.9% improvement. Our analyses demonstrate the effectiveness of IRENE in combining prior knowledge gained during training for unseen evaluation tasks. We also investigate, for the first time, the influence of the training tasks on test performance.
Researcher Affiliation | Academia | University of Stuttgart, Germany; {matteo.bortoletto, lei.shi, andreas.bulling}@vis.uni-stuttgart.de
Pseudocode | No | The paper describes the model architecture and its components in detail within the "Method" section and Figure 2, but it does not include any specific pseudocode blocks or algorithms labeled as such.
Open Source Code | No | The paper does not contain any explicit statement about releasing open-source code for the methodology described, nor does it provide a link to a code repository.
Open Datasets | Yes | Inspired by behavioural experiments with infants, Gandhi et al. have recently introduced the Baby Intuitions Benchmark (Gandhi et al. 2021, BIB), a set of tasks that require an observer model to reason about agents' goals, preferences, and actions by observing their behaviour in a gridworld environment.
Dataset Splits | No | The paper mentions training and evaluation sets as well as familiarisation/test trials within the Baby Intuitions Benchmark, but it does not explicitly define a separate validation split or provide percentages/counts for such a split for hyperparameter tuning or early stopping.
Hardware Specification | No | The paper describes the model architecture and training details (e.g., hidden dimensions, activation functions) but does not provide any specific hardware specifications such as GPU models, CPU types, or memory amounts used for running the experiments.
Software Dependencies | No | The paper mentions architectural components like "GraphSAGE layers" and "transformer encoder" and cites related works, but it does not specify any software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x).
Experiment Setup | Yes | IRENE's feature fusion module encodes the node features using linear layers of hidden dimension 96. The state encoder consists of two GraphSAGE layers for each relation, with hidden dimension 96 and ELU activation (Clevert, Unterthiner, and Hochreiter 2015). The transformer encoder consists of a stack of six layers with four attention heads, feedforward dimension 512 and GELU activations (Hendrycks and Gimpel 2016). The prediction net uses the same GNN and feature fusion module used in the context encoder. The MLP policy has hidden dimensions 256, 128 and 256 and output dimension two, corresponding to the (x, y) coordinates of the agent in the next frame. Additional training details are reported in the Appendix.
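
The hyperparameters quoted in the Experiment Setup row pin down most of the layer sizes. The following is a minimal, non-authoritative sketch of how such modules could be wired, assuming PyTorch and PyTorch Geometric (neither library is confirmed by the paper). The input feature dimension, number of relations, relation aggregation, and overall forward pass are illustrative assumptions; only the layer widths, layer counts, and activations come from the row above.

```python
# Sketch only: PyTorch / PyTorch Geometric are assumptions, as are the input
# dimension, number of relations, and the aggregation across relations.
import torch
import torch.nn as nn
from torch_geometric.nn import SAGEConv


class FeatureFusion(nn.Module):
    """Encodes node features with linear layers of hidden dimension 96."""
    def __init__(self, in_dim: int, hidden_dim: int = 96):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ELU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, x):
        return self.proj(x)


class RelationalStateEncoder(nn.Module):
    """Two GraphSAGE layers per relation, hidden dimension 96, ELU activation."""
    def __init__(self, hidden_dim: int = 96, num_relations: int = 3):
        super().__init__()
        self.relation_convs = nn.ModuleList([
            nn.ModuleList([SAGEConv(hidden_dim, hidden_dim),
                           SAGEConv(hidden_dim, hidden_dim)])
            for _ in range(num_relations)
        ])
        self.act = nn.ELU()

    def forward(self, x, edge_indices):
        # edge_indices: one edge_index tensor per relation (assumed layout).
        outs = []
        for convs, edge_index in zip(self.relation_convs, edge_indices):
            h = x
            for conv in convs:
                h = self.act(conv(h, edge_index))
            outs.append(h)
        # Summing over relations is an assumption, not stated in the paper.
        return torch.stack(outs).sum(dim=0)


def build_context_transformer(hidden_dim: int = 96):
    """Six-layer transformer encoder, four heads, feedforward 512, GELU."""
    layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=4,
                                       dim_feedforward=512, activation="gelu")
    return nn.TransformerEncoder(layer, num_layers=6)


def build_policy_mlp(in_dim: int = 96):
    """MLP policy with hidden dims 256, 128, 256 and a 2-dim (x, y) output."""
    return nn.Sequential(
        nn.Linear(in_dim, 256), nn.ELU(),
        nn.Linear(256, 128), nn.ELU(),
        nn.Linear(128, 256), nn.ELU(),
        nn.Linear(256, 2),
    )
```

An actual reproduction would also need the wiring between the context encoder, prediction net, and policy described in the paper's Figure 2 and Appendix, which this sketch does not attempt.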