Can Transformers Capture Spatial Relations between Objects?
Authors: Chuan Wen, Dinesh Jayaraman, Yang Gao
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To study the capability of current computer vision systems to recognize physically grounded spatial relations, we start by proposing precise relation definitions that permit consistently annotating a benchmark dataset. Despite the apparent simplicity of this task relative to others in the recognition literature, we observe that existing approaches perform poorly on this benchmark. We propose new approaches exploiting the long-range attention capabilities of transformers for this task, and evaluate key design principles. We identify a simple RelatiViT architecture and demonstrate that it outperforms all current approaches. To our knowledge, this is the first method to convincingly outperform naive baselines on spatial relation prediction in in-the-wild settings. |
| Researcher Affiliation | Collaboration | Chuan Wen (1,3,4), Dinesh Jayaraman (2), Yang Gao (1,3,4); 1: Institute for Interdisciplinary Information Sciences, Tsinghua University; 2: University of Pennsylvania; 3: Shanghai Artificial Intelligence Laboratory; 4: Shanghai Qi Zhi Institute |
| Pseudocode | No | The paper describes architectural designs and processes but does not include any explicit pseudocode blocks or algorithms. |
| Open Source Code | Yes | The code and datasets are available at https://sites.google.com/view/spatial-relation. |
| Open Datasets | Yes | The code and datasets are available at https://sites.google.com/view/spatial-relation. We use two datasets, Rel3D (synthetic; Goyal et al., 2020) and SpatialSense+ (realistic; based on SpatialSense, Yang et al., 2019), to comprehensively benchmark spatial relation prediction approaches. |
| Dataset Splits | Yes | Table 1 (statistics of Rel3D and SpatialSense+ in our benchmark): Rel3D has 30 predicates with 20,454 train / 2,138 validation / 4,744 test samples; SpatialSense+ has 9 predicates with 5,346 train / 808 validation / 1,100 test samples. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for experiments, such as GPU or CPU models. |
| Software Dependencies | No | The paper mentions various software components and models (e.g., GloVe, iBOT, DeiT, MoCo-v3, CLIP, DINO, MAE, Blender), but does not provide specific version numbers for any software libraries, frameworks, or programming languages used. |
| Experiment Setup | Yes | The hyperparameters of the best model, RelatiViT, are given in Table 8; the hyperparameters of the other baselines follow the settings in Goyal et al. (2020) and Yang et al. (2019). Table 8 reports two values per hyperparameter (one per benchmark dataset): epochs 200 / 100, optimizer AdamW / AdamW, learning rate 1e-5 / 1e-5, lr schedule cosine / cosine, lr warm-up epochs 5 / 5, weight decay 1e-4 / 1e-3, layer decay 0.75 / 0.75, query embedding pooling max / max. (Hedged sketches of this setup follow the table.) |
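
The Experiment Setup row above describes a standard ViT fine-tuning recipe: AdamW, a small learning rate, cosine decay with linear warm-up, and layer-wise learning-rate decay. The sketch below is a minimal reconstruction of such a configuration under the reported values; it is not the authors' released code, and the `layer_wise_lr_groups` helper and the timm-style `patch_embed`/`blocks`/`head` attributes are assumptions.

```python
# Minimal sketch (not the authors' code) of the Table 8 recipe for Rel3D:
# AdamW, lr 1e-5, cosine schedule with 5 warm-up epochs, layer-wise lr decay 0.75.
import math
import torch

def layer_wise_lr_groups(vit, base_lr=1e-5, layer_decay=0.75, weight_decay=1e-4):
    """Build AdamW parameter groups in which deeper transformer blocks get larger
    learning rates (base_lr scaled by layer_decay ** depth-from-the-top).
    Assumes a timm-style ViT exposing .patch_embed, .blocks and .head."""
    num_layers = len(vit.blocks) + 1
    groups = [{
        "params": list(vit.patch_embed.parameters()),   # embedding layer: smallest lr
        "lr": base_lr * layer_decay ** num_layers,
        "weight_decay": weight_decay,
    }]
    for i, block in enumerate(vit.blocks):
        groups.append({
            "params": list(block.parameters()),
            "lr": base_lr * layer_decay ** (num_layers - 1 - i),
            "weight_decay": weight_decay,
        })
    groups.append({                                     # task head: full base lr
        "params": list(vit.head.parameters()),
        "lr": base_lr,
        "weight_decay": weight_decay,
    })
    return groups

def warmup_cosine(epoch, warmup_epochs=5, total_epochs=200):
    """Learning-rate multiplier: linear warm-up, then cosine decay to zero."""
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

# Usage (model is any ViT-style nn.Module; on SpatialSense+ use weight_decay=1e-3
# and total_epochs=100):
# optimizer = torch.optim.AdamW(layer_wise_lr_groups(model))
# scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, warmup_cosine)
```

The "query embedding pooling: max" entry suggests that per-object representations are obtained by max-pooling transformer token embeddings. The sketch below illustrates one plausible reading of that idea, max-pooling ViT patch tokens that fall inside the subject and object bounding boxes and classifying the predicate from their concatenation; it is not the authors' RelatiViT implementation, and the timm backbone name and single-box-pair handling are simplifying assumptions.

```python
# Minimal sketch (not the authors' RelatiViT) of relation prediction by max-pooling
# ViT patch tokens inside the subject/object boxes ("query embedding pooling: max").
# Assumes a timm version whose forward_features returns the full token sequence.
import torch
import torch.nn as nn
import timm

class BoxPooledRelationHead(nn.Module):
    def __init__(self, num_predicates=30, backbone="vit_base_patch16_224"):
        super().__init__()
        self.vit = timm.create_model(backbone, pretrained=False)
        self.patch = self.vit.patch_embed.patch_size[0]
        self.classifier = nn.Linear(2 * self.vit.embed_dim, num_predicates)

    def _box_mask(self, box, grid):
        # box = (x0, y0, x1, y1) in pixels; mark the patch-grid cells the box overlaps.
        x0, y0, x1, y1 = (int(v) // self.patch for v in box)
        mask = torch.zeros(grid, grid, dtype=torch.bool)
        mask[max(y0, 0):y1 + 1, max(x0, 0):x1 + 1] = True
        return mask.flatten()

    def forward(self, image, subject_box, object_box):
        # For simplicity, one subject/object box pair is shared across the batch.
        tokens = self.vit.forward_features(image)    # (B, 1 + N, dim), incl. CLS
        patches = tokens[:, 1:, :]                    # drop the CLS token
        grid = int(patches.shape[1] ** 0.5)
        pooled = []
        for box in (subject_box, object_box):
            mask = self._box_mask(box, grid).to(patches.device)
            pooled.append(patches[:, mask, :].amax(dim=1))   # max-pool tokens in box
        return self.classifier(torch.cat(pooled, dim=-1))

# logits = BoxPooledRelationHead()(torch.randn(1, 3, 224, 224),
#                                  subject_box=(30, 40, 120, 200),
#                                  object_box=(100, 60, 220, 180))
```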