Can Transformers Capture Spatial Relations between Objects?
Authors: Chuan Wen, Dinesh Jayaraman, Yang Gao
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To study the capability of current computer vision systems to recognize physically grounded spatial relations, we start by proposing precise relation definitions that permit consistently annotating a benchmark dataset. Despite the apparent simplicity of this task relative to others in the recognition literature, we observe that existing approaches perform poorly on this benchmark. We propose new approaches exploiting the long-range attention capabilities of transformers for this task, and evaluate key design principles. We identify a simple RelatiViT architecture and demonstrate that it outperforms all current approaches. To our knowledge, this is the first method to convincingly outperform naive baselines on spatial relation prediction in in-the-wild settings. |
| Researcher Affiliation | Collaboration | Chuan Wen (1,3,4), Dinesh Jayaraman (2), Yang Gao (1,3,4); 1: Institute for Interdisciplinary Information Sciences, Tsinghua University; 2: University of Pennsylvania; 3: Shanghai Artificial Intelligence Laboratory; 4: Shanghai Qi Zhi Institute |
| Pseudocode | No | The paper describes architectural designs and processes but does not include any explicit pseudocode blocks or algorithms. |
| Open Source Code | Yes | The code and datasets are available at https://sites.google.com/view/spatial-relation. |
| Open Datasets | Yes | The code and datasets are available at https://sites.google.com/view/spatial-relation. We use two datasets, Rel3D (synthetic; Goyal et al., 2020) and SpatialSense+ (realistic; based on SpatialSense, Yang et al., 2019), to comprehensively benchmark spatial relation prediction approaches. |
| Dataset Splits | Yes | Table 1 (statistics of Rel3D and SpatialSense+ in our benchmark): Rel3D has 30 predicates with 20,454 train / 2,138 validation / 4,744 test samples; SpatialSense+ has 9 predicates with 5,346 train / 808 validation / 1,100 test samples. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for experiments, such as GPU or CPU models. |
| Software Dependencies | No | The paper mentions various software components and models (e.g., GloVe, iBOT, DeiT, MoCo-v3, CLIP, DINO, MAE, Blender), but does not provide specific version numbers for any software libraries, frameworks, or programming languages used. |
| Experiment Setup | Yes | The hyperparameters of the best model, RelatiViT, are given in Table 8; the hyperparameters of the other baselines follow the settings in Goyal et al. (2020) and Yang et al. (2019). Table 8 reports two values per hyperparameter (one per benchmark dataset): epochs 200 / 100, optimizer AdamW / AdamW, learning rate 1e-5 / 1e-5, lr schedule cosine / cosine, lr warm-up epochs 5 / 5, weight decay 1e-4 / 1e-3, layer decay 0.75 / 0.75, query embedding pooling max / max. (Hedged sketches of this setup follow the table.) |
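
The Experiment Setup row above describes a standard ViT fine-tuning recipe: AdamW, a small learning rate, cosine decay with linear warm-up, and layer-wise learning-rate decay. The sketch below is a minimal reconstruction of such a configuration under the reported values; it is not the authors' released code, and the `layer_wise_lr_groups` helper and the timm-style `patch_embed`/`blocks`/`head` attributes are assumptions.

```python
# Minimal sketch (not the authors' code) of the Table 8 recipe for Rel3D:
# AdamW, lr 1e-5, cosine schedule with 5 warm-up epochs, layer-wise lr decay 0.75.
import math
import torch

def layer_wise_lr_groups(vit, base_lr=1e-5, layer_decay=0.75, weight_decay=1e-4):
    """Build AdamW parameter groups in which deeper transformer blocks get larger
    learning rates (base_lr scaled by layer_decay ** depth-from-the-top).
    Assumes a timm-style ViT exposing .patch_embed, .blocks and .head."""
    num_layers = len(vit.blocks) + 1
    groups = [{
        "params": list(vit.patch_embed.parameters()),   # embedding layer: smallest lr
        "lr": base_lr * layer_decay ** num_layers,
        "weight_decay": weight_decay,
    }]
    for i, block in enumerate(vit.blocks):
        groups.append({
            "params": list(block.parameters()),
            "lr": base_lr * layer_decay ** (num_layers - 1 - i),
            "weight_decay": weight_decay,
        })
    groups.append({                                     # task head: full base lr
        "params": list(vit.head.parameters()),
        "lr": base_lr,
        "weight_decay": weight_decay,
    })
    return groups

def warmup_cosine(epoch, warmup_epochs=5, total_epochs=200):
    """Learning-rate multiplier: linear warm-up, then cosine decay to zero."""
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

# Usage (model is any ViT-style nn.Module; on SpatialSense+ use weight_decay=1e-3
# and total_epochs=100):
# optimizer = torch.optim.AdamW(layer_wise_lr_groups(model))
# scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, warmup_cosine)
```

The "query embedding pooling: max" entry suggests that per-object representations are obtained by max-pooling transformer token embeddings. The sketch below illustrates one plausible reading of that idea, max-pooling ViT patch tokens that fall inside the subject and object bounding boxes and classifying the predicate from their concatenation; it is not the authors' RelatiViT implementation, and the timm backbone name and single-box-pair handling are simplifying assumptions.

```python
# Minimal sketch (not the authors' RelatiViT) of relation prediction by max-pooling
# ViT patch tokens inside the subject/object boxes ("query embedding pooling: max").
# Assumes a timm version whose forward_features returns the full token sequence.
import torch
import torch.nn as nn
import timm

class BoxPooledRelationHead(nn.Module):
    def __init__(self, num_predicates=30, backbone="vit_base_patch16_224"):
        super().__init__()
        self.vit = timm.create_model(backbone, pretrained=False)
        self.patch = self.vit.patch_embed.patch_size[0]
        self.classifier = nn.Linear(2 * self.vit.embed_dim, num_predicates)

    def _box_mask(self, box, grid):
        # box = (x0, y0, x1, y1) in pixels; mark the patch-grid cells the box overlaps.
        x0, y0, x1, y1 = (int(v) // self.patch for v in box)
        mask = torch.zeros(grid, grid, dtype=torch.bool)
        mask[max(y0, 0):y1 + 1, max(x0, 0):x1 + 1] = True
        return mask.flatten()

    def forward(self, image, subject_box, object_box):
        # For simplicity, one subject/object box pair is shared across the batch.
        tokens = self.vit.forward_features(image)    # (B, 1 + N, dim), incl. CLS
        patches = tokens[:, 1:, :]                    # drop the CLS token
        grid = int(patches.shape[1] ** 0.5)
        pooled = []
        for box in (subject_box, object_box):
            mask = self._box_mask(box, grid).to(patches.device)
            pooled.append(patches[:, mask, :].amax(dim=1))   # max-pool tokens in box
        return self.classifier(torch.cat(pooled, dim=-1))

# logits = BoxPooledRelationHead()(torch.randn(1, 3, 224, 224),
#                                  subject_box=(30, 40, 120, 200),
#                                  object_box=(100, 60, 220, 180))
```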