GTA: A Geometry-Aware Attention Mechanism for Multi-View Transformers

Authors: Takeru Miyato, Bernhard Jaeger, Max Welling, Andreas Geiger

ICLR 2024

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | By evaluating on multiple novel view synthesis (NVS) datasets in the sparse wide-baseline multi-view setting, we show that our attention, called Geometric Transform Attention (GTA), improves learning efficiency and performance of state-of-the-art transformer-based NVS models without any additional learned parameters and only minor computational overhead. |
| Researcher Affiliation | Academia | 1 University of Tübingen, Tübingen AI Center; 2 University of Amsterdam |
| Pseudocode | Yes | Algorithm 1 provides an algorithmic description based on Eq. (6) for single-head self-attention. |
| Open Source Code | Yes | Code: https://github.com/autonomousvision/gta. |
| Open Datasets | Yes | We evaluate our method on two synthetic 360° datasets with sparse and wide baseline views (CLEVR-TR and MSN-Hard) and on two datasets of real scenes with distant views (RealEstate10k and ACID). |
| Dataset Splits | Yes | Fig. 11: Mean and standard deviation plots of validation PSNRs on CLEVR-TR and MSN-Hard. |
| Hardware Specification | Yes | We train with 4 RTX 2080 Ti GPUs on CLEVR-TR and with 4 Nvidia A100 GPUs on the other datasets. |
| Software Dependencies | No | The paper mentions the optimizer (AdamW) and normalization techniques (RMSNorm) and uses bfloat16 and float32 precision, but does not specify versions for major software frameworks (e.g., PyTorch, TensorFlow) or other libraries. |
| Experiment Setup | Yes | Table 15 shows dataset properties and hyperparameters used in the experiments. The models are trained for 2M and 4M iterations on CLEVR-TR and MSN-Hard, respectively, and for 300K iterations on both RealEstate10k and ACID. Learning rate 1e-4. Batch size 32. |
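The experiment-setup row can be collected into a single configuration object. This is a minimal illustrative sketch: the field names and the dictionary layout are assumptions for readability, not taken from the paper or its code release; only the values (optimizer, learning rate, batch size, iteration counts, precisions) come from the quotes above.

```python
# Hedged sketch of the reported training setup. Field names are illustrative
# assumptions; values are taken from the paper's reported hyperparameters.
train_config = {
    "optimizer": "AdamW",           # optimizer named in the paper
    "learning_rate": 1e-4,          # "Learning rate 1e-4"
    "batch_size": 32,               # "Batch size 32"
    "iterations": {                 # per-dataset training lengths
        "CLEVR-TR": 2_000_000,      # 2M iterations
        "MSN-Hard": 4_000_000,      # 4M iterations
        "RealEstate10k": 300_000,   # 300K iterations
        "ACID": 300_000,            # 300K iterations
    },
    "precision": ["bfloat16", "float32"],  # precisions mentioned in the paper
}

# Total training iterations across all four datasets.
total_iterations = sum(train_config["iterations"].values())
print(total_iterations)  # 6600000
```

A structured config like this also makes the "Software Dependencies: No" finding concrete: nothing in the reported setup pins framework versions, so a re-implementation would still have to guess them.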