LookHere: Vision Transformers with Directed Attention Generalize and Extrapolate

Authors: Anthony Fuller, Daniel Kyrollos, Yousef Yassin, James Green

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that LookHere improves performance on classification (avg. 1.6%), against adversarial attack (avg. 5.4%), and decreases calibration error (avg. 1.5%) on ImageNet without extrapolation. With extrapolation, LookHere outperforms the current SoTA position encoding method, 2D-RoPE, by 21.7% on ImageNet when trained at 224² px and tested at 1024² px. ... [Section 4, Experiments] Deep neural networks including ViTs can be sensitive to seemingly minor hyperparameter changes when trained from scratch. (See the patch-count sketch after this table.)
Researcher Affiliation | Academia | Anthony Fuller, Daniel G. Kyrollos, Yousef Yassin, James R. Green; Department of Systems and Computer Engineering, Carleton University, Ottawa, Ontario, Canada; anthony.fuller@carleton.ca
Pseudocode | No | The paper describes its method using mathematical formulas and descriptive text, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or structured code-like steps.
Open Source Code | Yes | Code and data are available at: https://github.com/GreenCUBIC/lookhere
Open Datasets | Yes | We train all models from scratch for 150 epochs on the first 99% of ImageNet-1k using Huggingface's datasets library, i.e., load_dataset("imagenet-1k", split="train[:99%]")
Dataset Splits | Yes | We train all models from scratch for 150 epochs on the first 99% of ImageNet-1k using Huggingface's datasets library, i.e., load_dataset("imagenet-1k", split="train[:99%]"), holding the last 1% as a validation set called minival, following [77, 81] (see Appendix A.4.1 for other hyperparameters). (A split sketch follows the table.)
Hardware Specification | Yes | Training takes around 3 days on an RTX 4090 GPU. Thus, all 80 training runs take around 240 GPU-days.
Software Dependencies | No | The paper mentions 'AdamW [109] using the default PyTorch implementation' and references the 'timm [104]' library, but it does not specify version numbers for PyTorch, timm, or other key software components, which are necessary for reproducible dependency management. (An illustrative optimizer sketch follows the table.)
Experiment Setup | Yes | Our 80 training runs result from the following Cartesian product: Position encoding: 1D-learn, 2D-sincos, Factorized, Fourier, RPE-learn, 2D-ALiBi, 2D-RoPE, LH-180, LH-90, LH-45; Augmentations: RandAugment(2, 15) [80], 3-Augment [17]; Learning rate: 1.5 × 10⁻³, 3.0 × 10⁻³; Weight decay: 0.02, 0.05. For each configuration, we train a ViT-B/16 on 99% of the ImageNet training set, holding the last 1% as a validation set called minival, following [77, 81] (see Appendix A.4.1 for other hyperparameters). We train all models from scratch for 150 epochs on 224² px images. (The configuration grid is enumerated in a sketch after this table.)
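
The extrapolation result quoted in the Research Type row compares training at 224² px with testing at 1024² px. Because the models are ViT-B/16 (16 px patches, per the Experiment Setup row), that change grows the patch grid from 14 × 14 to 64 × 64 tokens. The sketch below only performs this arithmetic; it is illustrative and not taken from the paper's code.

    # Patch-grid arithmetic for ViT-B/16 extrapolation (illustrative, not the authors' code).
    PATCH_SIZE = 16  # ViT-B/16 uses 16 x 16 px patches

    def token_count(image_px: int, patch_px: int = PATCH_SIZE) -> int:
        """Number of patch tokens for a square image (ignoring the class token)."""
        side = image_px // patch_px
        return side * side

    train_tokens = token_count(224)   # 14 * 14 = 196 tokens seen during training
    test_tokens = token_count(1024)   # 64 * 64 = 4096 tokens at test time
    print(train_tokens, test_tokens, test_tokens / train_tokens)  # 196 4096 ~20.9x

This roughly 21x growth in sequence length is what the position-encoding methods compared in the table must extrapolate to.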
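The Open Datasets and Dataset Splits rows quote the Huggingface datasets call for the 99% training slice. Below is a minimal sketch of that split scheme; the train[:99%] string is quoted from the paper, while the complementary train[99%:] slice for the held-out minival set is an assumption matching the paper's description of "holding the last 1%".

    # Minimal sketch of the ImageNet-1k split described in the paper.
    # Requires the `datasets` library and access to the gated "imagenet-1k" dataset.
    from datasets import load_dataset

    # First 99% of the official training split, as quoted in the paper.
    train_ds = load_dataset("imagenet-1k", split="train[:99%]")

    # Last 1% held out as the "minival" validation set; this slice string is an
    # assumption matching the paper's description, not a quoted call.
    minival_ds = load_dataset("imagenet-1k", split="train[99%:]")

    print(len(train_ds), len(minival_ds))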
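The Software Dependencies row notes that the paper uses AdamW via the default PyTorch implementation and the timm library, without pinned versions. The sketch below shows one plausible setup; the timm model name and the particular learning-rate/weight-decay pair are assumptions (the values are two points from the Experiment Setup grid), not the authors' released configuration.

    # Illustrative optimizer setup (assumed, not the authors' released configuration).
    import timm   # version not pinned in the paper
    import torch

    # ViT-B/16 at 224 px, trained from scratch (no pretrained weights).
    model = timm.create_model("vit_base_patch16_224", pretrained=False, num_classes=1000)

    # AdamW from the default PyTorch implementation; lr and weight decay are two of the
    # grid points listed in the Experiment Setup row (1.5e-3 or 3.0e-3; 0.02 or 0.05).
    optimizer = torch.optim.AdamW(model.parameters(), lr=1.5e-3, weight_decay=0.02)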
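The Experiment Setup row describes the 80 runs as a Cartesian product over four factors (10 position encodings × 2 augmentations × 2 learning rates × 2 weight decays). The sketch below simply enumerates that grid with itertools.product; the option names are copied from the row and nothing about the training loop itself is implied.

    # Enumerate the 10 x 2 x 2 x 2 = 80 training configurations from the Experiment Setup row.
    from itertools import product

    position_encodings = [
        "1D-learn", "2D-sincos", "Factorized", "Fourier", "RPE-learn",
        "2D-ALiBi", "2D-RoPE", "LH-180", "LH-90", "LH-45",
    ]
    augmentations = ["RandAugment(2, 15)", "3-Augment"]
    learning_rates = [1.5e-3, 3.0e-3]
    weight_decays = [0.02, 0.05]

    configs = list(product(position_encodings, augmentations, learning_rates, weight_decays))
    assert len(configs) == 80  # matches the 80 runs (~240 GPU-days) reported in the paper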