LookHere: Vision Transformers with Directed Attention Generalize and Extrapolate
Authors: Anthony Fuller, Daniel Kyrollos, Yousef Yassin, James Green
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that LookHere improves performance on classification (avg. 1.6%), against adversarial attack (avg. 5.4%), and decreases calibration error (avg. 1.5%) on ImageNet without extrapolation. With extrapolation, LookHere outperforms the current SoTA position encoding method, 2D-RoPE, by 21.7% on ImageNet when trained at 224² px and tested at 1024² px. ... 4 Experiments Deep neural networks including ViTs can be sensitive to seemingly minor hyperparameter changes when trained from scratch. |
| Researcher Affiliation | Academia | Anthony Fuller, Daniel G. Kyrollos, Yousef Yassin, James R. Green, Department of Systems and Computer Engineering, Carleton University, Ottawa, Ontario, Canada, anthony.fuller@carleton.ca |
| Pseudocode | No | The paper describes its method using mathematical formulas and descriptive text, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or structured code-like steps. |
| Open Source Code | Yes | Code and data are available at: https://github.com/GreenCUBIC/lookhere |
| Open Datasets | Yes | We train all models from scratch for 150 epochs on the first 99% of ImageNet-1k using Huggingface's datasets library, i.e., load_dataset("imagenet-1k", split="train[:99%]") |
| Dataset Splits | Yes | We train all models from scratch for 150 epochs on the first 99% of ImageNet-1k using Huggingface's datasets library, i.e., load_dataset("imagenet-1k", split="train[:99%]"), holding the last 1% as a validation set called minival, following [77, 81] (see Appendix A.4.1 for other hyperparameters). *(See the data-loading sketch after this table.)* |
| Hardware Specification | Yes | Training takes around 3 days on an RTX 4090 GPU. Thus, all 80 training runs take around 240 GPU-days. |
| Software Dependencies | No | The paper mentions 'AdamW [109] using the default PyTorch implementation' and references the 'timm [104]' library, but it does not specify explicit version numbers for PyTorch, timm, or other key software components, which are necessary for reproducible dependency management. *(See the optimizer sketch after this table.)* |
| Experiment Setup | Yes | Our 80 training runs result from the following Cartesian product: Position encoding: 1D-learn, 2D-sincos, Factorized, Fourier, RPE-learn, 2D-ALiBi, 2D-RoPE, LH-180, LH-90, LH-45; Augmentations: RandAugment(2, 15) [80], 3-Augment [17]; Learning rate: 1.5 × 10⁻³, 3.0 × 10⁻³; Weight decay: 0.02, 0.05. For each configuration, we train a ViT-B/16 on 99% of the ImageNet training set, holding the last 1% as a validation set called minival, following [77, 81] (see Appendix A.4.1 for other hyperparameters). We train all models from scratch for 150 epochs on 224² px images. *(See the configuration sketch after this table.)* |
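
The Dataset Splits row quotes the paper's split string directly. A minimal sketch of how that train/minival split could be loaded, assuming the Hugging Face `datasets` library; the complementary `train[99%:]` slice for the minival set is an assumption based on the quoted description, not code from the paper:

```python
from datasets import load_dataset

# Training set: first 99% of the ImageNet-1k train split, as quoted in the paper.
# Note: imagenet-1k on the Hugging Face Hub is gated, so access requires
# accepting the dataset's terms and authenticating first.
train_set = load_dataset("imagenet-1k", split="train[:99%]")

# "minival": the held-out last 1% of the train split (assumed complementary slice).
minival_set = load_dataset("imagenet-1k", split="train[99%:]")

print(len(train_set), len(minival_set))
```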
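
The Experiment Setup row describes 80 runs as a Cartesian product over four hyperparameters. A sketch enumerating that grid; the values are taken from the quoted text, while the dict-based config container and variable names are assumptions for illustration:

```python
from itertools import product

position_encodings = [
    "1D-learn", "2D-sincos", "Factorized", "Fourier", "RPE-learn",
    "2D-ALiBi", "2D-RoPE", "LH-180", "LH-90", "LH-45",
]
augmentations = ["RandAugment(2, 15)", "3-Augment"]
learning_rates = [1.5e-3, 3.0e-3]
weight_decays = [0.02, 0.05]

# One training configuration per element of the Cartesian product.
configs = [
    {"pos_enc": pe, "aug": aug, "lr": lr, "wd": wd}
    for pe, aug, lr, wd in product(
        position_encodings, augmentations, learning_rates, weight_decays
    )
]
assert len(configs) == 80  # 10 x 2 x 2 x 2 = 80 training runs, matching the paper
```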
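
The Software Dependencies row notes that AdamW is used "using the default PyTorch implementation" without pinned versions. A minimal sketch of that optimizer setup; the module is a placeholder (the paper trains a ViT-B/16), and the lr/weight-decay pair shown is just one point of the grid above:

```python
import torch

# Placeholder module standing in for the ViT-B/16 trained in the paper.
model = torch.nn.Linear(768, 1000)

# AdamW with PyTorch's default implementation, as stated in the paper;
# hyperparameter values taken from the Experiment Setup row.
optimizer = torch.optim.AdamW(model.parameters(), lr=1.5e-3, weight_decay=0.02)
```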