LookHere: Vision Transformers with Directed Attention Generalize and Extrapolate

Authors: Anthony Fuller, Daniel Kyrollos, Yousef Yassin, James Green

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that LookHere improves performance on classification (avg. 1.6%), against adversarial attack (avg. 5.4%), and decreases calibration error (avg. 1.5%) on ImageNet without extrapolation. With extrapolation, LookHere outperforms the current SoTA position encoding method, 2D-RoPE, by 21.7% on ImageNet when trained at 224² px and tested at 1024² px. ... [Section 4, Experiments] Deep neural networks including ViTs can be sensitive to seemingly minor hyperparameter changes when trained from scratch. (See the patch-count sketch after this table.)
Researcher Affiliation | Academia | Anthony Fuller, Daniel G. Kyrollos, Yousef Yassin, James R. Green; Department of Systems and Computer Engineering, Carleton University, Ottawa, Ontario, Canada; anthony.fuller@carleton.ca
Pseudocode | No | The paper describes its method using mathematical formulas and descriptive text, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or structured code-like steps.
Open Source Code | Yes | Code and data are available at: https://github.com/GreenCUBIC/lookhere
Open Datasets | Yes | We train all models from scratch for 150 epochs on the first 99% of ImageNet-1k using Huggingface's datasets library, i.e., load_dataset("imagenet-1k", split="train[:99%]")
Dataset Splits | Yes | We train all models from scratch for 150 epochs on the first 99% of ImageNet-1k using Huggingface's datasets library, i.e., load_dataset("imagenet-1k", split="train[:99%]"), holding the last 1% as a validation set called minival, following [77, 81] (see Appendix A.4.1 for other hyperparameters). (A split sketch follows the table.)
Hardware Specification | Yes | Training takes around 3 days on an RTX 4090 GPU. Thus, all 80 training runs take around 240 GPU-days.
Software Dependencies | No | The paper mentions 'AdamW [109] using the default PyTorch implementation' and references the 'timm [104]' library, but it does not specify version numbers for PyTorch, timm, or other key software components, which are necessary for reproducible dependency management. (An illustrative optimizer sketch follows the table.)
Experiment Setup | Yes | Our 80 training runs result from the following Cartesian product: Position encoding: 1D-learn, 2D-sincos, Factorized, Fourier, RPE-learn, 2D-ALiBi, 2D-RoPE, LH-180, LH-90, LH-45; Augmentations: RandAugment(2, 15) [80], 3-Augment [17]; Learning rate: 1.5 × 10⁻³, 3.0 × 10⁻³; Weight decay: 0.02, 0.05. For each configuration, we train a ViT-B/16 on 99% of the ImageNet training set, holding the last 1% as a validation set called minival, following [77, 81] (see Appendix A.4.1 for other hyperparameters). We train all models from scratch for 150 epochs on 224² px images. (The configuration grid is enumerated in a sketch after this table.)
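
The extrapolation result quoted in the Research Type row compares training at 224² px with testing at 1024² px. Because the models are ViT-B/16 (16 px patches, per the Experiment Setup row), that change grows the patch grid from 14 × 14 to 64 × 64 tokens. The sketch below only performs this arithmetic; it is illustrative and not taken from the paper's code.

    # Patch-grid arithmetic for ViT-B/16 extrapolation (illustrative, not the authors' code).
    PATCH_SIZE = 16  # ViT-B/16 uses 16 x 16 px patches

    def token_count(image_px: int, patch_px: int = PATCH_SIZE) -> int:
        """Number of patch tokens for a square image (ignoring the class token)."""
        side = image_px // patch_px
        return side * side

    train_tokens = token_count(224)   # 14 * 14 = 196 tokens seen during training
    test_tokens = token_count(1024)   # 64 * 64 = 4096 tokens at test time
    print(train_tokens, test_tokens, test_tokens / train_tokens)  # 196 4096 ~20.9x

This roughly 21x growth in sequence length is what the position-encoding methods compared in the table must extrapolate to.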
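The Open Datasets and Dataset Splits rows quote the Huggingface datasets call for the 99% training slice. Below is a minimal sketch of that split scheme; the train[:99%] string is quoted from the paper, while the complementary train[99%:] slice for the held-out minival set is an assumption matching the paper's description of "holding the last 1%".

    # Minimal sketch of the ImageNet-1k split described in the paper.
    # Requires the `datasets` library and access to the gated "imagenet-1k" dataset.
    from datasets import load_dataset

    # First 99% of the official training split, as quoted in the paper.
    train_ds = load_dataset("imagenet-1k", split="train[:99%]")

    # Last 1% held out as the "minival" validation set; this slice string is an
    # assumption matching the paper's description, not a quoted call.
    minival_ds = load_dataset("imagenet-1k", split="train[99%:]")

    print(len(train_ds), len(minival_ds))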
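The Software Dependencies row notes that the paper uses AdamW via the default PyTorch implementation and the timm library, without pinned versions. The sketch below shows one plausible setup; the timm model name and the particular learning-rate/weight-decay pair are assumptions (the values are two points from the Experiment Setup grid), not the authors' released configuration.

    # Illustrative optimizer setup (assumed, not the authors' released configuration).
    import timm   # version not pinned in the paper
    import torch

    # ViT-B/16 at 224 px, trained from scratch (no pretrained weights).
    model = timm.create_model("vit_base_patch16_224", pretrained=False, num_classes=1000)

    # AdamW from the default PyTorch implementation; lr and weight decay are two of the
    # grid points listed in the Experiment Setup row (1.5e-3 or 3.0e-3; 0.02 or 0.05).
    optimizer = torch.optim.AdamW(model.parameters(), lr=1.5e-3, weight_decay=0.02)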
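The Experiment Setup row describes the 80 runs as a Cartesian product over four factors (10 position encodings × 2 augmentations × 2 learning rates × 2 weight decays). The sketch below simply enumerates that grid with itertools.product; the option names are copied from the row and nothing about the training loop itself is implied.

    # Enumerate the 10 x 2 x 2 x 2 = 80 training configurations from the Experiment Setup row.
    from itertools import product

    position_encodings = [
        "1D-learn", "2D-sincos", "Factorized", "Fourier", "RPE-learn",
        "2D-ALiBi", "2D-RoPE", "LH-180", "LH-90", "LH-45",
    ]
    augmentations = ["RandAugment(2, 15)", "3-Augment"]
    learning_rates = [1.5e-3, 3.0e-3]
    weight_decays = [0.02, 0.05]

    configs = list(product(position_encodings, augmentations, learning_rates, weight_decays))
    assert len(configs) == 80  # matches the 80 runs (~240 GPU-days) reported in the paper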