Vision Transformers provably learn spatial structure
Authors: Samy Jelassi, Michael Sander, Yuanzhi Li
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Lastly, we empirically verify that a ViT with positional attention performs similarly to the original one on CIFAR-10/100, SVHN and ImageNet. |
| Researcher Affiliation | Academia | Samy Jelassi (Princeton University, sjelassi@princeton.edu); Michael E. Sander (ENS, CNRS, michael.sander@ens.fr); Yuanzhi Li (Carnegie Mellon University, yuanzhil@andrew.cmu.edu) |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] See supplementary material. |
| Open Datasets | Yes | On the experimental side, we validate in Section 6 that ViTs learn spatial structure in images from the CIFAR-100 dataset... competitive with the vanilla ViT on the ImageNet, CIFAR-10/100 and SVHN datasets. |
| Dataset Splits | Yes | Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See Appendix. |
| Hardware Specification | Yes | Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See Appendix. |
| Software Dependencies | No | The paper mentions 'AdamW' as an optimizer but does not provide specific version numbers for software dependencies or libraries. |
| Experiment Setup | Yes | For the small datasets, we use a ViT with 7 layers, 12 heads and hidden/MLP dimension 384. For ImageNet, we train a 'ViT-tiny-patch16-224' [24]. Both models are trained with standard augmentation techniques [18] and using AdamW with a cosine learning rate scheduler. We run all the experiments for 300 epochs, with batch size 1024 for ImageNet and 128 otherwise and average our results over 5 seeds. We refer to Appendix A for the training details. |
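
The quoted setup maps onto standard vision-training code. Below is a minimal sketch, assuming PyTorch and the timm library (neither framework is named in the paper): the architecture sizes, optimizer, scheduler, and epoch count come from the table row above, while the learning rate, weight decay, the patch size for the small datasets, and the reading of "hidden/MLP dimension 384" as mlp_ratio=1.0 are placeholder assumptions.

```python
# Minimal sketch of the reported setup, assuming PyTorch + timm.
# Only the architecture/optimizer/scheduler choices quoted above are
# from the paper; lr, weight_decay, and patch size are assumptions.
import timm
import torch

# ImageNet model: the paper trains a 'ViT-tiny-patch16-224'.
model = timm.create_model("vit_tiny_patch16_224", pretrained=False)

# Small-dataset model: 7 layers, 12 heads, hidden/MLP dimension 384.
# timm exposes these as depth / num_heads / embed_dim; mlp_ratio=1.0
# makes the MLP dimension equal the hidden dimension (our reading of
# "hidden/MLP dimension 384"). patch_size=4 for 32x32 inputs is assumed.
small_model = timm.models.VisionTransformer(
    img_size=32,
    patch_size=4,
    num_classes=100,  # e.g. CIFAR-100
    embed_dim=384,
    depth=7,
    num_heads=12,
    mlp_ratio=1.0,
)

# AdamW with a cosine learning-rate schedule over the 300 training epochs.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)
```

Batch size (1024 for ImageNet, 128 otherwise) and averaging over 5 seeds would live in the surrounding data-loading and experiment-driver code, which the paper defers to its Appendix A.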