How Do Vision Transformers Work?
Authors: Namuk Park, Songkuk Kim
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We obtain the main experimental results from two sets of machines for CIFAR (Krizhevsky et al., 2009). The first set consists of an Intel Xeon W-2123 Processor, 32GB memory, and a single GeForce RTX 2080 Ti, and the other set of four Intel Broadwell CPUs, 15GB memory, and a single NVIDIA T4. For ImageNet (Russakovsky et al., 2015), we use an AMD Ryzen Threadripper 3960X 24-Core Processor, 256GB memory, and four GeForce RTX 2080 Ti. |
| Researcher Affiliation | Collaboration | Yonsei University, NAVER AI Lab {namuk.park,songkuk}@yonsei.ac.kr |
| Pseudocode | No | The paper describes architectural patterns using diagrams (e.g., Figure 3, Figure 11) and textual descriptions, but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is available at https://github.com/xxxnell/how-do-vits-work. |
| Open Datasets | Yes | We obtain the main experimental results from two sets of machines for CIFAR (Krizhevsky et al., 2009). ... For ImageNet (Russakovsky et al., 2015) |
| Dataset Splits | No | The paper mentions training on CIFAR and ImageNet and evaluating on their test sets, but it does not specify explicit validation splits (e.g., percentages or counts) for model training or hyperparameter tuning. It mentions using "10% of the training dataset" for the Hessian max eigenvalue spectrum analysis, but this is not a standard validation split. |
| Hardware Specification | Yes | The first set consists of an Intel Xeon W-2123 Processor, 32GB memory, and a single GeForce RTX 2080 Ti, and the other set of four Intel Broadwell CPUs, 15GB memory, and a single NVIDIA T4. For ImageNet (Russakovsky et al., 2015), we use an AMD Ryzen Threadripper 3960X 24-Core Processor, 256GB memory, and four GeForce RTX 2080 Ti. |
| Software Dependencies | No | NN models are implemented in PyTorch (Paszke et al., 2019). While PyTorch is mentioned and cited, a specific version number (e.g., 1.9, 1.10) is not provided. |
| Experiment Setup | Yes | We train NNs using categorical cross-entropy (NLL) loss and the AdamW optimizer (Loshchilov & Hutter, 2019) with an initial learning rate of 1.25 × 10⁻⁴ and weight decay of 5 × 10⁻². We also use a cosine annealing scheduler (Loshchilov & Hutter, 2017). NNs are trained for 300 epochs with a batch size of 96 on CIFAR, and a batch size of 128 on ImageNet. The learning rate is gradually increased (Goyal et al., 2017) for 5 epochs. Following Touvron et al. (2021), strong data augmentations such as RandAugment (Cubuk et al., 2020), Random Erasing (Zhong et al., 2020), label smoothing (Szegedy et al., 2016), mixup (Zhang et al., 2018), and CutMix (Yun et al., 2019) are used for training. Stochastic depth (Huang et al., 2016) is also used to regularize NNs. |
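The training recipe quoted in the Experiment Setup row can be collected into a small configuration sketch. The schedule function below is a plain-Python illustration of linear warmup followed by cosine annealing; the exact warmup shape and the annealing-to-zero floor are our assumptions, since the paper only states that the learning rate is "gradually increased" for 5 epochs and that a cosine annealing scheduler is used.

```python
import math

# Hyperparameters quoted from the paper's Experiment Setup row.
CONFIG = {
    "optimizer": "AdamW",
    "lr": 1.25e-4,
    "weight_decay": 5e-2,
    "epochs": 300,
    "batch_size_cifar": 96,
    "batch_size_imagenet": 128,
    "warmup_epochs": 5,  # "gradually increased for 5 epochs"
}

def lr_at_epoch(epoch,
                base_lr=CONFIG["lr"],
                warmup=CONFIG["warmup_epochs"],
                total=CONFIG["epochs"]):
    """Linear warmup for `warmup` epochs, then cosine annealing.

    Assumption: warmup is linear and the cosine schedule decays to
    zero; the paper does not spell out either detail.
    """
    if epoch < warmup:
        return base_lr * (epoch + 1) / warmup
    progress = (epoch - warmup) / max(1, total - warmup)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))

print(lr_at_epoch(0))    # base_lr / 5 during warmup
print(lr_at_epoch(4))    # warmup complete: base_lr
print(lr_at_epoch(299))  # near zero at the end of training
```

In an actual PyTorch run these values would typically be handed to `torch.optim.AdamW` and a cosine scheduler rather than computed by hand; the sketch only makes the quoted numbers concrete.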