Early Convolutions Help Transformers See Better

Authors: Tete Xiao, Mannat Singh, Eric Mintun, Trevor Darrell, Piotr Dollár, Ross Girshick

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In extensive experiments we show that replacing the ViT patchify stem with a more standard convolutional stem (i) allows ViT to converge faster (§5.1), (ii) enables, for the first time, the use of either AdamW or SGD without a significant drop in accuracy (§5.2), (iii) brings ViT's stability w.r.t. learning rate and weight decay closer to that of modern CNNs (§5.3), and (iv) yields improvements in ImageNet [10] top-1 error of 1-2 percentage points (§6). We conduct experiments using ImageNet-1k [10]'s standard training and validation sets, and report top-1 error.
Researcher Affiliation | Collaboration | Tete Xiao (1,2), Mannat Singh (1), Eric Mintun (1), Trevor Darrell (2), Piotr Dollár (1), Ross Girshick (1); 1: Facebook AI Research (FAIR), 2: UC Berkeley
Pseudocode | No | No pseudocode or algorithm blocks are present in the paper.
Open Source Code | No | The paper does not provide any explicit statement or link indicating the release of source code for the described methodology.
Open Datasets | Yes | We conduct experiments using ImageNet-1k [10]'s standard training and validation sets, and report top-1 error.
Dataset Splits | Yes | We conduct experiments using ImageNet-1k [10]'s standard training and validation sets, and report top-1 error.
Hardware Specification | Yes | We time models in PyTorch on 8 32GB Volta GPUs. Batch sizes and training times are reported normalized to 8 32GB Volta GPUs (see Appendix).
Software Dependencies | No | The paper mentions 'PyTorch' but does not specify its version number or versions of any other software libraries or dependencies used for the experiments.
Experiment Setup | Yes | In all experiments we train with a single half-period cosine learning rate decay schedule with a 5-epoch linear learning rate warm-up [16]. We use a minibatch size of 2048. Other hyperparameters use defaults: SGD momentum is 0.9 and AdamW's β1 = 0.9 and β2 = 0.999. We use AutoAugment [7], mixup [52] (α = 0.8), CutMix [51] (α = 1.0), and label smoothing [38] (ϵ = 0.1).