Early Convolutions Help Transformers See Better
Authors: Tete Xiao, Mannat Singh, Eric Mintun, Trevor Darrell, Piotr Dollár, Ross Girshick
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In extensive experiments we show that replacing the ViT patchify stem with a more standard convolutional stem (i) allows ViT to converge faster (§5.1), (ii) enables, for the first time, the use of either AdamW or SGD without a significant drop in accuracy (§5.2), (iii) brings ViT's stability w.r.t. learning rate and weight decay closer to that of modern CNNs (§5.3), and (iv) yields improvements in ImageNet [10] top-1 error of 1-2 percentage points (§6). We conduct experiments using ImageNet-1k's [10] standard training and validation sets, and report top-1 error. (A minimal convolutional-stem sketch follows the table.) |
| Researcher Affiliation | Collaboration | Tete Xiao (1,2), Mannat Singh (1), Eric Mintun (1), Trevor Darrell (2), Piotr Dollár (1), Ross Girshick (1); 1: Facebook AI Research (FAIR), 2: UC Berkeley |
| Pseudocode | No | No pseudocode or algorithm blocks are present in the paper. |
| Open Source Code | No | The paper does not provide any explicit statement or link indicating the release of source code for the described methodology. |
| Open Datasets | Yes | We conduct experiments using ImageNet-1k's [10] standard training and validation sets, and report top-1 error. |
| Dataset Splits | Yes | We conduct experiments using ImageNet-1k's [10] standard training and validation sets, and report top-1 error. |
| Hardware Specification | Yes | We time models in PyTorch on 8 32GB Volta GPUs. Batch sizes and training times are reported normalized to 8 32GB Volta GPUs (see Appendix). |
| Software Dependencies | No | The paper mentions 'PyTorch' but does not specify its version number or versions of any other software libraries or dependencies used for the experiments. |
| Experiment Setup | Yes | In all experiments we train with a single half-period cosine learning rate decay schedule with a 5-epoch linear learning rate warm-up [16]. We use a minibatch size of 2048. Other hyperparameters use defaults: SGD momentum is 0.9 and AdamW's β1 = 0.9 and β2 = 0.999. We use AutoAugment [7], mixup [52] (α = 0.8), CutMix [51] (α = 1.0), and label smoothing [38] (ϵ = 0.1). (A schedule and optimizer sketch follows the table.) |
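The Research Type row quotes the paper's central change: swapping ViT's patchify stem (a single stride-16, 16×16 convolution) for a more standard convolutional stem. The sketch below illustrates that idea in PyTorch; it is not the authors' released code, and the stem depth, channel widths, and BatchNorm/ReLU placement are assumptions for illustration only, since the quoted excerpt does not specify them. Both stems downsample a 224×224 input by 16× and produce the same embedding width, so either can feed the same transformer body.

```python
import torch
import torch.nn as nn

# Standard ViT "patchify" stem: one stride-16, 16x16 convolution to the
# transformer hidden size (768 for ViT-B).
patchify_stem = nn.Conv2d(3, 768, kernel_size=16, stride=16)

def conv_stem(out_dim: int = 768) -> nn.Sequential:
    """Illustrative convolutional stem: stride-2 3x3 convolutions followed by a
    1x1 convolution, reaching the same 16x downsampling and channel count.
    The channel progression below is an assumption, not from the quoted text."""
    widths = [3, 64, 128, 256, 512]
    layers = []
    for c_in, c_out in zip(widths[:-1], widths[1:]):
        layers += [
            nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        ]
    layers.append(nn.Conv2d(widths[-1], out_dim, kernel_size=1))
    return nn.Sequential(*layers)

x = torch.randn(1, 3, 224, 224)
print(patchify_stem(x).shape)  # torch.Size([1, 768, 14, 14])
print(conv_stem()(x).shape)    # torch.Size([1, 768, 14, 14])
```

Either stem yields a 14×14 grid of 768-dimensional tokens for a 224×224 image, which is why the swap leaves the rest of the ViT architecture unchanged.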
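The Experiment Setup row describes the optimization recipe: a single half-period cosine decay with a 5-epoch linear warm-up, a minibatch size of 2048, and either SGD (momentum 0.9) or AdamW (β1 = 0.9, β2 = 0.999). Below is a minimal sketch of that schedule and optimizer choice using PyTorch's LambdaLR; the base learning rates, weight decay values, and total epoch count are placeholder assumptions not given in the quoted text.

```python
import math
import torch

EPOCHS, WARMUP_EPOCHS = 100, 5  # total epochs assumed; 5-epoch warm-up is quoted

def lr_factor(epoch: int) -> float:
    """Linear warm-up for 5 epochs, then a single half-period cosine decay to zero."""
    if epoch < WARMUP_EPOCHS:
        return (epoch + 1) / WARMUP_EPOCHS
    progress = (epoch - WARMUP_EPOCHS) / (EPOCHS - WARMUP_EPOCHS)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

model = torch.nn.Linear(768, 1000)  # stand-in for the full ViT

# Either optimizer, with the quoted momentum / beta defaults; learning rate and
# weight decay here are placeholders, not values from the table.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, betas=(0.9, 0.999), weight_decay=0.1)
# opt = torch.optim.SGD(model.parameters(), lr=1.0, momentum=0.9, weight_decay=1e-4)

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=lr_factor)

for epoch in range(EPOCHS):
    # ... one epoch of training with an effective minibatch size of 2048 ...
    sched.step()
```

The scheduler steps once per epoch, so the learning rate ramps linearly over the first 5 epochs and then follows the cosine curve down to zero at the end of training.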