Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Early Convolutions Help Transformers See Better
Authors: Tete Xiao, Mannat Singh, Eric Mintun, Trevor Darrell, Piotr Dollár, Ross Girshick
NeurIPS 2021 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In extensive experiments we show that replacing the ViT patchify stem with a more standard convolutional stem (i) allows ViT to converge faster (§5.1), (ii) enables, for the first time, the use of either AdamW or SGD without a significant drop in accuracy (§5.2), (iii) brings ViT's stability w.r.t. learning rate and weight decay closer to that of modern CNNs (§5.3), and (iv) yields improvements in ImageNet [10] top-1 error of 1-2 percentage points (§6). We conduct experiments using ImageNet-1k [10]'s standard training and validation sets, and report top-1 error. |
| Researcher Affiliation | Collaboration | Tete Xiao¹,² Mannat Singh¹ Eric Mintun¹ Trevor Darrell² Piotr Dollár¹ Ross Girshick¹; ¹Facebook AI Research (FAIR), ²UC Berkeley |
| Pseudocode | No | No pseudocode or algorithm blocks are present in the paper. |
| Open Source Code | No | The paper does not provide any explicit statement or link indicating the release of source code for the described methodology. |
| Open Datasets | Yes | We conduct experiments using ImageNet-1k [10]'s standard training and validation sets, and report top-1 error. |
| Dataset Splits | Yes | We conduct experiments using ImageNet-1k [10]'s standard training and validation sets, and report top-1 error. |
| Hardware Specification | Yes | We time models in PyTorch on 8 32GB Volta GPUs. Batch sizes and training times are reported normalized to 8 32GB Volta GPUs (see Appendix). |
| Software Dependencies | No | The paper mentions 'PyTorch' but does not specify its version number or versions of any other software libraries or dependencies used for the experiments. |
| Experiment Setup | Yes | In all experiments we train with a single half-period cosine learning rate decay schedule with a 5-epoch linear learning rate warm-up [16]. We use a minibatch size of 2048. Other hyperparameters use defaults: SGD momentum is 0.9 and AdamW's β1 = 0.9 and β2 = 0.999. We use AutoAugment [7], mixup [52] (α = 0.8), CutMix [51] (α = 1.0), and label smoothing [38] (ϵ = 0.1). |
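
The schedule quoted in the Experiment Setup row (a 5-epoch linear warm-up followed by a single half-period cosine decay) can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' code: the function name `lr_at_epoch`, the per-epoch granularity, the warm-up starting value, and the decay-to-zero endpoint are all assumptions, and `base_lr` and `total_epochs` are placeholders because the quoted text does not give them.

```python
import math


def lr_at_epoch(epoch, base_lr, total_epochs, warmup_epochs=5):
    """Learning rate for a given (0-indexed) epoch.

    Hypothetical helper: linear warm-up for `warmup_epochs` epochs,
    then a half-period cosine decay over the remaining epochs.
    """
    if epoch < warmup_epochs:
        # Assumption: ramp linearly from base_lr / warmup_epochs up to base_lr.
        return base_lr * (epoch + 1) / warmup_epochs
    # Fraction of the post-warm-up training that has elapsed, in [0, 1].
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    # Half a cosine period: base_lr at progress 0, assumed to reach 0 at progress 1.
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Placeholder values chosen only to show the shape of the schedule.
    for e in (0, 4, 5, 50, 99):
        print(e, lr_at_epoch(e, base_lr=1.0, total_epochs=100))
```

In practice such a schedule is often applied per iteration rather than per epoch; the per-epoch form is used here only to keep the sketch short.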