Early Convolutions Help Transformers See Better

Authors: Tete Xiao, Mannat Singh, Eric Mintun, Trevor Darrell, Piotr Dollár, Ross Girshick

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In extensive experiments we show that replacing the ViT patchify stem with a more standard convolutional stem (i) allows ViT to converge faster (§5.1), (ii) enables, for the first time, the use of either AdamW or SGD without a significant drop in accuracy (§5.2), (iii) brings ViT's stability w.r.t. learning rate and weight decay closer to that of modern CNNs (§5.3), and (iv) yields improvements in ImageNet [10] top-1 error of 1-2 percentage points (§6). We conduct experiments using ImageNet-1k [10]'s standard training and validation sets, and report top-1 error.
Researcher Affiliation | Collaboration | Tete Xiao (1,2), Mannat Singh (1), Eric Mintun (1), Trevor Darrell (2), Piotr Dollár (1), Ross Girshick (1); 1: Facebook AI Research (FAIR), 2: UC Berkeley
Pseudocode | No | No pseudocode or algorithm blocks are present in the paper.
Open Source Code | No | The paper does not provide any explicit statement or link indicating the release of source code for the described methodology.
Open Datasets | Yes | We conduct experiments using ImageNet-1k [10]'s standard training and validation sets, and report top-1 error.
Dataset Splits | Yes | We conduct experiments using ImageNet-1k [10]'s standard training and validation sets, and report top-1 error.
Hardware Specification | Yes | We time models in PyTorch on 8 32GB Volta GPUs. Batch sizes and training times are reported normalized to 8 32GB Volta GPUs (see Appendix).
Software Dependencies | No | The paper mentions 'PyTorch' but does not specify its version number or versions of any other software libraries or dependencies used for the experiments.
Experiment Setup | Yes | In all experiments we train with a single half-period cosine learning rate decay schedule with a 5-epoch linear learning rate warm-up [16]. We use a minibatch size of 2048. Other hyperparameters use defaults: SGD momentum is 0.9 and AdamW's β1 = 0.9 and β2 = 0.999. We use AutoAugment [7], mixup [52] (α = 0.8), CutMix [51] (α = 1.0), and label smoothing [38] (ϵ = 0.1).