Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design

Authors: Ibrahim M. Alabdulmohsin, Xiaohua Zhai, Alexander Kolesnikov, Lucas Beyer

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 'We conduct a thorough evaluation across multiple tasks, such as image classification, captioning, VQA and zero-shot transfer, demonstrating the effectiveness of our model across a broad range of domains and identifying limitations.'
Researcher Affiliation | Industry | 'Ibrahim Alabdulmohsin, Xiaohua Zhai, Alexander Kolesnikov, Lucas Beyer. Google DeepMind, Zürich, Switzerland. {ibomohsin,xzhai,akolesnikov,lbeyer}@google.com'
Pseudocode | No | The paper describes the 'star sweep' and 'grid sweep' procedures in detail in Section 3 but does not present them in a structured pseudocode or algorithm block format.
Open Source Code | No | The paper states: 'We use the big_vision codebase [10, 9] for conducting experiments in this project.' However, this refers to a codebase they utilized, not an explicit release of the source code for their specific methodology (SoViT) described in this paper.
Open Datasets | Yes | ILSVRC-2012 [22], COCO captioning [48, 14], VQAv2 [28], GQA [37], CIFAR100 [46], Pets [51], Birds [74], Caltech [25], Cars [45], Colorectal [40], DTD [17], UC [76]. (All cited datasets are public.)
Dataset Splits | Yes | 'We use a held-out 2% of Train to select hyper-parameters. Selecting them on Val would increase all scores.' (Table 1 footnote). 'Train and minival splits: train[:98%] and train[98%:].' (Table 8). A split sketch follows below the table.
Hardware Specification | Yes | 'Experiments are executed on Tensor Processing Units (TPU).' 'SoViT-400m/14 is pretrained on 40 billion examples, which amounts to 9T GFLOPs and 230K TPUv3 core-hours.' (Cost arithmetic below the table.)
Software Dependencies | No | The paper mentions the 'Adafactor [60]' optimizer in Table 5, but it does not provide specific version numbers for this or any other key software components, such as the programming language (e.g., Python), deep learning framework (e.g., TensorFlow, PyTorch), or CUDA libraries.
Experiment Setup | Yes | 'Table 5 provides the set of hyperparameters used in the star and grid sweeps.' 'We use a small batch size of 128 here in order to train multiple models in parallel on small hardware topologies.' (Appendix B.1). Table 5 details: image resolution 224×224; batch size 128; preprocessing Rescale(-1, 1); augmentation Inception crop, left-right flip; optimizer Adafactor [60]; gradient clipping 1.0; learning rate 8e-4; label smoothing 0; weight decay 0.03 · 8e-4; schedule reverse sqrt with 10K warmup steps and 50K cooldown steps. (Config sketch below the table.)
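The 98% / 2% train/minival split quoted in the Dataset Splits row maps directly onto TFDS split slicing, which the big_vision codebase builds on. A minimal sketch, assuming the standard 'imagenet2012' TFDS registration; the exact loading code is not given in the paper.

```python
# Minimal sketch of the train/minival split quoted above, expressed with TFDS
# split slicing. The dataset name and loading style are assumptions, not code
# taken from the paper.
import tensorflow_datasets as tfds

# 98% of the official training set is used for training ...
train_ds = tfds.load("imagenet2012", split="train[:98%]", shuffle_files=True)
# ... and the held-out 2% ("minival") is used to select hyper-parameters,
# keeping the official validation set untouched for reported results.
minival_ds = tfds.load("imagenet2012", split="train[98%:]")
```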
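As a back-of-the-envelope check on the Hardware Specification row, the quoted pretraining budget can be converted into a per-example cost; reading '9T GFLOPs' as 9×10^12 GFLOPs in total is an assumption.

```python
# Back-of-the-envelope conversion of the quoted pretraining cost into a
# per-example figure. Assumes "9T GFLOPs" means 9e12 GFLOPs in total.
total_flops = 9e12 * 1e9      # 9T GFLOPs expressed in FLOPs
seen_examples = 40e9          # 40 billion pretraining examples
flops_per_example = total_flops / seen_examples
print(f"~{flops_per_example / 1e9:.0f} GFLOPs per example")  # ~225 GFLOPs
```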
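Read as a flat configuration, the Table 5 settings quoted in the Experiment Setup row look roughly like the sketch below. This is an illustrative restatement, not the authors' actual big_vision config; the key names, and the reading of the weight decay entry as 0.03 × 8e-4, are assumptions.

```python
# Illustrative restatement of the Table 5 sweep hyperparameters as a flat
# config dict. Key names are assumptions; values are the ones quoted above.
sweep_config = dict(
    image_resolution=(224, 224),
    batch_size=128,               # small batch to train many models in parallel
    preprocessing="rescale to [-1, 1]",
    augmentation=("inception_crop", "left_right_flip"),
    optimizer="adafactor",
    gradient_clipping=1.0,
    learning_rate=8e-4,
    label_smoothing=0.0,
    weight_decay=0.03 * 8e-4,     # assumed reading of "0.03 8e-4" in Table 5
    schedule="reverse_sqrt",
    warmup_steps=10_000,
    cooldown_steps=50_000,
)
```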