Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design
Authors: Ibrahim M. Alabdulmohsin, Xiaohua Zhai, Alexander Kolesnikov, Lucas Beyer
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct a thorough evaluation across multiple tasks, such as image classification, captioning, VQA and zero-shot transfer, demonstrating the effectiveness of our model across a broad range of domains and identifying limitations. |
| Researcher Affiliation | Industry | Ibrahim Alabdulmohsin, Xiaohua Zhai, Alexander Kolesnikov, Lucas Beyer, Google DeepMind, Zürich, Switzerland, {ibomohsin,xzhai,akolesnikov,lbeyer}@google.com |
| Pseudocode | No | The paper describes the 'star sweep' and 'grid sweep' procedures in detail in Section 3 but does not present them in a structured pseudocode or algorithm block format (a hedged sketch of the two sweeps follows the table). |
| Open Source Code | No | The paper states: 'We use the big_vision codebase [10, 9] for conducting experiments in this project.' However, this refers to a codebase they utilized, not an explicit release of the source code for their specific methodology (SoViT) described in this paper. |
| Open Datasets | Yes | ILSVRC-2012 [22], COCO captioning [48, 14], VQAv2 [28], GQA [37], CIFAR-100 [46], Pets [51], Birds [74], Caltech [25], Cars [45], Colorectal [40], DTD [17], UC [76]. (All cited datasets are publicly available.) |
| Dataset Splits | Yes | 'We use a held-out 2% of Train to select hyper-parameters. Selecting them on Val would increase all scores.' (Table 1 footnote). The train and minival splits are train[:98%] and train[98%:] (Table 8; see the split sketch after the table). |
| Hardware Specification | Yes | Experiments are executed on Tensor Processing Units (TPU). SoViT-400m/14 is pretrained on 40 billion examples, which amounts to 9T GFLOPs and 230K TPUv3 core-hours. |
| Software Dependencies | No | The paper mentions 'Optimizer Adafactor [60]' in Table 5, but it does not provide specific version numbers for this or any other key software components like programming languages (e.g., Python), deep learning frameworks (e.g., TensorFlow, PyTorch), or CUDA libraries. |
| Experiment Setup | Yes | Table 5 provides the set of hyperparameters used in the star and grid sweeps. 'We use a small batch size of 128 here in order to train multiple models in parallel on small hardware topologies.' (Appendix B.1). Table 5 lists: Image Resolution 224×224; Batch Size 128; Preprocessing Rescale(-1, 1); Augmentation Inception Crop, Left-Right Flip; Optimizer Adafactor [60]; Gradient Clipping 1.0; Learning Rate 8e-4; Label Smoothing 0; Weight Decay 0.03 8e-4; Schedule Reverse sqrt, 10K warmup steps, 50K cooldown steps (an illustrative schedule sketch follows the table). |
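
Since the paper gives no algorithm block, the following is a minimal sketch of how the star and grid sweeps described in Section 3 could be organized. The helpers `train_and_eval` (train a model of a given shape for a given compute budget and return its loss) and `fit_power_law` are hypothetical placeholders, not functions from the paper or the big_vision codebase.

```python
import itertools

def star_sweep(center, candidates, compute_budgets, train_and_eval, fit_power_law):
    """Vary one shape dimension (e.g. width, depth, MLP dim) at a time around a
    fixed 'star center', then fit a power law per dimension describing how its
    compute-optimal value scales with compute. Hypothetical sketch."""
    exponents = {}
    for dim, values in candidates.items():           # e.g. {"width": [...], "depth": [...]}
        optima = []
        for budget in compute_budgets:
            losses = []
            for v in values:
                shape = dict(center, **{dim: v})     # all other dimensions stay at the center
                losses.append((train_and_eval(shape, budget), v))
            optima.append((budget, min(losses)[1]))  # best value of this dimension at this budget
        exponents[dim] = fit_power_law(optima)       # optimal value ~ compute ** exponent
    return exponents

def grid_sweep(grid, small_budget, train_and_eval):
    """Train small models over a full grid of shapes at one small compute budget,
    to pick a reasonable star center. Hypothetical sketch."""
    shapes = [dict(zip(grid, vals)) for vals in itertools.product(*grid.values())]
    return min(shapes, key=lambda s: train_and_eval(s, small_budget))
```

The key point of this structure is that the star sweep varies one shape dimension at a time around a fixed center, so the number of training runs grows linearly rather than combinatorially with the number of shape dimensions.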
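The quoted split protocol (train[:98%] / train[98%:]) maps directly onto TFDS-style split strings. A minimal sketch, assuming the ImageNet-1k TFDS dataset name used in big_vision:

```python
import tensorflow_datasets as tfds

# Hedged reconstruction of the split protocol quoted above (Table 8): the last 2%
# of the ImageNet-1k train split is held out as "minival" for hyper-parameter selection.
train_ds = tfds.load("imagenet2012", split="train[:98%]")
minival_ds = tfds.load("imagenet2012", split="train[98%:]")
```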
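The Table 5 schedule ('Reverse sqrt, 10K warmup steps, 50K cooldown steps' at a base learning rate of 8e-4) can be written roughly as below. This is a sketch assuming linear warmup and cooldown and an illustrative total step count; the exact big_vision implementation may differ.

```python
def rsqrt_schedule(step, base_lr=8e-4, warmup_steps=10_000,
                   cooldown_steps=50_000, total_steps=200_000):
    """Reverse-sqrt learning-rate schedule with linear warmup and linear cooldown,
    following the Table 5 entries quoted above; total_steps is illustrative."""
    warmup = min(1.0, step / warmup_steps)                     # linear warmup to base_lr
    decay = (warmup_steps / max(step, warmup_steps)) ** 0.5    # reverse-sqrt decay after warmup
    cooldown = min(1.0, max(0.0, (total_steps - step) / cooldown_steps))  # linear cooldown to 0
    return base_lr * warmup * decay * cooldown
```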