Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design

Authors: Ibrahim M. Alabdulmohsin, Xiaohua Zhai, Alexander Kolesnikov, Lucas Beyer

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 'We conduct a thorough evaluation across multiple tasks, such as image classification, captioning, VQA and zero-shot transfer, demonstrating the effectiveness of our model across a broad range of domains and identifying limitations.'
Researcher Affiliation | Industry | 'Ibrahim Alabdulmohsin, Xiaohua Zhai, Alexander Kolesnikov, Lucas Beyer. Google DeepMind, Zürich, Switzerland. {ibomohsin,xzhai,akolesnikov,lbeyer}@google.com'
Pseudocode | No | The paper describes the 'star sweep' and 'grid sweep' procedures in detail in Section 3 but does not present them in a structured pseudocode or algorithm block format.
Open Source Code | No | The paper states: 'We use the big_vision codebase [10, 9] for conducting experiments in this project.' However, this refers to a codebase they utilized, not an explicit release of the source code for their specific methodology (SoViT) described in this paper.
Open Datasets | Yes | ILSVRC-2012 [22], COCO captioning [48, 14], VQAv2 [28], GQA [37], CIFAR100 [46], Pets [51], Birds [74], Caltech [25], Cars [45], Colorectal [40], DTD [17], UC [76]. (All cited datasets are public.)
Dataset Splits | Yes | 'We use a held-out 2% of Train to select hyper-parameters. Selecting them on Val would increase all scores.' (Table 1 footnote). 'Train and minival splits: train[:98%] and train[98%:].' (Table 8). A split sketch follows below the table.
Hardware Specification | Yes | 'Experiments are executed on Tensor Processing Units (TPU).' 'SoViT-400m/14 is pretrained on 40 billion examples, which amounts to 9T GFLOPs and 230K TPUv3 core-hours.' (Cost arithmetic below the table.)
Software Dependencies | No | The paper mentions the 'Adafactor [60]' optimizer in Table 5, but it does not provide specific version numbers for this or any other key software components, such as the programming language (e.g., Python), deep learning framework (e.g., TensorFlow, PyTorch), or CUDA libraries.
Experiment Setup | Yes | 'Table 5 provides the set of hyperparameters used in the star and grid sweeps.' 'We use a small batch size of 128 here in order to train multiple models in parallel on small hardware topologies.' (Appendix B.1). Table 5 details: image resolution 224×224; batch size 128; preprocessing Rescale(-1, 1); augmentation Inception crop, left-right flip; optimizer Adafactor [60]; gradient clipping 1.0; learning rate 8e-4; label smoothing 0; weight decay 0.03 · 8e-4; schedule reverse sqrt with 10K warmup steps and 50K cooldown steps. (Config sketch below the table.)
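The 98% / 2% train/minival split quoted in the Dataset Splits row maps directly onto TFDS split slicing, which the big_vision codebase builds on. A minimal sketch, assuming the standard 'imagenet2012' TFDS registration; the exact loading code is not given in the paper.

```python
# Minimal sketch of the train/minival split quoted above, expressed with TFDS
# split slicing. The dataset name and loading style are assumptions, not code
# taken from the paper.
import tensorflow_datasets as tfds

# 98% of the official training set is used for training ...
train_ds = tfds.load("imagenet2012", split="train[:98%]", shuffle_files=True)
# ... and the held-out 2% ("minival") is used to select hyper-parameters,
# keeping the official validation set untouched for reported results.
minival_ds = tfds.load("imagenet2012", split="train[98%:]")
```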
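As a back-of-the-envelope check on the Hardware Specification row, the quoted pretraining budget can be converted into a per-example cost; reading '9T GFLOPs' as 9×10^12 GFLOPs in total is an assumption.

```python
# Back-of-the-envelope conversion of the quoted pretraining cost into a
# per-example figure. Assumes "9T GFLOPs" means 9e12 GFLOPs in total.
total_flops = 9e12 * 1e9      # 9T GFLOPs expressed in FLOPs
seen_examples = 40e9          # 40 billion pretraining examples
flops_per_example = total_flops / seen_examples
print(f"~{flops_per_example / 1e9:.0f} GFLOPs per example")  # ~225 GFLOPs
```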
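Read as a flat configuration, the Table 5 settings quoted in the Experiment Setup row look roughly like the sketch below. This is an illustrative restatement, not the authors' actual big_vision config; the key names, and the reading of the weight decay entry as 0.03 × 8e-4, are assumptions.

```python
# Illustrative restatement of the Table 5 sweep hyperparameters as a flat
# config dict. Key names are assumptions; values are the ones quoted above.
sweep_config = dict(
    image_resolution=(224, 224),
    batch_size=128,               # small batch to train many models in parallel
    preprocessing="rescale to [-1, 1]",
    augmentation=("inception_crop", "left_right_flip"),
    optimizer="adafactor",
    gradient_clipping=1.0,
    learning_rate=8e-4,
    label_smoothing=0.0,
    weight_decay=0.03 * 8e-4,     # assumed reading of "0.03 8e-4" in Table 5
    schedule="reverse_sqrt",
    warmup_steps=10_000,
    cooldown_steps=50_000,
)
```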