Revisiting Neural Scaling Laws in Language and Vision

Authors: Ibrahim M. Alabdulmohsin, Behnam Neyshabur, Xiaohua Zhai

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We provide an empirical evaluation of the four scaling law estimators in several domains, including image classification (72 tasks), neural machine translation (5 tasks), language modeling (5 tasks), and other language-related evaluations (10 tasks).
Researcher Affiliation | Industry | Ibrahim Alabdulmohsin (Google Research, Brain Team, Zürich, Switzerland, ibomohsin@google.com); Behnam Neyshabur (Google Research, Blueshift Team, Mountain View, United States, neyshabur@google.com); Xiaohua Zhai (Google Research, Brain Team, Zürich, Switzerland, xzhai@google.com)
Pseudocode | No | The paper does not contain structured pseudocode, algorithm blocks, or clearly labeled algorithm sections.
Open Source Code | Yes | The code and dataset for the remaining tasks used in this evaluation are made publicly available to facilitate further research in this domain. (Footnote 3: Code and benchmark dataset will be made available at: https://github.com/google-research/google-research/tree/master/revisiting_neural_scaling_laws)
Open Datasets | No | Some of the datasets used in our experiments are proprietary and cannot be released, such as JFT-300M. We also include experiments on publicly available datasets, such as Big-Bench for reproducibility. The paper notes that JFT-300M is proprietary; while BIG-Bench is described as publicly available, the paper does not provide concrete access information (a specific link, DOI, or repository) for the exact data or version used in its experiments.
Dataset Splits | Yes | In all experiments, we divide the learning curve into two splits: (1) one split used for training the scaling law estimators, and (2) one split used for evaluating extrapolation. Setting τ = xmax/2, where xmax is the maximum value of x in the data, the first split is the domain x ∈ [0, τ] while the second split is the domain x ∈ (τ, 2τ]. (A minimal sketch of this split protocol is given after the table.)
Hardware Specification | No | All experiments are executed on Tensor Processing Units (TPUs). The paper specifies 'TPUs' but does not provide specific model numbers or detailed specifications for the hardware used.
Software Dependencies | No | The paper mentions optimizers like 'Adam optimizer [24]' and 'Adafactor optimizer [33]', but it does not specify software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | ...with a base learning rate of 5e-4, batch-size 4,096, and dropout rate of 0.1. Models are trained with the per-token cross-entropy loss using the Adafactor optimizer [33] with a batch-size of 500K tokens and a dropout rate of 0.1 [3]. (These quoted hyperparameters are restated in the configuration sketch after the table.)
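
For reference, the split protocol quoted under "Dataset Splits" can be expressed as a short sketch. This is an illustrative reconstruction, not code from the paper's repository; the function name and NumPy-based interface are assumptions.

```python
# Minimal sketch (assumed interface, not the paper's released code) of the
# learning-curve split described under "Dataset Splits": with tau = xmax / 2,
# fit estimators on x in [0, tau] and evaluate extrapolation on x in (tau, 2*tau].
import numpy as np

def split_learning_curve(x, y):
    """Return (fit_split, extrapolation_split), each as an (x, y) pair."""
    x, y = np.asarray(x), np.asarray(y)
    tau = x.max() / 2.0
    fit_mask = x <= tau
    extrap_mask = (x > tau) & (x <= 2.0 * tau)
    return (x[fit_mask], y[fit_mask]), (x[extrap_mask], y[extrap_mask])
```

Here x stands for the scaled quantity on the learning curve (e.g., data or model size) and y for the corresponding performance metric.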
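
The hyperparameters quoted under "Experiment Setup" can likewise be collected into a configuration sketch. The dictionary layout and key names below are assumptions for illustration; only the numeric values come from the quoted text.

```python
# Hypothetical configuration dictionaries restating the quoted hyperparameters;
# key names and groupings are illustrative, not taken from the paper's code.
first_quoted_setup = {
    "base_learning_rate": 5e-4,
    "batch_size": 4096,
    "dropout_rate": 0.1,
}

second_quoted_setup = {
    "optimizer": "adafactor",          # Adafactor optimizer [33]
    "loss": "per_token_cross_entropy",
    "batch_size_tokens": 500_000,      # batch-size of 500K tokens
    "dropout_rate": 0.1,
}
```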