Revisiting Neural Scaling Laws in Language and Vision
Authors: Ibrahim M. Alabdulmohsin, Behnam Neyshabur, Xiaohua Zhai
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide an empirical evaluation of the four scaling law estimators in several domains, including image classification (72 tasks), neural machine translation (5 tasks), language modeling (5 tasks), and other language-related evaluations (10 tasks). |
| Researcher Affiliation | Industry | Ibrahim Alabdulmohsin, Google Research, Brain Team, Zürich, Switzerland (ibomohsin@google.com); Behnam Neyshabur, Google Research, Blueshift Team, Mountain View, United States (neyshabur@google.com); Xiaohua Zhai, Google Research, Brain Team, Zürich, Switzerland (xzhai@google.com) |
| Pseudocode | No | The paper does not contain structured pseudocode, algorithm blocks, or clearly labeled algorithm sections. |
| Open Source Code | Yes | The code and dataset for the remaining tasks used in this evaluation are made publicly available to facilitate further research in this domain. Code and benchmark dataset will be made available at: https://github.com/google-research/google-research/tree/master/revisiting_neural_scaling_laws |
| Open Datasets | No | Some of the datasets used in our experiments are proprietary and cannot be released, such as JFT-300M. We also include experiments on publicly available datasets, such as Big-Bench, for reproducibility. The paper identifies JFT-300M as proprietary, and although BIG-Bench is described as publicly available, the paper does not provide concrete access information (a specific link, DOI, or repository) for the exact data or version used in its experiments. |
| Dataset Splits | Yes | In all experiments, we divide the learning curve into two splits: (1) one split used for training the scaling law estimators, and (2) one split used for evaluating extrapolation. Setting τ = x_max/2, where x_max is the maximum value of x in the data, the first split is the domain x ∈ [0, τ] while the second split is the domain x ∈ (τ, 2τ] (see the sketch after the table). |
| Hardware Specification | No | All experiments are executed on Tensor Processing Units (TPUs). The paper specifies 'TPUs' but does not provide specific model numbers or detailed specifications for the hardware used. |
| Software Dependencies | No | The paper mentions optimizers like 'Adam optimizer [24]' and 'Adafactor optimizer [33]', but it does not specify software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | ...with a base learning rate of 5e-4, batch-size 4,096, and dropout rate of 0.1. Models are trained with the per-token cross-entropy loss using the Adafactor optimizer [33] with a batch-size of 500K tokens and a dropout rate of 0.1 [3]. |
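
To make the split protocol quoted above concrete, here is a minimal Python sketch (not the authors' released code) of how a learning curve could be divided into the training and extrapolation segments, assuming the curve is given as paired arrays `x` and `y`; the function and variable names are illustrative only.

```python
import numpy as np

def split_learning_curve(x, y):
    """Split a learning curve into a training segment and an extrapolation
    segment, following the protocol described in the paper:
    tau = x_max / 2, train on x in [0, tau], extrapolate on x in (tau, 2*tau].

    `x` and `y` are 1-D arrays of equal length (e.g. data size vs. error);
    names are hypothetical, not taken from the authors' repository.
    """
    x = np.asarray(x)
    y = np.asarray(y)
    tau = x.max() / 2.0
    train_mask = x <= tau
    extra_mask = (x > tau) & (x <= 2 * tau)
    return (x[train_mask], y[train_mask]), (x[extra_mask], y[extra_mask])

# Toy usage: a synthetic power-law-plus-constant learning curve.
x = np.logspace(3, 6, num=20)      # e.g. number of training examples
y = 2.0 * x ** -0.3 + 0.05         # illustrative error curve, not real data
(train_x, train_y), (eval_x, eval_y) = split_learning_curve(x, y)
print(len(train_x), "training points,", len(eval_x), "extrapolation points")
```

In this sketch, the scaling law estimators would be fit on the first segment and their extrapolation error measured on the second, matching the two-split setup described in the Dataset Splits row.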