Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Scaling Laws for Precision
Authors: Tanishq Kumar, Zachary Ankner, Benjamin Spector, Blake Bordelon, Niklas Muennighoff, Mansheej Paul, Cengiz Pehlevan, Christopher Ré, Aditi Raghunathan
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We fit on over 465 pretraining runs and validate our predictions on model sizes up to 1.7B parameters trained on up to 26B tokens. Our experiments consist of a sweep of language model pretraining runs over N ∈ {30, 60, 110, 220} million parameters (non-embedding) and D ∈ {1.5, 3, 6, 13, 26} billion tokens. |
| Researcher Affiliation | Collaboration | 1Harvard University 2Stanford University 3MIT 4Databricks 5Carnegie Mellon University |
| Pseudocode | No | The paper describes methods and derivations but does not include any explicitly labeled pseudocode or algorithm blocks. Procedures are described in paragraph form. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code for the described methodology, nor does it provide any repository links. |
| Open Datasets | Yes | We train and evaluate a suite of OLMo-style models on the Dolma V1.7 dataset (Groeneveld et al., 2024; Soldaini et al., 2024) |
| Dataset Splits | No | We launch over 20 runs for each (N, D) combination to study scaling in precision, trained and validated on the common crawl split of the Dolma dataset (Soldaini et al., 2024). The paper refers to a 'common crawl split' but does not provide specific percentages or counts for training, validation, or test sets. |
| Hardware Specification | Yes | We train all our models with fake (simulated) quantization on NVIDIA H100 GPUs to remain hardware agnostic, not taking advantage of any true low-precision computation. |
| Software Dependencies | No | The paper mentions various components and optimizers such as SwiGLU activations, RoPE embeddings, RMSLayerNorm, and Adam with specific beta and epsilon values. However, it does not provide version numbers for any software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages used. |
| Experiment Setup | Yes | We use a standard Transformer++ implementation: SwiGLU activations (Shazeer, 2020), RoPE embeddings (Su et al., 2021), RMSLayerNorm, Adam β values of (0.9, 0.95). We adopt a cosine learning rate schedule with 10% warmup period and peak learning rate of 6e-4 for the smallest model and learning rates scaled with width and depth according to depth-µP for the larger models (Yang et al., 2022; Bordelon et al., 2023). We use a sequence length of 1024 and batch size of 256 throughout, with Adam ϵ of 1e-15, following (Wortsman et al., 2023b). We use weight decay of 0.1... |
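The hardware row notes that the paper trains with "fake (simulated) quantization" rather than true low-precision arithmetic. A minimal NumPy sketch of what such quantize-dequantize simulation typically looks like is below; the symmetric per-tensor scheme, function name, and rounding details are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def fake_quantize(x, bits=8):
    """Simulate low-precision storage: map values onto a signed integer
    grid, round, then map back to floats (quantize-dequantize).

    Symmetric per-tensor quantization; real setups may use per-channel
    scales or different rounding modes.
    """
    qmax = 2 ** (bits - 1) - 1            # e.g. 127 for 8 bits
    max_abs = np.max(np.abs(x))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest representable grid point, then rescale.
    return np.round(x / scale) * scale
```

Because the result is still a float tensor, training code runs unchanged while the values carry only low-precision information, which is what "hardware agnostic" simulation achieves.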
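The experiment-setup row describes a cosine learning-rate schedule with a 10% warmup period and a 6e-4 peak for the smallest model. A hedged Python sketch of such a schedule follows; the linear warmup shape, function name, and zero minimum learning rate are assumptions for illustration, not the paper's code:

```python
import math

def lr_at_step(step, total_steps, peak_lr=6e-4, warmup_frac=0.10, min_lr=0.0):
    """Linear warmup over the first warmup_frac of training,
    then cosine decay from peak_lr down to min_lr."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Linear ramp from ~0 up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

For the larger models the paper scales these rates with width and depth via depth-µP, which this single-scalar sketch does not attempt to reproduce.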