Resolving Discrepancies in Compute-Optimal Scaling of Language Models
Authors: Tomer Porian, Mitchell Wortsman, Jenia Jitsev, Ludwig Schmidt, Yair Carmon
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We explain the discrepancy by reproducing the Kaplan et al. scaling law on two datasets (OpenWebText2 and RefinedWeb) and identifying three factors causing the difference: last layer computational cost, warmup duration, and scale-dependent optimizer tuning. With these factors corrected, we obtain excellent agreement with the Hoffmann et al. (i.e., Chinchilla) scaling law. |
| Researcher Affiliation | Collaboration | Tel Aviv University (correspondence to tomerpor@gmail.com and ycarmon@tauex.tau.ac.il); University of Washington; Jülich Supercomputing Centre (JSC) and LAION. |
| Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as 'Pseudocode' or 'Algorithm', nor does it present structured, code-like steps for any method. |
| Open Source Code | Yes | Code and data release. To facilitate future research, we share the data and the code necessary to reproduce our analyses and figures at https://github.com/formll/resolving-scaling-law-discrepancies. |
| Open Datasets | Yes | Data. We perform our experiments on OpenWebText2 [18], which contains roughly 30B tokens of data from Reddit... as well as the RefinedWeb [41] dataset, which contains roughly 600B tokens from Common Crawl [1]... |
| Dataset Splits | Yes | We evaluate models on 160M tokens held out from the training data. ... For FLOP values where validation loss is not available... we use the smoothed training loss instead. ... We estimate the former directly by storing the validation loss on 100 subsamples of the holdout data... |
| Hardware Specification | Yes | Hardware and computational cost. We train our models on a cluster with 40GB A100 GPUs, using between 4 and 32 GPUs in parallel per training run. |
| Software Dependencies | No | The paper mentions 'OpenLM', 'PyTorch', and 'xFormers' as software used, but does not provide specific version numbers for these components to ensure reproducibility of the environment. |
| Experiment Setup | Yes | We largely base our initial training configuration on the hyperparameter search in Gadre et al. [17]. ... In Table 3 and Table 4 we describe our choice of hyperparameters in our experiments. |
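
The "last layer computational cost" factor noted in the Research Type row refers to whether the vocabulary-projection (output) layer is counted when estimating a model's size and training compute; at small scale this layer is a sizable share of the total, so omitting it skews fitted scaling-law exponents. The sketch below is a minimal illustration of that accounting difference, assuming the standard 12·d² per-layer parameter estimate and the C ≈ 6·N·D training-compute rule of thumb; the model configuration and constants are hypothetical and are not taken from the paper's code.

```python
# Illustrative sketch: parameter/FLOP accounting with and without the output head.
# Constants (12*d^2 per layer, C ~ 6*N*D) are standard approximations; the exact
# accounting in the paper's repository may differ.

def transformer_params(n_layers: int, d_model: int, vocab_size: int, include_head: bool) -> int:
    """Approximate non-embedding parameter count of a decoder-only transformer."""
    per_layer = 12 * d_model ** 2                      # attention (4*d^2) + MLP (8*d^2), biases ignored
    head = d_model * vocab_size if include_head else 0  # final vocabulary projection
    return n_layers * per_layer + head

def training_flops(n_params: int, n_tokens: float) -> float:
    """Standard C ~ 6*N*D estimate for total training compute."""
    return 6 * n_params * n_tokens

if __name__ == "__main__":
    # Hypothetical small model; at this scale the head is roughly a third of
    # the non-embedding parameters, so dropping it changes compute noticeably.
    cfg = dict(n_layers=12, d_model=768, vocab_size=50432)
    n_tokens = 10e9  # 10B training tokens (arbitrary example)
    for include_head in (True, False):
        n = transformer_params(**cfg, include_head=include_head)
        print(f"head counted={include_head}: N={n:,} params, C={training_flops(n, n_tokens):.3e} FLOPs")
```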
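The Dataset Splits row mentions estimating noise by storing the validation loss on 100 subsamples of the holdout data. The sketch below shows one plausible way to do this (split the holdout into equal subsamples and use the spread of subsample means as a noise estimate); the function name, data layout, and synthetic losses are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def validation_loss_with_noise(token_losses: np.ndarray, n_subsamples: int = 100, seed: int = 0):
    """Return the mean validation loss and an estimate of its sampling noise.

    `token_losses` is a 1-D array of per-token losses over the full holdout set.
    The holdout is split into `n_subsamples` equal parts; the spread of the
    subsample means gives a standard-error estimate for the overall mean.
    """
    rng = np.random.default_rng(seed)
    shuffled = token_losses[rng.permutation(len(token_losses))]
    subsample_means = np.array([chunk.mean() for chunk in np.array_split(shuffled, n_subsamples)])
    return subsample_means.mean(), subsample_means.std(ddof=1) / np.sqrt(n_subsamples)

# Usage with synthetic per-token losses standing in for real model output.
losses = np.random.default_rng(1).normal(loc=3.0, scale=0.5, size=200_000)
mean, noise = validation_loss_with_noise(losses)
print(f"validation loss {mean:.4f} +/- {noise:.4f}")
```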