Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Resolving Discrepancies in Compute-Optimal Scaling of Language Models

Authors: Tomer Porian, Mitchell Wortsman, Jenia Jitsev, Ludwig Schmidt, Yair Carmon

NeurIPS 2024 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We explain the discrepancy by reproducing the Kaplan et al. scaling law on two datasets (OpenWebText2 and RefinedWeb) and identifying three factors causing the difference: last layer computational cost, warmup duration, and scale-dependent optimizer tuning. With these factors corrected, we obtain excellent agreement with the Hoffmann et al. (i.e., Chinchilla) scaling law.
Researcher Affiliation | Collaboration | Tel Aviv University; correspondence to EMAIL and EMAIL. University of Washington. Jülich Supercomputing Centre (JSC) and LAION.
Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as 'Pseudocode' or 'Algorithm', nor does it present structured, code-like steps for any method.
Open Source Code | Yes | Code and data release. To facilitate future research, we share the data and the code necessary to reproduce our analyses and figures at https://github.com/formll/resolving-scaling-law-discrepancies.
Open Datasets | Yes | Data. We perform our experiments on OpenWebText2 [18], which contains roughly 30B tokens of data from Reddit... as well as the RefinedWeb [41] dataset, which contains roughly 600B tokens from Common Crawl [1]...
Dataset Splits | Yes | We evaluate models on 160M tokens held out from the training data. ... For FLOP values where validation loss is not available... we use the smoothed training loss instead. ... We estimate the former directly by storing the validation loss on 100 subsamples of the holdout data...
Hardware Specification | Yes | Hardware and computational cost. We train our models on a cluster with 40GB A100 GPUs, using between 4 and 32 GPUs in parallel per training run.
Software Dependencies | No | The paper mentions 'Open LM', 'PyTorch', and 'xFormers' as software used, but does not provide specific version numbers for these components, which would be needed to reproduce the environment.
Experiment Setup | Yes | We largely base our initial training configuration on the hyperparameter search in Gadre et al. [17]. ... In Table 3 and Table 4 we describe our choice of hyperparameters.
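For readers unfamiliar with the compute-optimal scaling laws the paper reconciles, here is a minimal, self-contained sketch (not the authors' released code) of the core fitting step: a compute-optimal law posits an optimal model size N_opt proportional to C^a for compute budget C, and the exponent a is typically recovered by a linear fit in log-log space. The constant, exponent, and data below are synthetic and purely illustrative.

```python
import numpy as np

# Synthetic compute budgets C (FLOPs) and compute-optimal model sizes N,
# generated from an assumed power law N_opt = k * C**a with a = 0.5
# (roughly the exponent reported by Hoffmann et al.).
C = np.logspace(17, 21, 9)   # compute budgets in FLOPs
N = 0.1 * C ** 0.5           # synthetic optimal parameter counts

# Fit log N = a * log C + log k; polyfit returns (slope, intercept).
a, log_k = np.polyfit(np.log(C), np.log(N), 1)
print(f"fitted exponent a = {a:.3f}")
```

On real data the fitted exponent is what distinguishes the Kaplan et al. and Hoffmann et al. laws, which is why corrections such as excluding last-layer cost or retuning the optimizer per scale can shift the conclusion.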