Resolving Discrepancies in Compute-Optimal Scaling of Language Models
Authors: Tomer Porian, Mitchell Wortsman, Jenia Jitsev, Ludwig Schmidt, Yair Carmon
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We explain the discrepancy by reproducing the Kaplan et al. scaling law on two datasets (OpenWebText2 and RefinedWeb) and identifying three factors causing the difference: last layer computational cost, warmup duration, and scale-dependent optimizer tuning. With these factors corrected, we obtain excellent agreement with the Hoffmann et al. (i.e., Chinchilla) scaling law. |
| Researcher Affiliation | Collaboration | Tel Aviv University (correspondence to tomerpor@gmail.com and ycarmon@tauex.tau.ac.il); University of Washington; Jülich Supercomputing Centre (JSC) and LAION. |
| Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as 'Pseudocode' or 'Algorithm', nor does it present structured, code-like steps for any method. |
| Open Source Code | Yes | Code and data release. To facilitate future research, we share the data and the code necessary to reproduce our analyses and figures at https://github.com/formll/resolving-scaling-law-discrepancies. |
| Open Datasets | Yes | Data. We perform our experiments on OpenWebText2 [18], which contains roughly 30B tokens of data from Reddit... as well as the RefinedWeb [41] dataset, which contains roughly 600B tokens from Common Crawl [1]... |
| Dataset Splits | Yes | We evaluate models on 160M tokens held out from the training data. ... For FLOP values where validation loss is not available... we use the smoothed training loss instead. ... We estimate the former directly by storing the validation loss on 100 subsamples of the holdout data... |
| Hardware Specification | Yes | Hardware and computational cost. We train our models on a cluster with 40GB A100 GPUs, using between 4 and 32 GPUs in parallel per training run. |
| Software Dependencies | No | The paper mentions 'OpenLM', 'PyTorch', and 'xFormers' as software used, but does not provide specific version numbers for these components to ensure reproducibility of the environment. |
| Experiment Setup | Yes | We largely base our initial training configuration on the hyperparameter search in Gadre et al. [17]. ... In Table 3 and Table 4 we describe our choice of hyperparameters in our experiments. |
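
The "last layer computational cost" factor noted in the Research Type row refers to whether the vocabulary-projection (output) layer is counted when estimating a model's size and training compute; at small scale this layer is a sizable share of the total, so omitting it skews fitted scaling-law exponents. The sketch below is a minimal illustration of that accounting difference, assuming the standard 12·d² per-layer parameter estimate and the C ≈ 6·N·D training-compute rule of thumb; the model configuration and constants are hypothetical and are not taken from the paper's code.

```python
# Illustrative sketch: parameter/FLOP accounting with and without the output head.
# Constants (12*d^2 per layer, C ~ 6*N*D) are standard approximations; the exact
# accounting in the paper's repository may differ.

def transformer_params(n_layers: int, d_model: int, vocab_size: int, include_head: bool) -> int:
    """Approximate non-embedding parameter count of a decoder-only transformer."""
    per_layer = 12 * d_model ** 2                      # attention (4*d^2) + MLP (8*d^2), biases ignored
    head = d_model * vocab_size if include_head else 0  # final vocabulary projection
    return n_layers * per_layer + head

def training_flops(n_params: int, n_tokens: float) -> float:
    """Standard C ~ 6*N*D estimate for total training compute."""
    return 6 * n_params * n_tokens

if __name__ == "__main__":
    # Hypothetical small model; at this scale the head is roughly a third of
    # the non-embedding parameters, so dropping it changes compute noticeably.
    cfg = dict(n_layers=12, d_model=768, vocab_size=50432)
    n_tokens = 10e9  # 10B training tokens (arbitrary example)
    for include_head in (True, False):
        n = transformer_params(**cfg, include_head=include_head)
        print(f"head counted={include_head}: N={n:,} params, C={training_flops(n, n_tokens):.3e} FLOPs")
```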
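The Dataset Splits row mentions estimating noise by storing the validation loss on 100 subsamples of the holdout data. The sketch below shows one plausible way to do this (split the holdout into equal subsamples and use the spread of subsample means as a noise estimate); the function name, data layout, and synthetic losses are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def validation_loss_with_noise(token_losses: np.ndarray, n_subsamples: int = 100, seed: int = 0):
    """Return the mean validation loss and an estimate of its sampling noise.

    `token_losses` is a 1-D array of per-token losses over the full holdout set.
    The holdout is split into `n_subsamples` equal parts; the spread of the
    subsample means gives a standard-error estimate for the overall mean.
    """
    rng = np.random.default_rng(seed)
    shuffled = token_losses[rng.permutation(len(token_losses))]
    subsample_means = np.array([chunk.mean() for chunk in np.array_split(shuffled, n_subsamples)])
    return subsample_means.mean(), subsample_means.std(ddof=1) / np.sqrt(n_subsamples)

# Usage with synthetic per-token losses standing in for real model output.
losses = np.random.default_rng(1).normal(loc=3.0, scale=0.5, size=200_000)
mean, noise = validation_loss_with_noise(losses)
print(f"validation loss {mean:.4f} +/- {noise:.4f}")
```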