Observational Scaling Laws and the Predictability of Language Model Performance
Authors: Yangjun Ruan, Chris J. Maddison, Tatsunori B. Hashimoto
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We propose an alternative, observational approach that bypasses model training and instead builds scaling laws from 100 publicly available models. Building a single scaling law from multiple model families is challenging due to large variations in their training compute efficiencies and capabilities. However, we show that these variations are consistent with a simple, generalized scaling law where language model performance is a function of a low-dimensional capability space, and model families only vary in their efficiency in converting training compute to capabilities. Using this approach, we show the surprising predictability of complex scaling phenomena: we show that several emergent phenomena follow a smooth, sigmoidal behavior and are predictable from small models; we show that the agent performance of models such as GPT-4 can be precisely predicted from simpler non-agentic benchmarks; and we show how to predict the impact of post-training interventions like Chain-of-Thought and Self-Consistency as language model capabilities continue to improve. |
| Researcher Affiliation | Collaboration | Yangjun Ruan (1,2,3) yjruan@cs.toronto.edu; Chris J. Maddison (2,3) cmaddis@cs.toronto.edu; Tatsunori Hashimoto (1) thashim@stanford.edu. Affiliations: (1) Stanford University, (2) University of Toronto, (3) Vector Institute |
| Pseudocode | Yes | In Algorithm A.1, we include the detailed algorithm for fitting the observational scaling laws as described in Sec. 3. Algorithm A.1: Fitting observational scaling laws. (A minimal sketch of this fitting procedure appears after the table.) |
| Open Source Code | Yes | We release our code including the implementation and collected data at https://github.com/ryoungj/ObsScaling. Details in scaling law fits: For extracting PC measures, we fixed the number of PCs K = 3 as it covered 97% of the variation in benchmark performance and it consistently yielded the best performance across most of our experiments, see Appx. E.4 for robustness checks on PC selection. |
| Open Datasets | Yes | We collected a broad set of open LMs covering 21 model families (a collection of models across scales such as LLaMA-2 7B, 13B, 70B) and a total of 77 models. These encompass models trained from heterogeneous recipes, including standard training recipes like LLaMA [91], those trained on synthetic data like Phi [50], and models specifically trained on code data like StarCoder [48]. For this analysis, we consider only pretrained base models to avoid the complexities introduced by instruction tuning. We also include an analysis for instruction-tuned models that include proprietary ones like GPT-4 [66] in Appx. E.1, which demonstrates similar results. See Table D.1 for a detailed list of collected models. |
| Dataset Splits | Yes | We validate this through systematic holdouts for the test set, where we split available models into weaker and stronger ones based on either scale or capability (e.g., FLOPs or accuracy). We used the weaker models to fit the scaling law and evaluated the extrapolated predictions on the stronger ones. To prevent any train-test leakage, all preprocessing steps (e.g., PCA imputation) were fitted on the train set only and then applied to the test set. (A sketch of this holdout evaluation appears after the table.) |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) used for running its own experiments or analysis. In the NeurIPS checklist, the authors explicitly state under Question 8 (Experiments Compute Resources) that they did not include this information because 'Our paper does not involve experiments that require significant computational resources and our results are not sensitive to the compute being used, so we did not include this information.' |
| Software Dependencies | No | We primarily sourced results from the Open LLM Leaderboard [8], with updates current as of May 6th, 2024. When there were missing benchmark results, we followed the standardized evaluation protocols of the Open LLM Leaderboard and used the LM Eval Harness [28] library to evaluate the LMs. For HumanEval, we primarily used the EvalPlus [55] library and followed their standardized protocols for evaluation, and sourced the results from the EvalPlus leaderboard when available. While it mentions specific libraries used, it does not provide version numbers for them. |
| Experiment Setup | Yes | For extracting PC measures, we fixed the number of PCs K = 3 as it covered 97% of the variation in benchmark performance and it consistently yielded the best performance across most of our experiments, see Appx. E.4 for robustness checks on PC selection. For the capability-equivalent scale transformation, we used Llama-2 [92] as the reference model family as it is currently the most representative and widely used open model in the community. For better interpretability and visualization, we used the accuracy metric, typically defined as Y = 1 − E, for fitting the scaling laws and making the plots. (A sketch of the capability-equivalent scale transformation appears after the table.) |
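The following is a minimal sketch of the fitting procedure summarized in Algorithm A.1 (Pseudocode row), not the released implementation: it assumes benchmark scores are arranged in a models-by-benchmarks matrix `X` and the downstream metric is an accuracy vector `y` in [0, 1]. The helper names (`extract_pc_measures`, `sigmoid_fit`) and the choice of optimizer are illustrative assumptions.

```python
# Illustrative sketch: extract low-dimensional capability measures via PCA,
# then fit downstream accuracy as a sigmoid of a linear combination of the PCs.
import numpy as np
from scipy.optimize import curve_fit
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def extract_pc_measures(X, k=3):
    """Standardize the models-by-benchmarks score matrix and keep the top-k PCs."""
    scaler = StandardScaler().fit(X)
    pca = PCA(n_components=k).fit(scaler.transform(X))
    pcs = pca.transform(scaler.transform(X))
    # K = 3 is reported to cover ~97% of benchmark variation in the paper.
    print(f"Top-{k} PCs explain {pca.explained_variance_ratio_.sum():.1%} of benchmark variance")
    return pcs, (scaler, pca)

def sigmoid_fit(pcs, y):
    """Fit y ≈ sigmoid(w · PC + b) by nonlinear least squares."""
    def model(P, *theta):
        w, b = np.asarray(theta[:-1]), theta[-1]
        return 1.0 / (1.0 + np.exp(-(P @ w + b)))
    theta0 = np.zeros(pcs.shape[1] + 1)
    theta_hat, _ = curve_fit(model, pcs, y, p0=theta0, maxfev=10_000)
    return (lambda P: model(P, *theta_hat)), theta_hat
```

Predictions for held-out stronger models then come from applying the returned closure to their PC measures.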
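Building on the helpers above, a scale-based holdout as described in the Dataset Splits row might look like the sketch below; the FLOPs cutoff, column names, and squared-error metric are placeholder assumptions, and, as in the paper, all preprocessing is fit on the weaker (training) models only.

```python
# Illustrative holdout: fit on weaker (low-FLOPs) models, extrapolate to stronger ones.
import numpy as np

def holdout_by_flops(X, y, train_flops, cutoff):
    """Uses extract_pc_measures and sigmoid_fit from the previous sketch."""
    weak, strong = train_flops <= cutoff, train_flops > cutoff
    # Fit the scaler/PCA and the sigmoid scaling law on the weaker models only.
    pcs_weak, (scaler, pca) = extract_pc_measures(X[weak], k=3)
    predict, _ = sigmoid_fit(pcs_weak, y[weak])
    # Apply the train-fit preprocessing to the held-out stronger models.
    pcs_strong = pca.transform(scaler.transform(X[strong]))
    return float(np.mean((predict(pcs_strong) - y[strong]) ** 2))
```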
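One way to realize the capability-equivalent scale transformation from the Experiment Setup row (an assumption about implementation, not the authors' exact code) is to fit a log-linear relation between a scalar capability measure (e.g., the first PC or the fitted linear combination of PCs) and training FLOPs within the reference Llama-2 family, then invert that fit to place any model on a Llama-2-equivalent FLOPs axis; Y = 1 − E simply reports accuracy instead of error.

```python
# Illustrative capability-equivalent scale transformation with Llama-2 as reference.
import numpy as np

def fit_reference_trend(capability_ref, flops_ref):
    """Fit capability ≈ a * log10(FLOPs) + b over the reference (Llama-2) family."""
    a, b = np.polyfit(np.log10(flops_ref), capability_ref, deg=1)
    return a, b

def llama2_equivalent_log10_flops(capability, a, b):
    """Invert the reference fit: log10 FLOPs a Llama-2 model would need for this capability."""
    return (capability - b) / a
```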