Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Scaling Laws for Optimal Data Mixtures
Authors: Mustafa Shukor, Louis Bethune, Dan Busbridge, David Grangier, Enrico Fini, Alaaeldin El-Nouby, Pierre Ablin
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate the universality of these scaling laws by demonstrating their predictive power in three distinct and large-scale settings: large language model (LLM), native multimodal model (NMM), and large vision models (LVM) pretraining. We extensively validate our scaling laws in three large-scale settings: large language models (LLMs), native multimodal models (NMMs), and large vision models (LVMs) pretraining. Fig. 3 presents a comparison between the actual loss achieved by our trained models and the loss predicted by our scaling laws. To further quantify this alignment, we report the mean relative error (MRE%) in Tab. 2, which reveals a consistently low MRE% for both laws. |
| Researcher Affiliation | Collaboration | Mustafa Shukor (Sorbonne University); Louis Bethune (Apple); Dan Busbridge (Apple); David Grangier (Apple); Enrico Fini (Apple); Alaaeldin El-Nouby (Apple); Pierre Ablin (Apple) |
| Pseudocode | No | The paper describes algorithms such as L-BFGS and basin-hopping in running text and refers to their use, but it contains no structured pseudocode blocks or explicitly labeled algorithms for the proposed methodology. |
| Open Source Code | No | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: The majority of the data that we use is public, with the exception of HQITP. We will release our code. |
| Open Datasets | Yes | For the main experiments, we use the k = 7 domains from slimpajama [57]. We use these domains as distributed by the authors, without any additional data filtering. For some smaller-scale analyses, we use up to k = 8 domains coming from the Pile dataset [22]: Wikipedia, Stack Exchange, GitHub, pg19, arxiv, free law, openwebtext, and PubMed Central. Following previous works [35, 38, 56] we train on a mixture of multimodal datasets, covering k = 3 data types: (1) text-only data from DCLM [36], (2) interleaved multimodal documents from Obelics [35], and (3) paired image-caption datasets from DFN [20], COYO [10], and a private collection of High-Quality Image-Text Pairs (HQITP). |
| Dataset Splits | Yes | In order to fit the scaling laws, we launch several training runs with different domain weights h, model sizes N, and number of tokens D, and record the loss on the target domain L_T. [...] We randomly partition the domain weights into h_train = [h_1, ..., h_q], of size q, and put the other domain weights into h_test. We fit the scaling law on h_train and report the MRE on the test domain weights. |
| Hardware Specification | No | The paper does not explicitly state the specific models of GPUs, CPUs, or other hardware used for running the experiments. Mentions of "Fully Sharded Data Parallel (FSDP) [71]" refer to a software technique, not specific hardware specifications. |
| Software Dependencies | No | Optimizer: Fully decoupled AdamW [41]. We use Fully Sharded Data Parallel (FSDP) [71]. These refer to optimizers and techniques described in cited papers, not specific software versions. The paper does not list specific version numbers for software libraries or environments. |
| Experiment Setup | Yes | All hyperparameters are described in Tab. 5 and Tab. 7. We closely follow the implementation of [56] and present in Tab. 6 the pre-training hyperparameters for the model configurations used in our scaling laws study. Table 5: Pre-training hyperparameters used for pre-training of LLM to conduct the main scaling laws study. Table 6: Pre-training hyperparameters used for pre-training of NMM to conduct the scaling laws study. Table 7: Pre-training hyperparameters used for the pre-training of LLM with PILE dataset to conduct the analyses. |
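
The evaluation protocol quoted in the Dataset Splits row (randomly partition the domain weights, fit on one part, report MRE on the rest) can be illustrated with a minimal sketch. It is not the authors' code: the function names `split_domain_weights` and `mean_relative_error_pct`, the `seed` parameter, and the example numbers are hypothetical, and MRE is assumed to follow the usual definition, the mean of |predicted - actual| / |actual|, expressed in percent. The scaling-law fit itself is omitted because its functional form is not given in this summary.

```python
import random


def split_domain_weights(weights, q, seed=0):
    """Randomly partition domain-weight vectors into a train set of size q
    and a test set holding the remaining vectors (hypothetical helper)."""
    if not 0 < q <= len(weights):
        raise ValueError("q must be between 1 and the number of weight vectors")
    rng = random.Random(seed)  # fixed seed so the partition is repeatable
    indices = list(range(len(weights)))
    rng.shuffle(indices)
    train = [weights[i] for i in indices[:q]]
    test = [weights[i] for i in indices[q:]]
    return train, test


def mean_relative_error_pct(predicted, actual):
    """Mean relative error in percent, assuming the usual definition:
    100 * mean(|predicted - actual| / |actual|)."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - a) / abs(a) for p, a in zip(predicted, actual))
    return 100.0 * total / len(actual)


if __name__ == "__main__":
    # Made-up mixtures over two domains, for illustration only.
    mixtures = [[0.1, 0.9], [0.5, 0.5], [0.9, 0.1], [0.3, 0.7]]
    h_train, h_test = split_domain_weights(mixtures, q=3)
    print(len(h_train), len(h_test))
    # Made-up predicted and measured losses on held-out mixtures.
    print(mean_relative_error_pct([1.1, 1.8], [1.0, 2.0]))
```

In this sketch a fitted law would be evaluated only on the mixtures in `h_test`, so the reported MRE measures how well the fit extrapolates to domain weights it never saw.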