Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Soup-of-Experts: Pretraining Specialist Models via Parameters Averaging
Authors: Pierre Ablin, Angelos Katharopoulos, Skyler Seto, David Grangier
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate how our approach obtains small specialized models on several language modeling tasks quickly. (Section 3, Experiments) |
| Researcher Affiliation | Industry | ¹Apple. Correspondence to: Pierre Ablin <p EMAIL>. |
| Pseudocode | Yes | Algorithm 1: Sampling from $\mathrm{mix}(h) = \sum_{i=1}^{k} h_i D_i$. Algorithm 2: Pre-training loop for a Soup-of-Experts to minimize the loss function $L(S, E, \omega)$ in Equation 3. Algorithm 3 (Grangier et al., 2024b): Estimating specialist domain weights that are good for a specialized dataset $D_{\mathrm{spe}}$. |
| Open Source Code | No | The paper does not provide any explicit statement or link for open-sourcing the code for the described methodology. |
| Open Datasets | Yes | Pretraining domains: We pre-train language model on Redpajama2 (Weber et al., 2024), a widely used curated web-crawl dataset. Specialization domains: We consider 16 datasets from the PILE (Gao et al., 2020) as target specialization sets: arxiv, dm mathematics, enron emails, europarl, freelaw, github, hackernews, nih exporter, openwebtext, pg19, phil papers, pubmed, stackexchange, ubuntu, uspto, and wikipedia. |
| Dataset Splits | No | The paper mentions using datasets and evaluating on "specialization domains" or "held-out part of these datasets" but does not specify exact percentages, sample counts, or clear train/validation/test splits. |
| Hardware Specification | Yes | Infrastructure: We train each model on 8 A100 GPUs. |
| Software Dependencies | No | The paper mentions algorithms like Adam and Sentence-BERT but does not provide specific version numbers for any software libraries or frameworks used. |
| Experiment Setup | Yes | Table 3. Training hyperparameters. Table 2. Model architectures. |
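
The Pseudocode row quotes Algorithm 1, which samples from the mixture mix(h) = Σ_{i=1}^{k} h_i D_i. The Python sketch below illustrates one common reading of that expression: pick domain i with probability h_i, then draw one example uniformly from D_i. It is not the authors' code (the table records that none was released). The function name `sample_from_mixture`, its parameters, and the uniform within-domain draw are all assumptions made for illustration.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_from_mixture(
    domains: Sequence[Sequence[T]],
    weights: Sequence[float],
    n: int,
    rng: random.Random,
) -> List[T]:
    """Draw n examples from a weighted mixture of domains.

    Hypothetical helper: domains[i] stands in for D_i and weights[i] for h_i.
    Assumes the weights are non-negative and sum to 1.
    """
    if len(domains) != len(weights):
        raise ValueError("need exactly one weight per domain")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-6:
        raise ValueError("weights must be non-negative and sum to 1")

    # Step 1: choose a domain index for every sample, with probability h_i.
    picks = rng.choices(range(len(domains)), weights=weights, k=n)
    # Step 2: draw one example uniformly from each chosen domain
    # (uniform within-domain sampling is an assumption of this sketch).
    return [rng.choice(domains[i]) for i in picks]


if __name__ == "__main__":
    # Toy domains with made-up contents, for illustration only.
    toy_domains = [["web_0", "web_1"], ["code_0"], ["math_0", "math_1"]]
    batch = sample_from_mixture(toy_domains, [0.5, 0.25, 0.25], 8, random.Random(0))
    print(batch)
```

Passing an explicit `random.Random` instance keeps the draw reproducible without touching global random state.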