Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Soup-of-Experts: Pretraining Specialist Models via Parameters Averaging

Authors: Pierre Ablin, Angelos Katharopoulos, Skyler Seto, David Grangier

ICML 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We demonstrate how our approach obtains small specialized models on several language modeling tasks quickly. 3. Experiments
Researcher Affiliation	Industry	1Apple. Correspondence to: Pierre Ablin <p EMAIL>.
Pseudocode	Yes	Algorithm 1: Sampling from $\mathrm{mix}(h) = \sum_{i=1}^{k} h_i D_i$. Algorithm 2: Pre-training loop for a Soup-of-Experts to minimize the loss function $L(S, E, \omega)$ in Equation 3. Algorithm 3 (Grangier et al., 2024b): Estimating specialist domain weights that are good for a specialized dataset $D_{\mathrm{spe}}$.
Open Source Code	No	The paper does not provide any explicit statement or link for open-sourcing the code for the described methodology.
Open Datasets	Yes	Pretraining domains: We pre-train language model on Redpajama2 (Weber et al., 2024), a widely used curated web-crawl dataset. Specialization domains: We consider 16 datasets from the PILE (Gao et al., 2020) as target specialization sets: arxiv, dm mathematics, enron emails, europarl, freelaw, github, hackernews, nih exporter, openwebtext, pg19, phil papers, pubmed, stackexchange, ubuntu, uspto, and wikipedia.
Dataset Splits	No	The paper mentions using datasets and evaluating on "specialization domains" or "held-out part of these datasets" but does not specify exact percentages, sample counts, or clear train/validation/test splits.
Hardware Specification	Yes	Infrastructure: We train each model on 8 A100 GPUs.
Software Dependencies	No	The paper mentions algorithms like Adam and Sentence-BERT but does not provide specific version numbers for any software libraries or frameworks used.
Experiment Setup	Yes	Table 3. Training hyperparameters. Table 2. Model architectures.
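
Illustration for the Pseudocode row: the caption of Algorithm 1 describes sampling from a mixture of k domains D_1, ..., D_k with weights h_1, ..., h_k. The following is a minimal sketch of one common way to do this, not the authors' code (which, per the table, is not released). It assumes the weights are non-negative and sum to 1, and that an example is drawn uniformly within the chosen domain; the function name `sample_from_mixture` and the toy data are hypothetical.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def sample_from_mixture(
    domains: Sequence[Sequence[T]],
    weights: Sequence[float],
    rng: random.Random,
) -> Tuple[int, T]:
    """Draw one example from the mixture sum_i weights[i] * domains[i].

    Assumption (not from the paper): pick a domain index i with
    probability weights[i], then pick an example uniformly from it.
    """
    if len(domains) != len(weights):
        raise ValueError("need exactly one weight per domain")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-6:
        raise ValueError("weights must be non-negative and sum to 1")
    # Step 1: choose the domain according to the mixture weights.
    i = rng.choices(range(len(domains)), weights=weights, k=1)[0]
    # Step 2: choose an example uniformly inside that domain.
    return i, rng.choice(domains[i])


if __name__ == "__main__":
    rng = random.Random(0)
    toy_domains: List[List[str]] = [["web_0", "web_1"], ["code_0", "code_1", "code_2"]]
    print(sample_from_mixture(toy_domains, [0.25, 0.75], rng))
```

Setting one weight to 1 and the rest to 0 reduces the mixture to a single domain, which is a quick sanity check for any implementation of this sampling step.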