Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

GRAPE: Optimize Data Mixture for Group Robust Multi-target Adaptive Pretraining

Authors: Simin Fan, Maria Ios Glarou, Martin Jaggi

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments on ClimbLab and SlimPajama datasets demonstrate that GRAPE consistently outperforms baseline methods in terms of reasoning performance across 6 benchmarks.
Researcher Affiliation	Academia	Simin Fan, Maria Ios Glarou, Martin Jaggi MLO, EPFL EMAIL
Pseudocode	Yes	The pseudocode is presented in Algorithm 1. Algorithm 1 Group Robust Multi-target Adaptive PrEtraining (GRAPE)
Open Source Code	Yes	The full implementation of GRAPE is open-sourced in https://github.com/Olivia-fsm/GRAPE_data_mixture_for_multi_target.
Open Datasets	Yes	We consider two pretraining corpora, ClimbLab [Diao et al., 2025] with K=20 source domains clustered by topics; and SlimPajama with K=7 domains classified by collection sources. ... The source corpus consists of data from K = 6 languages selected from the wiki40b dataset [Guo et al., 2020], including high-resource languages English (en), French (fr), German (de), Spanish (es), Russian (ru) and Italian (it).
Dataset Splits	Yes	For each target task T_n, we use its standard validation set to compute the task loss l_n(θ_t) and the Rate-of-Improvement r_n^(t) needed for GRAPE's updates during training. ... Performance is measured by the language modeling loss, i.e. the log-perplexity (log-PPL), on held-out test sets for each target language.
Hardware Specification	Yes	Experiments were conducted using 4 H100 80GB GPUs.
Software Dependencies	No	We use the AdamW optimizer with standard hyperparameters for LLM pretraining. ... On large-scale runs with 0.7B models, we adopt the WSD scheduler [Hu et al., 2024] where the learning rate remains constant (lrmax = 1.5e-4) during training while linearly decaying to lrmin = 1.5e-5 at the last 20% of total iterations.
Experiment Setup	Yes	We update task weights z every Tz = 100 steps and domain weights α every Tα = 100 steps. Initial weights α0 and z0 are set to uniform distributions. We use the AdamW optimizer with standard hyperparameters for LLM pretraining. ... regularization coefficients µα = 1e-4 and µz = 1.5e-5, corresponding to the step-size of γ/µα = 1.5, γ/µz = 10. Both domain weights (α) and task weights (z) were periodically updated every Tα = Tz = 100 steps.
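
The learning-rate schedule and step sizes quoted in the table can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `wsd_lr`, the `total_steps` argument, and the absence of a warmup phase are assumptions, and the final check assumes that γ equals the peak learning rate, which the record does not state.

```python
def wsd_lr(step: int, total_steps: int,
           lr_max: float = 1.5e-4, lr_min: float = 1.5e-5,
           decay_frac: float = 0.2) -> float:
    """Constant lr_max, then linear decay to lr_min over the last decay_frac of steps.

    Assumption: no warmup phase is modeled, since the record does not describe one.
    """
    decay_steps = max(1, round(total_steps * decay_frac))
    decay_start = total_steps - decay_steps
    if step < decay_start:
        return lr_max
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min((step - decay_start) / decay_steps, 1.0)
    return lr_max + (lr_min - lr_max) * progress


# Step sizes implied by the regularization coefficients,
# assuming gamma is the peak learning rate (hypothetical reading).
gamma = 1.5e-4
mu_alpha, mu_z = 1e-4, 1.5e-5
step_size_alpha = gamma / mu_alpha  # approximately 1.5
step_size_z = gamma / mu_z          # approximately 10
```

Under these assumptions, a hypothetical 1000-step run keeps the rate at 1.5e-4 through step 799 and reaches 1.5e-5 at step 1000. The two ratios match the step sizes quoted in the Experiment Setup row.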