Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

FlexOLMo: Open Language Models for Flexible Data Use

Authors: Weijia Shi, Akshita Bhagia, Kevin Farhat, Niklas Muennighoff, Jacob Morrison, Evan Pete Walsh, Dustin Schwenk, Shayne Longpre, Jake Poznanski, Allyson Ettinger, Daogao Liu, Margaret Li, Mike Lewis, Scott Yih, Dirk Groeneveld, Luca Soldaini, Kyle Lo, Noah A. Smith, Luke Zettlemoyer, Pang Wei W Koh, Hanna Hajishirzi, Ali Farhadi, Sewon Min

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate models with up to 37 billion parameters (20 billion active) on 31 diverse downstream tasks. We show that a general expert trained on public data can be effectively combined with independently trained experts from other data owners, leading to an average 41% relative improvement while allowing users to opt out of certain data based on data licensing or permission requirements. Our approach also outperforms prior model merging methods by 10.1% on average and surpasses the standard MoE trained without data restrictions using the same training FLOPs.
Researcher Affiliation	Collaboration	(a) Allen Institute for AI; (w) University of Washington; (b) University of California, Berkeley; (s) Stanford University; (m) MIT
Pseudocode	No	The paper describes its methodology in text, for example in Section 3 'FLEXOLMO: LMs with Flexible Data Use' and its subsections '3.3 Training Algorithm', '3.3.1 Training Experts to Coordinate', '3.3.2 Domain-Informed Router', and '3.3.3 Optional Router Training on Proxy Data'. While these sections detail the procedural steps of their approach, they do not present them in a structured, code-like pseudocode block or a clearly labeled algorithm format.
Open Source Code	Yes	Code github.com/allenai/FlexOlmo
Open Datasets	Yes	To validate FLEXOLMO, we curate a data mixture called FLEXMIX, which includes a public training set along with seven domain-specific sets (e.g., news, educational text, and Reddit). These domains are chosen to simulate scenarios where high-quality data that can benefit LM training is not publicly available. Our corpus comprises a single Public Mix and seven closed sets, either real or simulated, which are designed to be disjoint from each other. Figure 5 in Appendix B provides the statistics. Public Mix represents general web text based on Common Crawl (CC). Specifically, we took the Baseline version of DCLM [74], excluding news and creative writing content (described below). Code includes code repositories from Starcoder [76, 77] with additional quality filtering as in [78]. Academic includes open-access academic papers obtained from [79]; these are papers from [80, 81] but re-processed using olmOCR [79] for cleaner plain text. Math includes math-relevant content, including web pages about or using math and math problem sets, obtained by combining Dolmino Math Mix [78] and FineMath4+ [82]. Reddit contains posts and comments originally sourced and released by Dolma [83], further filtered and processed to improve quality (details in Appendix B).
Dataset Splits	Yes	We evaluate our models and baselines on a large and diverse collection of well-established benchmarks, consisting of 31 tasks across 10 categories, broadly grouped into (1) general-purpose LM benchmarks and (2) domain-specific evaluations. More details are provided in Appendix C. Table 1: Evaluation of FLEXOLMO trained on four sets (public mix, math, educational text and code), tested on 24 tasks with 100 samples per subtask. Table 2: Evaluation of FLEXOLMO trained on eight sets (public mix and seven simulated closed sets) on 31 tasks across 10 categories, tested with 1,000 samples per subtask. To obtain proxy data D_i, we train a binary classifier to distinguish D_i from D_pub and select public samples with the highest predicted likelihood of belonging to D_i. Specifically, we finetune Snowflake/snowflake-arctic-embed-xs [111], which contains 22M parameters, using a learning rate of 3 × 10^-6. The classifier is trained on a balanced dataset of 500,000 samples (250,000 documents from each source, public and private).
Hardware Specification	Yes	The public model is pretrained using 512 H100 GPUs with a global batch size of 4 million tokens for three days.
Software Dependencies	No	The paper mentions several models and tools used, such as "Llama-3.1-8B-Instruct [107]", "OLMo-2-1124-7B-Instruct [78]", "GritLM/GritLM-7B [13]", and "Snowflake/snowflake-arctic-embed-xs [111]", and describes their finetuning parameters. However, it does not explicitly list specific version numbers for core software dependencies like Python, PyTorch, TensorFlow, or CUDA, which would be necessary for full reproducibility of the software environment.
Experiment Setup	Yes	For the public model M_pub, we use a dense model with 7 billion parameters following the OLMo 2 architecture [78]. This model contains 32 layers with hidden dimension 4,096 and is trained on our public mix for 1 trillion tokens. Following [78], we use a learning rate of 0.0009 and the AdamW optimizer with parameters β1 = 0.9 and β2 = 0.95 and a cosine learning rate scheduler. The public model is pretrained using 512 H100 GPUs with a global batch size of 4 million tokens for three days. Each data owner then takes this checkpoint and performs continued-pretraining for 50 billion tokens on their own data (totaling 400B tokens across all experts). For the optional router training, we use 5 billion tokens in total. The final FLEXOLMO, trained on 8 sets, has 37 billion total parameters with 20 billion active (4 active experts out of 8).
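
The figures quoted in the Experiment Setup row imply a step count and a learning-rate curve for the public model. The following is a minimal sketch of that arithmetic, not the authors' training code. The names PublicModelConfig and cosine_lr are hypothetical. The row does not state a warmup length or a final learning rate, so the sketch assumes no warmup and a decay to zero, and it reads "4 million tokens" as exactly 4,000,000.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PublicModelConfig:
    """Hypothetical container for the values quoted in the Experiment Setup row."""

    peak_lr: float = 9e-4                   # "learning rate of 0.0009"
    beta1: float = 0.9                      # AdamW beta1
    beta2: float = 0.95                     # AdamW beta2
    total_tokens: int = 1_000_000_000_000   # "1 trillion tokens"
    batch_tokens: int = 4_000_000           # assumption: exactly 4e6 tokens per step
    warmup_steps: int = 0                   # assumption: not stated in the row
    min_lr: float = 0.0                     # assumption: not stated in the row

    @property
    def total_steps(self) -> int:
        # Number of optimizer steps needed to consume the token budget.
        return self.total_tokens // self.batch_tokens


def cosine_lr(step: int, cfg: PublicModelConfig) -> float:
    """Learning rate at `step`: optional linear warmup, then cosine decay."""
    if cfg.warmup_steps > 0 and step < cfg.warmup_steps:
        return cfg.peak_lr * (step + 1) / cfg.warmup_steps
    decay_steps = max(1, cfg.total_steps - cfg.warmup_steps)
    progress = min(1.0, (step - cfg.warmup_steps) / decay_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return cfg.min_lr + (cfg.peak_lr - cfg.min_lr) * cosine


cfg = PublicModelConfig()
print("optimizer steps:", cfg.total_steps)
for step in (0, cfg.total_steps // 2, cfg.total_steps):
    print(step, cosine_lr(step, cfg))
```

Under these assumptions the 1 trillion token budget corresponds to 250,000 optimizer steps. Changing warmup_steps or min_lr shifts the curve, which is one reason the exact schedule cannot be recovered from the quoted row alone.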