Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Gemstones: A Model Suite for Multi-Faceted Scaling Laws

Authors: Sean McLeish, John Kirchenbauer, David J. Miller, Siddharth Singh, Abhinav Bhatele, Micah Goldblum, Ashwinee Panda, Tom Goldstein

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this work, we produce a vast array of model checkpoints for studying how model design and model selection impact scaling laws. Our models, called the Gemstones because they are loosely based on scaled-down variants of the Gemma architecture, vary in their parameter count, width/depth ratio, training tokens, learning rates, and cooldown schedules. By fitting scaling laws to these checkpoints, we confirm that scaling law parameters and interpretations indeed depend strongly on the selection of models and fitting procedure used, and we quantify the degree to which these decisions impact predictions. By exploiting the variation among our model checkpoints, we also analyze the impact of architectural shape across loss, benchmark performance and training time with findings consistent with design choices we see in industry models.
Researcher Affiliation	Academia	Sean McLeish1, John Kirchenbauer1, David Yu Miller1, Siddharth Singh1, Abhinav Bhatele1, Micah Goldblum2, Ashwinee Panda1, Tom Goldstein1; 1 University of Maryland, 2 Columbia University
Pseudocode	No	The paper includes a Python code snippet in Appendix M, but it is presented as an implementation detail for FLOP counting rather than a clearly labeled 'Pseudocode' or 'Algorithm' block outlining the main methodology or a specific algorithm being proposed.
Open Source Code	Yes	Code: github.com/mcleish7/gemstone-scaling-laws. We open-source more than 4000 checkpoints cumulatively trained on over 10 trillion tokens. We also open source the fitting code and logged metrics for all runs. We open source all models used in our analysis to Hugging Face [Wolf et al., 2020] and the logging from training on Weights and Biases in json format.
Open Datasets	Yes	As a primary artifact of our research, we release the Gemstones: an open-source scaling law dataset, consisting of over 4000 checkpoints from transformers... We train each model for 350B tokens of Dolma 1.7 [Soldaini et al., 2024] data. Next, following Penedo et al. [2024], we benchmark our Gemstone models on MMLU [Hendrycks et al., 2020], WinoGrande [Sakaguchi et al., 2021], OpenBookQA [Mihaylov et al., 2018], ARC [Clark et al., 2018], CommonsenseQA [Talmor et al., 2018], PIQA [Bisk et al., 2020], SIQA [Sap et al., 2019] and HellaSwag [Zellers et al., 2019].
Dataset Splits	No	We fit all laws using the log perplexity of all trained models on a sample of 100 million tokens from a fixed, held-out validation set from the training distribution. The paper specifies a fixed, held-out validation set of 100 million tokens for fitting scaling laws. While this indicates a specific validation set, it does not provide the split percentage relative to the total training data or other specific details for training/test splits, nor does it explicitly mention how the benchmark datasets were split beyond their standard use.
Hardware Specification	Yes	All models are trained with tensor parallelism [Singh and Bhatele, 2022, Singh et al., 2024] over multiple nodes of AMD MI250X GPUs. To the best of our knowledge, this makes the Gemstone suite of models the largest collection trained on AMD GPUs.
Software Dependencies	No	We train all models using a fork of litgpt [AI, 2023] enhanced with AxoNN [Singh and Bhatele, 2022, Singh et al., 2024] tensor parallelism. The paper mentions `litgpt` and `AxoNN` but does not provide specific version numbers for these software components.
Experiment Setup	Yes	For the main set of training runs, we train each model for 350B tokens of Dolma 1.7 [Soldaini et al., 2024] data. We target a total batch size of 4 million tokens following [Touvron et al., 2023b, Dubey et al., 2024, Bai et al., 2023], with a context length of 2048 and a world batch size of 2048 sequences. Following Hägele et al. [2024] and Hu et al. [2024], we use a linear learning rate warm up over 80 million tokens, and then train at a constant learning rate, which we adjust for model size as described in Appendix A.1. We train with AdamW [Loshchilov and Hutter, 2017] with β parameters 0.9 and 0.95 and a weight decay of 0.1. We do not apply weight decay to the bias or normalization parameters.
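
The numbers quoted in the Experiment Setup row can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not taken from the paper's code: the function names, the `peak_lr` argument, and the name-based rule for spotting bias and normalization parameters are all assumptions made for this example. The paper sets the actual learning rate per model size in its Appendix A.1, which is not reproduced here.

```python
import math

# Values quoted in the Experiment Setup row.
CONTEXT_LENGTH = 2048            # tokens per sequence
SEQUENCES_PER_STEP = 2048        # "world batch size" in sequences
WARMUP_TOKENS = 80_000_000       # linear warmup span
TOTAL_TOKENS = 350_000_000_000   # training budget per model
WEIGHT_DECAY = 0.1


def tokens_per_step(context_length=CONTEXT_LENGTH, sequences=SEQUENCES_PER_STEP):
    """Tokens consumed by one optimizer step (the '4 million token' batch)."""
    return context_length * sequences


def lr_at(tokens_seen, peak_lr, warmup_tokens=WARMUP_TOKENS):
    """Linear warmup measured in tokens, then a constant rate.

    peak_lr is a hypothetical argument; the paper adjusts it by model size.
    """
    if tokens_seen >= warmup_tokens:
        return peak_lr
    return peak_lr * tokens_seen / warmup_tokens


def weight_decay_for(param_name, weight_decay=WEIGHT_DECAY):
    """Return 0.0 for bias and normalization parameters, else weight_decay.

    Matching on the parameter name is an assumption of this sketch; real
    code bases often inspect module types or tensor shapes instead.
    """
    lowered = param_name.lower()
    if lowered.endswith("bias") or "norm" in lowered:
        return 0.0
    return weight_decay


if __name__ == "__main__":
    per_step = tokens_per_step()
    print("tokens per step:", per_step)
    print("steps to cover warmup:", math.ceil(WARMUP_TOKENS / per_step))
    print("steps to cover budget:", math.ceil(TOTAL_TOKENS / per_step))
```

With 2048 sequences of 2048 tokens, one step consumes 4,194,304 tokens, which matches the stated target of roughly 4 million tokens per batch.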