Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Analyzing Similarity Metrics for Data Selection for Language Model Pretraining

Authors: Dylan Sam, Ayan Chakrabarti, Afshin Rostamizadeh, Srikumar Ramalingam, Gui Citovsky, Sanjiv Kumar

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments are performed on the Pile, for pretraining a 1.7B parameter language model on 200B tokens. We believe our analysis and evaluation framework serves as a foundation for the future design of embeddings that specifically reason about similarity in pretraining datasets.
Researcher Affiliation | Collaboration | Dylan Sam (Carnegie Mellon University); Ayan Chakrabarti (Google Research); Afshin Rostamizadeh (Google Research); Srikumar Ramalingam (Google Research); Gui Citovsky (Google Research); Sanjiv Kumar (Google Research)
Pseudocode | No | The paper describes methods in prose, without structured pseudocode or algorithm blocks.
Open Source Code | No | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: No code is provided, but all experimental details and evaluation metrics are clearly defined.
Open Datasets | Yes | We conduct all of our experiments using the Pile [Gao et al., 2020] as our data corpus, and in the context of pretraining a 1.7B parameter decoder-only language model with a UL2 objective [Tay et al., 2022] on 200B tokens.
Dataset Splits | Yes | We use a selection budget of 200B tokens, or approximately 20% of the Pile. This corresponds to roughly 170 million clusters for each embedding model, with an average cluster size of 5 examples.
Hardware Specification | Yes | Pretraining experiments for our 1.7B parameter language models are run on 512 v5 TPUs, where each pretraining run takes approximately 3 days. Training our proxy 200M parameter model took less than 1 day on 64 v5 TPUs.
Software Dependencies | No | The paper mentions using 'scikit-learn' for random projections but does not specify any version numbers for software or libraries.
Experiment Setup | Yes | We pretrain with a learning rate of 0.001 with a linear decay and a batch size of 1024. For our tokenizer, we use sentencepiece with a vocabulary size of 256k tokens. Clustering: For performing RAC clustering for our pretraining experiments, we use a value of ϵ as the particular diameter of clusters: USE: ϵ = 0.2, Gecko: ϵ = 0.2, BERT: ϵ = 0.001, LM Token Embeds: ϵ = 0.001, LM Output Embeds: ϵ = 0.03
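
The reported setup can be collected into a small configuration sketch, which also makes the arithmetic implied by the table explicit. This is an illustration only, since the paper releases no code. The names `PretrainSetup` and `linear_decay_lr`, the `total_steps` argument, and the final learning rate of 0 are assumptions introduced here; the table gives only the starting learning rate, the decay shape, the batch size, and the per-embedding ϵ values.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PretrainSetup:
    """Values quoted in the table above; the class itself is hypothetical."""
    learning_rate: float = 1e-3
    batch_size: int = 1024
    # RAC cluster-diameter threshold (epsilon) reported per embedding model.
    rac_epsilon: dict = field(default_factory=lambda: {
        "USE": 0.2,
        "Gecko": 0.2,
        "BERT": 0.001,
        "LM Token Embeds": 0.001,
        "LM Output Embeds": 0.03,
    })


def linear_decay_lr(step, total_steps, start_lr=1e-3, final_lr=0.0):
    """Linear decay from start_lr to final_lr.

    ASSUMPTION: total_steps and final_lr are not reported in the source;
    they are placeholders chosen for illustration.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    frac = min(max(step / total_steps, 0.0), 1.0)  # clamp progress to [0, 1]
    return start_lr + (final_lr - start_lr) * frac


# Back-of-the-envelope figures implied by the reported numbers.
selection_budget_tokens = 200_000_000_000                  # 200B tokens
implied_pile_tokens = selection_budget_tokens * 100 // 20  # budget is ~20% of the Pile
implied_examples = 170_000_000 * 5                         # clusters x average cluster size
main_run_tpu_days = 512 * 3                                # 512 TPUs for ~3 days per run
proxy_run_tpu_days_upper_bound = 64 * 1                    # 64 TPUs for less than 1 day
```

Under these figures, the 20% budget implies a corpus of roughly 1T tokens under the paper's tokenizer, about 850 million clustered examples, and approximately 1,536 TPU-days per 1.7B pretraining run. All three are approximations, because the source numbers are themselves rounded.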