Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Analyzing Similarity Metrics for Data Selection for Language Model Pretraining
Authors: Dylan Sam, Ayan Chakrabarti, Afshin Rostamizadeh, Srikumar Ramalingam, Gui Citovsky, Sanjiv Kumar
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments are performed on the Pile, for pretraining a 1.7B parameter language model on 200B tokens. We believe our analysis and evaluation framework serves as a foundation for the future design of embeddings that specifically reason about similarity in pretraining datasets. |
| Researcher Affiliation | Collaboration | Dylan Sam (Carnegie Mellon University); Ayan Chakrabarti (Google Research); Afshin Rostamizadeh (Google Research); Srikumar Ramalingam (Google Research); Gui Citovsky (Google Research); Sanjiv Kumar (Google Research) |
| Pseudocode | No | The paper describes methods in prose, without structured pseudocode or algorithm blocks. |
| Open Source Code | No | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: No code is provided, but all experimental details and evaluation metrics are clearly defined. |
| Open Datasets | Yes | We conduct all of our experiments using the Pile [Gao et al., 2020] as our data corpus, and in the context of pretraining a 1.7B parameter decoder-only language model with a UL2 objective [Tay et al., 2022] on 200B tokens. |
| Dataset Splits | Yes | We use a selection budget of 200B tokens, or approximately 20% of the Pile. This corresponds to roughly 170 million clusters for each embedding model, with an average cluster size of 5 examples. |
| Hardware Specification | Yes | Pretraining experiments for our 1.7B parameter language models are run on 512 v5 TPUs, where each pretraining run takes approximately 3 days. Training our proxy 200M parameter model took less than 1 day on 64 v5 TPUs. |
| Software Dependencies | No | The paper mentions using 'scikit-learn' for random projections but does not specify any version numbers for software or libraries. |
| Experiment Setup | Yes | We pretrain with a learning rate of 0.001 with a linear decay and a batch size of 1024. For our tokenizer, we use sentencepiece with a vocabulary size of 256k tokens. Clustering: For performing RAC clustering for our pretraining experiments, we use a value of ϵ as the particular diameter of clusters: USE: ϵ = 0.2, Gecko: ϵ = 0.2, BERT: ϵ = 0.001, LM Token Embeds: ϵ = 0.001, LM Output Embeds: ϵ = 0.03. |
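
The reported experiment setup can be collected into a small configuration, shown here as a minimal Python sketch rather than the authors' code (no code was released). The peak learning rate, batch size, and per-embedding ϵ values are taken from the table above. The names `PretrainConfig`, `RAC_EPSILON`, and `linear_decay_lr` are hypothetical, and the total step count, the decay endpoint of 0, and the exact vocabulary size behind "256k" are assumptions, since the summary does not state them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PretrainConfig:
    peak_learning_rate: float = 1e-3  # reported: learning rate of 0.001
    batch_size: int = 1024            # reported
    vocab_size: int = 256_000         # reported as "256k"; exact count assumed
    total_steps: int = 100_000        # ASSUMPTION: not reported in the summary


# Reported RAC cluster diameter (epsilon) for each embedding model.
RAC_EPSILON = {
    "USE": 0.2,
    "Gecko": 0.2,
    "BERT": 0.001,
    "LM Token Embeds": 0.001,
    "LM Output Embeds": 0.03,
}


def linear_decay_lr(step: int, cfg: PretrainConfig, final_lr: float = 0.0) -> float:
    """Linearly interpolate from the peak rate to final_lr over cfg.total_steps.

    ASSUMPTION: the decay ends at final_lr = 0; the summary only says "linear decay".
    Steps outside [0, total_steps] are clamped to that range.
    """
    clamped = min(max(step, 0), cfg.total_steps)
    frac = clamped / cfg.total_steps
    return cfg.peak_learning_rate + frac * (final_lr - cfg.peak_learning_rate)


if __name__ == "__main__":
    cfg = PretrainConfig()
    for step in (0, cfg.total_steps // 2, cfg.total_steps):
        print(step, linear_decay_lr(step, cfg))
    for name, eps in RAC_EPSILON.items():
        print(f"{name}: epsilon = {eps}")
```

Keeping the ϵ values in one mapping makes it easy to see that they span more than two orders of magnitude across embedding models, which suggests the diameter is tuned per embedding space rather than shared.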