Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Efficient Data Selection at Scale via Influence Distillation

Authors: Mahdi Nikdan, Vincent Cohen-Addad, Dan Alistarh, Vahab Mirrokni

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments show that Influence Distillation matches or outperforms state-of-the-art performance while achieving up to 3.5× faster selection. (Abstract) and In this section, we evaluate Influence Distillation across several challenging tasks. We start by detailing the datasets, models, and hyperparameters used in our experiments. Then we present our main results and ablations. (Section 5, first paragraph).
Researcher Affiliation	Collaboration	Mahdi Nikdan (ISTA & Google Research), Vincent Cohen-Addad (Google Research), Dan Alistarh (ISTA & Red Hat AI), Vahab Mirrokni (Google Research), and Correspondence to EMAIL and EMAIL.
Pseudocode	No	The paper describes the methodology using mathematical formulations and descriptive text, but does not include a distinct 'Pseudocode' or 'Algorithm' block with structured steps.
Open Source Code	Yes	The code will be uploaded as a zip file in supplementary material, along with instructions and commands for reproducing the results. (NeurIPS Paper Checklist, Question 5)
Open Datasets	Yes	We use Tulu V2 [Ivison et al., 2023], a combination of 9 instruction-tuning datasets containing approximately 5.8 million samples. ... We evaluate on six target datasets: MMLU [Hendrycks et al., 2021a,b], GSM8k [Cobbe et al., 2021], BBH [Suzgun et al., 2022], TyDIQA [Clark et al., 2020], Codex [Chen et al., 2021], and SQuAD [Rajpurkar et al., 2016]. For each, we assume access to 8-500 examples from their train, dev, or eval splits. Details are in Appendix E. In our running example, we use the CIFAR-10 dataset [Krizhevsky, 2009]. Appendix E.1 also lists licenses for Tulu V2 (ODC-BY License), MMLU (MIT License), GSM8K (MIT License), Big-Bench-Hard (MIT License), TyDIQA (Apache-2.0 License), Codex (MIT License), and SQuAD (CC BY-SA 4.0 License).
Dataset Splits	Yes	Unless stated otherwise, we randomly sample 200k examples from Tulu V2, and then use sampling methods to pick a subset of 10k samples from this pool. ... We evaluate on six target datasets: MMLU [Hendrycks et al., 2021a,b], GSM8k [Cobbe et al., 2021], BBH [Suzgun et al., 2022], TyDIQA [Clark et al., 2020], Codex [Chen et al., 2021], and SQuAD [Rajpurkar et al., 2016]. For each, we assume access to 8-500 examples from their train, dev, or eval splits. Details are in Appendix E. Appendix E.1 provides further specifics for MMLU, GSM8k, Big-Bench-Hard, and SQuAD. For example, MMLU ... It includes 5 development samples per category and a total of 14,042 test samples. We use the development samples as our target set and evaluate the final model zero-shot on the test set.
Hardware Specification	Yes	All experiments are conducted on a single H100 GPU
Software Dependencies	No	The paper mentions the 'AdamW optimizer' and references the 'SciPy library [Virtanen et al., 2020]' for a specific algorithm in a running example, but it does not give version numbers for the key software components or libraries used in the main experimental setup, which are required for reproducibility.
Experiment Setup	Yes	Hyperparameters. We use the AdamW optimizer with a learning rate of 2 × 10^-5 and a linear schedule for 2 epochs. The sequence length is fixed at 2048, and we use a microbatch size of 1 with gradient accumulation over 128 steps. All experiments are conducted on a single H100 GPU, and each is repeated with 3 random seeds, including the selection of 200k samples from Tulu V2. By default, we use first-order Influence Distillation with 4096 landmarks. We select the landmarks uniformly at random, as we find this performs comparably to more complex methods such as leverage score sampling (see Appendix K.4). Linear coefficients are computed via Kernel Ridge Regression (KRR) with an RBF kernel and dampening of 0.01. JVP embeddings are obtained from the first four transformer blocks using two random vectors (ℓ = 4, |V| = 2), following a brief warm-up on 10k random samples. The model is then reset and trained on the selected subset. This warm-up is needed to stabilize gradients (see Appendix A). Gradients are projected to 131072 dimensions via Hadamard projections; we use the largest dimension that fits in GPU memory, as projection cost does not depend on the dimension (Appendix I). After selection, we do not incorporate the sample weights during training, as experiments in Appendix J suggest this does not improve performance. (Section 5.1)
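
The Experiment Setup row mentions that linear coefficients are computed via Kernel Ridge Regression with an RBF kernel and a dampening of 0.01. As a rough illustration of what such a step can look like, here is a minimal NumPy sketch. It is not the authors' code: the function names, the toy embedding sizes, the default kernel width `gamma`, and the reading of "dampening" as a ridge term added to the diagonal of the landmark kernel matrix are all assumptions made for this example.

```python
import numpy as np


def rbf_kernel(a, b, gamma):
    """RBF kernel matrix between the rows of a and the rows of b."""
    sq_dist = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
    # Clamp tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))


def krr_coefficients(landmark_emb, pool_emb, damping=0.01, gamma=None):
    """Return C of shape (n_pool, n_landmarks).

    C @ y is the KRR prediction for every pool sample, given values y
    observed at the landmarks: C = K_pl (K_ll + damping * I)^-1.
    """
    if gamma is None:
        # Assumed default kernel width; the source does not state one.
        gamma = 1.0 / landmark_emb.shape[1]
    k_ll = rbf_kernel(landmark_emb, landmark_emb, gamma)
    k_pl = rbf_kernel(pool_emb, landmark_emb, gamma)
    ridge = k_ll + damping * np.eye(k_ll.shape[0])
    # ridge is symmetric, so solving against k_pl.T and transposing gives C.
    return np.linalg.solve(ridge, k_pl.T).T


# Toy usage: 50 pool samples with 16-dim embeddings, 8 random landmarks.
rng = np.random.default_rng(0)
pool_emb = rng.standard_normal((50, 16))
landmark_idx = rng.choice(50, size=8, replace=False)
coeffs = krr_coefficients(pool_emb[landmark_idx], pool_emb)

# Hypothetical per-landmark scores, propagated to the whole pool.
landmark_influence = rng.standard_normal(8)
pool_influence = coeffs @ landmark_influence
print(coeffs.shape, pool_influence.shape)
```

In this sketch the expensive quantity (here a made-up influence score) is only needed at the landmarks, and every other pool sample receives an estimate as a linear combination of landmark values. The paper's setting uses 4096 landmarks and JVP embeddings rather than the random toy data used here.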