Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Final-Model-Only Data Attribution with a Unifying View of Gradient-Based Methods

Authors: Dennis Wei, Inkit Padhi, Soumya Ghosh, Amit Dhurandhar, Karthikeyan Natesan Ramamurthy, Maria Chang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We investigate empirically the quality of these gradient-based approximations to further training, for tabular, image, and text datasets and models. We find that the approximation quality of first-order methods is sometimes high but decays with the amount of further training. In contrast, the approximations given by influence function methods are more stable but surprisingly lower in quality.
Researcher Affiliation	Industry	Dennis Wei, IBM Research; Inkit Padhi, IBM Research; Soumya Ghosh, Merck Research Labs; Amit Dhurandhar, IBM Research; Karthikeyan Natesan Ramamurthy, IBM Research; Maria Chang, IBM Research
Pseudocode	No	The paper describes methods and derivations in mathematical form and prose, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	We provide code to help reproduce our experiments at https://github.com/IBM/fimoda.
Open Datasets	Yes	We used four tabular datasets: two for regression, Concrete Strength and Energy Efficiency from the UCI repository [33] following [27], and two larger ones for classification, FICO Challenge [34] and Folktables [35]. For image data, we chose the CIFAR-10 image classification dataset [36], while for text, we used the SST-2 sentiment classification dataset [37], which is part of the GLUE benchmark [38].
Dataset Splits	Yes	For Concrete, Energy, and FICO, we split the dataset 90%-10% into training and test sets (using the scikit-learn [48] package's train_test_split() with random_state=0) and standardized the features to have zero mean and unit variance. ... For Folktables, ... The dataset was split 75%-25% into training and test (again using train_test_split() with random_state=0)... For CIFAR-10, we used the given split into training and test sets. ... We used a random subset of 1000 samples as validation set. For SST-2, we used the given split into training, validation, and test sets.
Hardware Specification	Yes	Experiments were run on an internal computing cluster providing nodes with 32 GB of CPU memory, V100 GPUs with 32 GB of GPU memory, and occasionally A100 GPUs with 40 or 80 GB of GPU memory. V100s were sufficient for all training however. One CPU and one GPU were used at a time.
Software Dependencies	No	The paper mentions PyTorch [50], scikit-learn [48], and AdamW [41], as well as various GitHub repositories, but does not provide specific version numbers for these software dependencies.
Experiment Setup	Yes	For all tabular datasets, we used a 2-hidden-layer multi-layer perceptron (MLP) with 128 units in each hidden layer. ... trained using SGD for T = 1000 epochs and a batch size of 128 ... Learning rates were as follows: 0.3 for Concrete and Energy, 0.001 for FICO, and 0.01 for Folktables. For CIFAR-10, we used a ResNet-9 architecture [39] and trained it using SGD to minimize cross-entropy loss... for T = 50 epochs. We trained with a batch size of 512, learning rate of 0.4, and weight decay of 0.001. For SST-2, we fine-tuned a pre-trained BERT model [40] using AdamW [41]... with a batch size of 64, learning rate of 10^-5, zero weight decay, and gradient norm clipping at a threshold of 1. For the case in which the further training algorithm A' is the same as the initial training algorithm A, the learning rate for A' was chosen to be one order of magnitude smaller than that for A, i.e., 0.03 for Concrete and Energy, 10^-4 for FICO and Folktables, 0.04 for CIFAR-10, 10^-6 for SST-2. The maximum number of epochs T' for A' was also chosen to be a fraction of T: T' = 500 for Concrete and Energy following [27], T' = 200 for FICO, T' = 25 for Folktables, T' = 10 for CIFAR-10, and T' = 1 for SST-2.
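
The hyperparameters quoted in the Experiment Setup row are easier to compare side by side when collected into a small table. The following is a minimal sketch in Python, not code from the paper or from the IBM/fimoda repository. The class name, field names, and helper functions are hypothetical. The powers of ten are read as negative exponents (for example 10^-5 for the SST-2 learning rate), which is an assumption, and the batch size of 128 is assumed to apply to all four tabular datasets. The number of initial training epochs for SST-2 is not stated in the row and is left unset.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TrainingSetup:
    """One dataset's settings as quoted in the table (field names are hypothetical)."""
    optimizer: str
    batch_size: int
    lr_initial: float              # learning rate for initial training
    epochs_initial: Optional[int]  # T; None where the row does not state it
    lr_further: float              # learning rate for further training
    epochs_further: int            # maximum number of epochs of further training


# Values transcribed from the Experiment Setup row. Powers of ten are assumed
# to be negative exponents; batch size 128 is assumed for every tabular dataset.
SETUPS = {
    "Concrete":   TrainingSetup("SGD", 128, 0.3, 1000, 0.03, 500),
    "Energy":     TrainingSetup("SGD", 128, 0.3, 1000, 0.03, 500),
    "FICO":       TrainingSetup("SGD", 128, 0.001, 1000, 1e-4, 200),
    "Folktables": TrainingSetup("SGD", 128, 0.01, 1000, 1e-4, 25),
    "CIFAR-10":   TrainingSetup("SGD", 512, 0.4, 50, 0.04, 10),
    "SST-2":      TrainingSetup("AdamW", 64, 1e-5, None, 1e-6, 1),
}


def lr_ratio(setup: TrainingSetup) -> float:
    """Further-training learning rate divided by the initial learning rate."""
    return setup.lr_further / setup.lr_initial


def epoch_fraction(setup: TrainingSetup) -> Optional[float]:
    """Further-training epochs as a fraction of initial epochs, if both are known."""
    if setup.epochs_initial is None:
        return None
    return setup.epochs_further / setup.epochs_initial


# Print one summary line per dataset.
for name, setup in SETUPS.items():
    frac = epoch_fraction(setup)
    frac_text = "n/a" if frac is None else f"{frac:.3f}"
    print(f"{name:<11} lr ratio = {lr_ratio(setup):.3g}  epoch fraction = {frac_text}")
```

With the values as quoted, the learning-rate ratio comes out at one tenth for every dataset except Folktables, where the quoted 10^-4 is two orders of magnitude below the quoted initial rate of 0.01. The row alone does not show whether that difference is in the paper or arose during extraction, so the sketch records the quoted values unchanged.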