Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Rescaled Influence Functions: Accurate Data Attribution in High Dimension

Authors: Ittai Rubinstein, Samuel Hopkins

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We now present empirical findings on the accuracy of RIF estimates for leave-T-out effects. Our experimental setup is inspired by the seminal work of [KL17, KATL19], who assess the accuracy of influence function estimates using logistic regression as a testbed. We compare IF, NS, and RIF estimates across the first five datasets in Table 1, spanning vision, NLP, and audio classification tasks.
Researcher Affiliation	Academia	Ittai Rubinstein EECS and CSAIL MIT Cambridge, MA EMAIL Samuel B. Hopkins EECS and CSAIL MIT Cambridge, MA EMAIL
Pseudocode	No	The paper describes the methods and theoretical results using mathematical formulations and textual explanations (e.g., Section 1.1, Section 3, Appendix A), but it does not include any explicitly structured pseudocode or algorithm blocks.
Open Source Code	Yes	An implementation of our experiments is available at github.com/ittai-rubinstein/rescaled-influence-functions. This appendix provides a concise overview of the procedures implemented in the accompanying code. In the supplemental material, we include a library that can be used to reproduce all the experimental results in our paper and we plan to include a link to a public git repository with the same library in the camera ready version of the paper.
Open Datasets	Yes	Table 1: Summary of datasets used in our experiments. ... ESC-50 dataset embedded using OpenL3; artificial vs natural classification [Pic15, CWSB19] ... ResNet-50 embeddings of CIFAR-10 cat and dog classes [Kri09, Tor16] ... Inception v3 embeddings of dog and fish images from ImageNet [SVI+16, RDS+15] ... Bag-of-words embeddings of the standard spam vs ham dataset [KATL19, MAP06] ... BERT embeddings of the IMDB sentiment dataset [MDP+11, DCLT19]
Dataset Splits	Yes	We embed these audio samples using last-layer embeddings of the OpenL3 python library [CWSB19]. This produces d = 512 dimensional embeddings, and we separate them into train and test samples using a random 80-20 train-test split.
Hardware Specification	Yes	All experiments were conducted on a server equipped with 64GB RAM, 2 IBM POWER9 CPU cores, and 4 NVIDIA Tesla V100 SXM2 GPUs (each with 32GB memory).
Software Dependencies	Yes	Table 4: License summary for pretrained models and libraries. ... OpenL3 v0.4.2 ... ResNet-50 (TorchVision) v0.13.1 ... Inception v3 ... spaCy v3.8.2 ... BERT (Transformers) bert-base-uncased (v4.36.2)
Experiment Setup	Yes	We fit all the logistic regression models using the scipy.optimize.minimize function to train the model using L-BFGS-B, and set a very strict stopping criterion to ensure that we converge to the global optimum and suppress dependencies on the initial weights when using a warm-start retrain. For the DogFish and Enron datasets also considered by Koh et al., we used the same L2 regularization parameter, and for all new datasets, we set the regularization to 1E-5. ... For each such size k, we construct removal sets of size k using the following strategies: 1. Clustered Samples: ... 2. Top Percentile Samples: ... 3. Random Subsets: k samples selected uniformly at random.
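
The Experiment Setup row can be illustrated with a minimal sketch, which is not the authors' code. It fits an L2-regularized logistic regression with scipy.optimize.minimize and L-BFGS-B, removes a random subset of k samples, and retrains from a warm start. The helper names (logistic_loss_and_grad, fit_logreg), the synthetic data, the mean-loss normalization, the tolerance values, and the regularization value used here are all assumptions made for illustration; the paper's exact settings are in its repository.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit


def logistic_loss_and_grad(w, X, y, lam):
    """Mean logistic loss with an L2 penalty, plus its gradient.

    Labels y are in {0, 1}. Averaging the loss over samples is an
    assumption of this sketch, not a detail taken from the paper.
    """
    z = X @ w
    # log(1 + exp(z)) - y * z, computed stably via logaddexp
    loss = np.mean(np.logaddexp(0.0, z) - y * z) + 0.5 * lam * (w @ w)
    grad = X.T @ (expit(z) - y) / len(y) + lam * w
    return loss, grad


def fit_logreg(X, y, lam, w0=None):
    """Fit with L-BFGS-B; w0 allows a warm start from earlier weights."""
    if w0 is None:
        w0 = np.zeros(X.shape[1])
    result = minimize(
        logistic_loss_and_grad,
        w0,
        args=(X, y, lam),
        jac=True,  # the objective returns (loss, gradient)
        method="L-BFGS-B",
        # Tight tolerances chosen for illustration, not the paper's values.
        options={"ftol": 1e-15, "gtol": 1e-10, "maxiter": 10_000},
    )
    return result.x


# Synthetic stand-in for an embedded dataset (hypothetical data).
rng = np.random.default_rng(0)
n, d = 400, 8
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (rng.random(n) < expit(X @ w_true)).astype(float)

lam = 1e-2  # larger than 1E-5 so this toy problem is well conditioned
w_full = fit_logreg(X, y, lam)

# "Random Subsets" strategy: remove k samples chosen uniformly at random,
# then retrain with a warm start from the full-data weights.
k = 20
removed = rng.choice(n, size=k, replace=False)
keep = np.setdiff1d(np.arange(n), removed)
w_warm = fit_logreg(X[keep], y[keep], lam, w0=w_full)

# Cold-start retrain for comparison.
w_cold = fit_logreg(X[keep], y[keep], lam)
```

Because the regularized objective is strongly convex, it has a single minimizer, so a sufficiently strict stopping rule should make the warm-start and cold-start solutions agree closely. Comparing w_warm with w_cold is one way to check that the result does not depend on the initial weights.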