Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Enhancing Training Data Attribution with Representational Optimization

Authors: Weiwei Sun, Haokun Liu, Nikhil Kandpal, Colin A Raffel, Yiming Yang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments on instruction-tuned LLMs demonstrate that AirRep achieves performance on par with state-of-the-art gradient-based approaches while being nearly two orders of magnitude more efficient at inference time. Further analysis highlights its robustness and generalization across tasks and models.
Researcher Affiliation	Academia	Weiwei Sun¹, Haokun Liu², Nikhil Kandpal², Colin Raffel², Yiming Yang¹; ¹Carnegie Mellon University, ²University of Toronto & Vector Institute
Pseudocode	No	The paper describes the data generation pipeline and training objective in Section 3.2, and provides mathematical derivations in Appendices A and B, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	Our code is available at https://github.com/sunnweiwei/AirRep.
Open Datasets	Yes	Building on the setup of datamodels [1], we apply our approach on FLAN [14], an instruction-tuning dataset, and UltraChat [15], a large-scale dialogue generation dataset, and evaluate the model on five unseen instruction-tuning test sets (FLAN, Alpaca, Tulu, SafeRLHF) that do not appear in the training data.
Dataset Splits	Yes	To generate training signal, we set N_v = 10^4 and N_t = 10^5. The number of training subsets is M = 100, with each subset containing n = 1,000 samples. We construct 100 cross-validation instances. Thus, in total, the data includes 10^4 unique training subsets and 10^7 training examples.
Hardware Specification	Yes	This results in a total of 10M training examples for the Qwen2.5-0.5B LM, requiring about 20 hours on eight A100 GPUs.
Software Dependencies	No	The paper mentions the use of the Qwen2.5 model family [36], AdamW optimizer [37], Logix software library for LoGra, and sklearn for TF-IDF, but does not provide specific version numbers for these or other key software components like Python, PyTorch, or CUDA.
Experiment Setup	Yes	Model: Our experiments focus on LM finetuning, using the Qwen2.5 model family [36] as our base LMs. During training, we start with the base LM and fine-tune it using a batch size of 32 and the AdamW optimizer [37] with a learning rate of 2e-5 for two epochs.
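
The totals quoted in the Dataset Splits row can be cross-checked against the 10M figure in the Hardware Specification row with a little arithmetic. The sketch below is not taken from the paper or its repository. The variable names are illustrative, only the numeric values come from the quoted text, and it assumes that every cross-validation instance contributes M subsets and that "20 hours on eight A100 GPUs" means wall-clock time.

```python
# Illustrative cross-check of the counts quoted in the table above.
# All names are hypothetical; only the numbers come from the quoted text.

subsets_per_instance = 100   # M, training subsets per cross-validation instance
samples_per_subset = 1_000   # n, samples in each subset
cv_instances = 100           # cross-validation instances
gpu_count = 8                # A100 GPUs
wall_clock_hours = 20        # "about 20 hours", assumed to be wall-clock time

# Assumption: subsets are counted once per instance, with no deduplication.
total_subsets = subsets_per_instance * cv_instances
total_examples = total_subsets * samples_per_subset

# Rough compute budget implied by the hardware row.
gpu_hours = gpu_count * wall_clock_hours

if __name__ == "__main__":
    print(f"training subsets:  {total_subsets:,}")
    print(f"training examples: {total_examples:,}")
    print(f"approx. GPU-hours: {gpu_hours}")
```

Under these assumptions the two rows agree: 100 × 100 subsets of 1,000 samples each gives 10^7 examples, the same as the 10M quoted for the Qwen2.5-0.5B run.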