Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

GraSS: Scalable Data Attribution with Gradient Sparsification and Sparse Projection

Authors: Pingbang Hu, Joseph Melkonian, Weijing Tang, Han Zhao, Jiaqi Ma

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments demonstrate the effectiveness of our approach, achieving substantial speedups while preserving data influence fidelity. In particular, FACTGRASS achieves up to 165% faster throughput on billion-scale models compared to the previous state-of-the-art baselines. ... In this section, we evaluate the effectiveness of GRASS and FACTGRASS in terms of accuracy and efficiency. Specifically, in Section 4.1, we first perform the standard counterfactual evaluations to quantitatively study the data valuation accuracy of GRASS and FACTGRASS on small-scale setups. Then, we scale FACTGRASS to a billion-scale model and billion-token dataset, where we investigate the qualitative accuracy and memory/compute efficiency in Section 4.2.
Researcher Affiliation	Collaboration	(1) University of Illinois Urbana-Champaign; (2) Womp Labs; (3) Carnegie Mellon University
Pseudocode	No	The paper describes algorithms like GRASS and FACTGRASS and illustrates their components with figures (e.g., Figures 6, 7, 8). However, it does not contain clearly labeled pseudocode or algorithm blocks with structured steps in a code-like format.
Open Source Code	Yes	Our code is publicly available at https://github.com/TRAIS-Lab/GraSS. ... To address these issues, we developed a SJLT CUDA kernel [3] that optimizes the memory access patterns and minimizes thread contention to better exploit the underlying hardware capabilities. This kernel significantly reduces the overhead compared to its PyTorch implementation counterpart, resulting in substantial performance gains. As shown in Figure 4, for SJLT with s = 1, our CUDA implementation outperforms the highly optimized dense matrix multiplications for small projection problem sizes, while retaining the speedup of SJLT w.r.t. input sparsity. [3] The code is publicly available at https://github.com/TRAIS-Lab/sjlt.
Open Datasets	Yes	We conduct an ablation study on a simple 3-layer MLP trained on MNIST [LeCun, 1998]. ... (1) ResNet9 [He et al., 2016] with CIFAR2 [Krizhevsky and Hinton, 2009], and (2) Music Transformer [Huang et al., 2019] with MAESTRO [Hawthorne et al., 2019]. ... We consider a small language model, GPT2-small [Radford et al., 2019] fine-tuned on the WikiText dataset [Merity et al., 2016] ... Llama-3.1-8B-Instruct [Meta AI, 2024] with a random 1B-token subset of the Open Web Text dataset [Gokaslan et al., 2019]
Dataset Splits	Yes	Table 3: Model details used in the experiments. Columns: Models | Datasets (License) | Task | Parameter Size | Train Samples | Test Samples | Sequential Length. Rows: MLP | MNIST (CC BY-SA 3.0) | Image Classification | 0.11M | 5,000 | 500 | 1; ResNet9 | CIFAR2 (MIT) | Image Classification | 4.83M | 5,000 | 500 | 1; Music Transformer | MAESTRO (CC BY-NC-SA 4.0) | Music Generation | 13.3M | 5,000 | 178 | 1; GPT2-small | WikiText (CC BY-SA 3.0) | Text Generation | 124M | 4,656 | 481 | 512.
Hardware Specification	Yes	All the experiments in quantitative analysis are conducted on Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz with a single Nvidia A40 GPU with 48 GB memory. On the other hand, the qualitative analysis experiment is conducted on the VISTA cluster with one Grace Hopper (GH) node, where each GH node has one H200 GPU with 96 GB of HBM3 memory and one Grace CPU with 116 GB of LPDDR memory.
Software Dependencies	No	The paper mentions using PyTorch, AdamW optimizer, and the dattri library, but does not provide specific version numbers for these software components. For example, it mentions "recent updates to PyTorch" but no version.
Experiment Setup	Yes	For GPT2-small, we fine-tune the model on the WikiText dataset using the AdamW optimizer [Loshchilov and Hutter, 2019] with a learning rate of $5 \cdot 10^{-5}$ and no weight decay, training for 3 epochs. ... We pick the damping $\lambda$ for each setting (each model/dataset/compression method combination) via cross-validation grid search for LDS over $\lambda \in \{10^{-7}, 10^{-6}, 10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1, 10, 10^{2}\}$ on 10% of the test dataset, and evaluate the overall LDS result on the remaining 90% of the test dataset. ... We set the batch to be 7 that maximizes the usage of memory bandwidth for both LOGRA and FACTGRASS.
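
The damping selection quoted in the Experiment Setup row (grid search on 10% of the test set, evaluation on the remaining 90%) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: `split_test_indices`, `select_damping`, `score_fn`, and `toy_score` are hypothetical names, the random split and seed are assumptions, and the real LDS computation is replaced by a toy stand-in.

```python
import math
import random

# Damping grid quoted in the Experiment Setup row: 10^-7, ..., 10^2.
DAMPING_GRID = [10.0 ** k for k in range(-7, 3)]


def split_test_indices(n_test, val_fraction=0.1, seed=0):
    """Shuffle test indices and split them into (validation, evaluation)."""
    indices = list(range(n_test))
    random.Random(seed).shuffle(indices)  # assumption: a random split
    n_val = max(1, int(round(val_fraction * n_test)))
    return indices[:n_val], indices[n_val:]


def select_damping(score_fn, n_test, grid=DAMPING_GRID, val_fraction=0.1, seed=0):
    """Pick the damping with the best validation score, then score the rest.

    score_fn(damping, indices) is a hypothetical callable that returns a
    scalar quality score (for example a mean LDS) on the given test indices.
    """
    val_idx, eval_idx = split_test_indices(n_test, val_fraction, seed)
    best = max(grid, key=lambda lam: score_fn(lam, val_idx))
    return best, score_fn(best, eval_idx)


def toy_score(damping, indices):
    # Stand-in for LDS: peaks at damping = 1e-3 and ignores the indices.
    return -abs(math.log10(damping) + 3.0)


# 500 test samples, matching the MNIST and CIFAR2 rows of Table 3.
best_damping, held_out_score = select_damping(toy_score, n_test=500)
```

Keeping the selection split disjoint from the evaluation split means the reported score is not measured on the same samples that were used to tune the damping.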