Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

ModHiFi: Identifying High Fidelity predictive components for Model Modification

Authors: Dhruva Kashyap, Chaitanya Murti, Pranav K Nayak, Tanay Narshana, Chiranjib Bhattacharyya

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We empirically validate our framework by addressing four central questions: (Q1) Existence of HiFi components. Do a small subset of components exist that can achieve high fidelity? (Q2) Effectiveness of HiFi components. Do HiFi components accurately represent those components important for the predictive performance? (Q3) Effectiveness of using HiFi components for pruning using ModHiFi-P. Does ModHiFi-P result in better accuracy-sparsity tradeoff compared to structured pruning algorithms for vision tasks and language modeling tasks? (Q4) Effectiveness of using HiFi components for machine unlearning using ModHiFi-U. Is it possible to perform machine unlearning, as posed by Jia et al. [32], without finetuning? If so, how does ModHiFi-U compare to their method?
Researcher Affiliation	Collaboration	Dhruva Kashyap CSA, IISc EMAIL; Chaitanya Murti HP Inc. AI Lab EMAIL; Pranav Nayak CSA, IISc EMAIL; Tanay Narshana Google EMAIL; Chiranjib Bhattacharyya CSA, IISc EMAIL
Pseudocode	Yes	Algorithm 1 ModHiFi-X (on page 4) and Algorithm 2 ViT-Edit-X: Structured Editing for Transformers (on page 12).
Open Source Code	Yes	Our code is available at https://github.com/DhruvaKashyap/modhifi. The code and the instructions to reproduce the experiments are provided at the GitHub: https://github.com/DhruvaKashyap/modhifi.
Open Datasets	Yes	For CIFAR10/100 [35], we use synthetically generated images as detailed in Appendix C.3. We use Alpaca [74] (a synthetic dataset) and WikiText-2 [48] as calibration data for NLP tasks following related literature [2, 46]. For experiments with the CIFAR10 dataset, we use CIFAR5M, a dataset containing 6 million synthetic CIFAR-10-like images sampled from a Diffusion model and labeled by a Big-Transfer model [55], which we randomly sample 10,000 samples from each of the 10 classes to create our dataset. For experiments with the CIFAR100 dataset, we use CIFAR100-DDPM [23], which we randomly downsample to contain 1,000 samples from each of the 100 classes.
Dataset Splits	Yes	For CIFAR10/100 [35], we use synthetically generated images as detailed in Appendix C.3. We use Alpaca [74] (a synthetic dataset) and WikiText-2 [48] as calibration data for NLP tasks following related literature [2, 46]. For experiments with the CIFAR10 dataset, we use CIFAR5M, a dataset containing 6 million synthetic CIFAR-10-like images sampled from a Diffusion model and labeled by a Big-Transfer model [55], which we randomly sample 10,000 samples from each of the 10 classes to create our dataset. For experiments with the CIFAR100 dataset, we use CIFAR100-DDPM [23], which we randomly downsample to contain 1,000 samples from each of the 100 classes. Unless otherwise specified, the algorithms use WikiText-2 for calibration, with 128 samples of length 1024.
Hardware Specification	Yes	We sufficiently describe the compute resources used for our experiments in Section 5 and the appendices referred to within the section, specifically, Appendix C.6. Table 13 details the hardware we use to conduct our experiments... CPU Model Name AMD EPYC 9654 96-Core Processor... GPU Model name Instinct MI210 GPU(s) 4. Inference times are measured on a machine running Ubuntu 20.04.1 LTS with kernel 5.15.0-91-generic on the hardware specified in Table 14. The software stack used for inference consists of Python 3.12.8, PyTorch 2.5.1, and Torchvision 0.20.1 for CUDA 12.3. Table 14: Specifications of GPU and CPU hardware used for computing inference time... CPU Model Name Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz... GPU Model name NVIDIA GeForce RTX 2080 Ti CUDA version 12.3 GPU(s) 8.
Software Dependencies	Yes	Our software stack comprises Python 3.12.8, PyTorch 2.5.1 built for ROCm 6.2, and torchvision version 0.20.1 built for ROCm 6.2. The software stack used for inference consists of Python 3.12.8, PyTorch 2.5.1, and Torchvision 0.20.1 for CUDA 12.3.
Experiment Setup	Yes	All details to understand the results and reproduce the results are provided in Section 5 and the appendices mentioned therein. We typically set the percentile of removed components to be between 0.01 to 0.2. We randomly select 2% of our synthetic samples to select data for vision tasks and select 128 samples for NLP tasks. Pretraining procedure: For CIFAR10 and CIFAR100, we train models using SGD with a momentum factor of 0.9 and weight decay of 5 × 10⁻⁴, for 200 epochs using Cosine Annealing step sizes with an initial learning rate of 0.1. ImageNet post training: For ImageNet, we use off-the-shelf pretrained models from Torchvision [58]. We train the model for 3 epochs after each iteration of pruning with learning rates of 0.1, 0.01, 0.001. After the pruning ends, we finally train the network for 160 epochs with a batch size of 512. We use the SGD Optimizer with a momentum factor of 0.9 and weight decay of 1 × 10⁻⁴ and start with an LR warm-up for 10 epochs, followed by Cosine Annealed step sizes with an initial learning rate of 0.1 with Cutmix and Mixup augmentations.
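
The learning-rate schedules quoted in the Experiment Setup row can be illustrated with a small sketch. This is not the authors' code. It uses the standard cosine annealing formula, and it makes several assumptions the quoted text does not confirm: a final learning rate of 0 (`eta_min`), a linear warm-up shape, and the 10 warm-up epochs counted as part of the 160 ImageNet epochs. The function name `cosine_annealed_lr` is hypothetical.

```python
import math


def cosine_annealed_lr(epoch, total_epochs, base_lr=0.1, warmup_epochs=0, eta_min=0.0):
    """Learning rate used during `epoch` (0-indexed).

    Assumptions (not stated in the source): linear warm-up, eta_min = 0,
    and warm-up epochs counted inside `total_epochs`.
    """
    if warmup_epochs > 0 and epoch < warmup_epochs:
        # Linear warm-up: ramp from base_lr / warmup_epochs up to base_lr.
        return base_lr * (epoch + 1) / warmup_epochs
    # Cosine annealing over the epochs that remain after warm-up.
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * progress))


# CIFAR10/100 pretraining as quoted: 200 epochs, initial learning rate 0.1.
cifar_schedule = [cosine_annealed_lr(e, 200) for e in range(200)]

# ImageNet post-pruning training as quoted: 160 epochs, 10 warm-up epochs.
imagenet_schedule = [cosine_annealed_lr(e, 160, warmup_epochs=10) for e in range(160)]
```

Under these assumptions the CIFAR schedule starts at 0.1 and decays monotonically, reaching half the initial rate at the midpoint of training. Momentum (0.9) and weight decay (5 × 10⁻⁴ for CIFAR, 1 × 10⁻⁴ for ImageNet) belong to the optimizer and are not modeled here.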