Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

LayerIF: Estimating Layer Quality for Large Language Models using Influence Functions

Authors: Hadi Askari, Shivanshu Gupta, Fei Wang, Anshuman Chhabra, Muhao Chen

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We demonstrate the utility of our scores by leveraging them for two downstream applications: (a) expert allocation in LoRA-MoE architectures and (b) layer-wise sparsity distribution for LLM pruning. Experiments across multiple LLM architectures demonstrate that our model-agnostic, influence-guided allocation leads to consistent gains in task performance.
Researcher Affiliation	Academia	University of California, Davis; University of California, Irvine; University of Southern California; University of South Florida
Pseudocode	Yes	A step-by-step walkthrough of our algorithm is provided in Appendix C for greater clarity. Concrete walk-through for expert allocation (Common Q dataset, 32-layer LLM, T=160 experts, β = 3)
Open Source Code	Yes	All code and reproducibility configurations are provided in Appendix I and we provide further information on statistical significance of our results in G. Our code has been anonymized and uploaded here: https://github.com/HadiAskari/Expert_Allocation and https://github.com/HadiAskari/LayerIF_Pruning_New.
Open Datasets	Yes	The datasets used include: MRPC [93], CoLA [94], ScienceQA [95], CommonsenseQA [96], and OpenBookQA [97]. For layer sparsity allocation for model pruning we again assess the post-pruning zero-shot performance of our models on several NLP tasks. Namely, BoolQ [98], Hellaswag [99], Winogrande [100], ARC Easy and ARC Challenge [101], RTE [94], and OpenBookQA [97].
Dataset Splits	No	For the layer-wise expert allocation application we train our models for 3 epochs and then compare zero-shot accuracy on the datasets using a protocol similar to [47]. For layer sparsity allocation for model pruning we again assess the post-pruning zero-shot performance of our models on several NLP tasks.
Hardware Specification	Yes	All of our experiments run on 8 NVIDIA RTX 6000 Ada GPUs.
Software Dependencies	No	The paper does not explicitly list specific software dependencies with version numbers. While it implicitly uses LLM frameworks, no versioned libraries or software components are mentioned.
Experiment Setup	Yes	For the layer-wise expert allocation application we train our models for 3 epochs... We experimented with 3 LAYERIF configurations: using all samples LAYERIF (ALL), using only positively influential training samples LAYERIF (+VE) and using the top 25% of the most influential training samples for our layer-wise summation: LAYERIF (TOP 25%). We prune our LLMs to 50% sparsity... We used a window-length of 7 and a polyorder of 3 for our filter.
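
The three LAYERIF configurations quoted in the Experiment Setup row differ only in which training samples enter the layer-wise summation. The following is a minimal sketch of that selection step, not the authors' implementation. The function name `aggregate_layer_scores`, the nested-list layout of the scores, and the toy values are all hypothetical. It also assumes that "most influential" means the largest signed score, which the quoted text does not specify.

```python
import math


def aggregate_layer_scores(influence, mode="all"):
    """Sum per-sample influence scores for each layer.

    influence[l][i] is the (hypothetical) influence score of training
    sample i at layer l. mode selects which samples are summed:
      "all"      -> every sample               (LAYERIF (ALL))
      "positive" -> samples with score > 0     (LAYERIF (+VE))
      "top25"    -> top 25% by signed score    (LAYERIF (TOP 25%))
    """
    totals = []
    for layer_scores in influence:
        if mode == "all":
            kept = list(layer_scores)
        elif mode == "positive":
            kept = [s for s in layer_scores if s > 0]
        elif mode == "top25":
            # Assumption: rank by signed value and keep at least one sample.
            k = max(1, math.ceil(0.25 * len(layer_scores)))
            kept = sorted(layer_scores, reverse=True)[:k]
        else:
            raise ValueError(f"unknown mode: {mode!r}")
        totals.append(sum(kept))
    return totals


# Toy scores for 2 layers and 4 training samples (made-up values).
toy = [
    [0.5, -0.25, 1.0, 0.25],
    [-1.0, 2.0, 0.5, -0.5],
]

for mode in ("all", "positive", "top25"):
    print(mode, aggregate_layer_scores(toy, mode))
```

The window length of 7 and polynomial order of 3 quoted in the same row match the parameter names of a Savitzky-Golay filter, but the quoted text does not name the filter, so that smoothing step is left out of the sketch.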