Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Restoring Pruned Large Language Models via Lost Component Compensation

Authors: Zijian Feng, Hanzhang Zhou, Zixiao Zhu, Tianjiao Li, Chua Deryl, Lee Onn Mak, Gee Ng, Kezhi Mao

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments demonstrate that RestoreLCC consistently outperforms state-of-the-art baselines in both general and task-specific performance recovery, without compromising the sparsity or inference efficiency of pruned models. We empirically evaluate RestoreLCC against other performance restoration methods across all three types of pruned models (e.g., structured, semi-structured, and unstructured), on both general recovery and task-specific settings across a wide range of LLMs of different sizes.
Researcher Affiliation	Collaboration	(1) School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore; (2) Home Team Science and Technology Agency (HTX), Singapore
Pseudocode	No	The paper describes the methodology using text and mathematical equations (e.g., Eq. 1-8) and block diagrams (Figure 4), but does not include a distinct pseudocode or algorithm block.
Open Source Code	Yes	Code: https://github.com/zijian678/restorelcc/
Open Datasets	Yes	We conduct an empirical study using BoolQ [36], a widely used commonsense reasoning dataset. We follow SlimGPT and use the Alpaca instruction dataset [10] for tuning. The pruning ratio is set to 50% for unstructured and semi-structured methods, and 20% for structured pruning, with C4 [8] as the calibration dataset. We assess perplexity (PPL) of language modeling on the held-out WikiText [9] and accuracy of several commonsense reasoning benchmarks, including BoolQ [36], HellaSwag [40], WinoGrande [41], ARC-easy [42], ARC-challenge [42], RTE [43], and OpenBookQA [44], all evaluated using the lm-eval-harness framework [45].
Dataset Splits	Yes	The dataset is divided into training and validation subsets in a 7:3 ratio, and models are trained for 100 epochs based on empirical observations. For task-specific recovery on the BoolQ dataset, only 200 samples are sufficient to reach an accuracy of approximately 76.
Hardware Specification	Yes	All experiments are conducted on a single H100 GPU. (1 H100 GPU, torch.bfloat16 precision, max-length=512, batch-size=8, Alpaca Dataset).
Software Dependencies	No	The paper mentions using 'Python hook functions' and refers to the 'open-source Pyvene package' and the 'TransformerLens package' without providing specific version numbers for these or for other core software dependencies such as Python or PyTorch. PyTorch usage is only implied via 'torch.bfloat16 precision', and no version number is given.
Experiment Setup	Yes	The batch size is set to 8. The learning rate is {1e-4, 1e-5}. The max sequence length is 512. All experiments are conducted on a single H100 GPU. For LoRA and DoRA, we use the same settings: α = 16 and rank = 8. Regarding the applied modules, we try two configurations: (1) ["v_proj", "o_proj"], which tunes only the head output matrices; and (2) ["q_proj", "k_proj", "v_proj", "o_proj"], which tunes all head matrices. For LoFiT, we experiment with 10%, 20%, and 30% of the heads and report the best results.
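
For readers who want to re-run the baselines, the values quoted in the Experiment Setup and Dataset Splits rows can be collected into a small configuration sketch. This is an illustration only, not the authors' code: the names `BaselineRun`, `baseline_grid`, and `split_train_val` are hypothetical, and the rows do not say whether every learning rate was paired with every module configuration (the full cross product below is an assumption) or how the 7:3 split was shuffled (the seeded shuffle is also an assumption).

```python
import random
from dataclasses import dataclass
from itertools import product

# Values quoted in the Experiment Setup row.
BATCH_SIZE = 8
MAX_SEQ_LEN = 512
LEARNING_RATES = (1e-4, 1e-5)
LORA_ALPHA = 16
LORA_RANK = 8
MODULE_CONFIGS = (
    ("v_proj", "o_proj"),                      # head output matrices only
    ("q_proj", "k_proj", "v_proj", "o_proj"),  # all head matrices
)


@dataclass(frozen=True)
class BaselineRun:
    """Hypothetical container for one LoRA/DoRA baseline run."""
    learning_rate: float
    target_modules: tuple
    batch_size: int = BATCH_SIZE
    max_seq_len: int = MAX_SEQ_LEN
    lora_alpha: int = LORA_ALPHA
    lora_rank: int = LORA_RANK


def baseline_grid():
    """Assumed full cross product of learning rates and module configurations."""
    return [
        BaselineRun(learning_rate=lr, target_modules=modules)
        for lr, modules in product(LEARNING_RATES, MODULE_CONFIGS)
    ]


def split_train_val(samples, train_fraction=0.7, seed=0):
    """Split samples 7:3 into training and validation subsets.

    The seeded shuffle is an assumption; the source only states the ratio.
    """
    items = list(samples)
    random.Random(seed).shuffle(items)
    cut = round(len(items) * train_fraction)
    return items[:cut], items[cut:]


if __name__ == "__main__":
    for run in baseline_grid():
        print(run)
    train, val = split_train_val(range(200))
    print(len(train), len(val))
```

Software versions are deliberately left out of the sketch, since the Software Dependencies row records that the paper does not state them.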