Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

A Simple Linear Patch Revives Layer-Pruned Large Language Models

Authors: Xinrui Chen, Haoli Bai, Tao Yuan, Ruikang Liu, Kang Zhao, Xianzhi Yu, Lu Hou, Tian Guan, Yonghong He, Chun Yuan

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our empirical results demonstrate the effectiveness of LINEARPATCH across diverse LLMs and tasks. For example, on the question answering benchmark, LINEARPATCH preserves up to 94.15% of the performance when pruning 5 layers from LLaMA-3-8B, significantly outperforming state-of-the-art methods such as LLM-Streamline (90.84%).
Researcher Affiliation	Collaboration	Xinrui Chen1, Haoli Bai2, Tao Yuan2, Ruikang Liu1, Kang Zhao2, Xianzhi Yu2, Lu Hou2, Tian Guan1, Yonghong He1, Chun Yuan1; 1Shenzhen International Graduate School, Tsinghua University; 2Huawei Technologies; EMAIL, EMAIL
Pseudocode	No	The paper describes the method and its steps in Section 3 'LINEARPATCH: the Ultimate Recipe' using textual descriptions and mathematical equations, but does not include a distinct pseudocode or algorithm block.
Open Source Code	Yes	Code is available at https://github.com/chenxinrui-tsinghua/LinearPatch.
Open Datasets	Yes	For PPL, we evaluate language modeling on WikiText-2 (WIKI-2) [38], C4 [39], and PTB [36]. For MMLU, we report five-shot accuracy on the full benchmark [22]. For QA, we evaluate on nine commonsense QA tasks: ARC-Challenge (ARC-c), ARC-Easy (ARC-e) [15], BoolQ [14], HellaSwag (HeSw) [56], PIQA [6], WinoGrande (WG) [1], WSC273 (WSC) [28], Race-high (Race-h) [27] and CoPA [41].
Dataset Splits	Yes	To determine pruned layers and initialize channel-wise scaling parameters, we use 128 randomly sampled sentences with sequence length 2048 from WikiText-2 for calibration. ... For fine-tuning LINEARPATCH, we use AdamW with a learning rate of 1e-4, training for one epoch on 5,000 WikiText-2 sentences of length 2048.
Hardware Specification	Yes	All experiments are conducted on a single NVIDIA V100 GPU with 24GB memory.
Software Dependencies	Yes	Our implementation is based on PyTorch. ... For PPL and QA benchmarks, we use the lm_eval library (version 0.4.4) from https://github.com/EleutherAI/lm-evaluation-harness.
Experiment Setup	Yes	To determine pruned layers and initialize channel-wise scaling parameters, we use 128 randomly sampled sentences with sequence length 2048 from WikiText-2 for calibration. Ablation studies on the calibration set size and distillation dataset size are reported in Appendix C, calibration dataset in Appendix D and Appendix E, respectively. For fine-tuning LINEARPATCH, we use AdamW with a learning rate of 1e-4, training for one epoch on 5,000 WikiText-2 sentences of length 2048.
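
For readers trying to reproduce the setup quoted in the table above, the calibration sampling can be sketched roughly as follows. This is an illustrative sketch only, not the authors' code: it assumes the "128 randomly sampled sentences with sequence length 2048" are random contiguous windows drawn from a tokenized WikiText-2 stream, which the quoted text does not confirm. The function name sample_calibration_windows, the seed parameter, and the keys of FINETUNE_CONFIG are hypothetical; only the numeric values come from the quoted passages.

```python
import random


def sample_calibration_windows(token_ids, num_samples=128, seq_len=2048, seed=0):
    """Draw `num_samples` random contiguous windows of `seq_len` tokens.

    Assumption: calibration samples are fixed-length windows cut from one
    long token stream. The paper's actual sampling code may differ.
    """
    if len(token_ids) < seq_len:
        raise ValueError("token stream is shorter than one window")
    rng = random.Random(seed)  # local RNG so the draw is repeatable
    max_start = len(token_ids) - seq_len
    windows = []
    for _ in range(num_samples):
        start = rng.randint(0, max_start)  # randint is inclusive on both ends
        windows.append(token_ids[start:start + seq_len])
    return windows


# Fine-tuning hyperparameters as quoted in the table; key names are illustrative.
FINETUNE_CONFIG = {
    "optimizer": "AdamW",
    "learning_rate": 1e-4,
    "epochs": 1,
    "num_train_sequences": 5000,
    "seq_len": 2048,
}


if __name__ == "__main__":
    # Stand-in for a tokenized WikiText-2 stream (hypothetical data).
    fake_tokens = list(range(100_000))
    calib = sample_calibration_windows(fake_tokens)
    print(len(calib), len(calib[0]))
```

Fixing the seed makes the calibration set repeatable across runs, which matters here because the quoted setup does not state which seed, if any, was used.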