Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

DenoiseRotator: Enhance Pruning Robustness for LLMs via Importance Concentration

Authors: Tianteng Gu, Bei Liu, Bo Xiao, Ke Zeng, Jiacheng Liu, Yanmin Qian

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Evaluated on LLaMA3, Qwen2.5, and Mistral models under 50% unstructured and 2:4 semi-structured sparsity, DenoiseRotator consistently improves perplexity and zero-shot accuracy. For instance, on LLaMA3-70B pruned with SparseGPT at 2:4 semi-structured sparsity, DenoiseRotator reduces the perplexity gap to the dense model by 58%, narrowing the degradation from 8.1 to 3.4 points. ... Extensive Empirical Evaluation: We demonstrate the effectiveness of DenoiseRotator on a range of open-source LLMs, including Mistral (7B) [24], LLaMA3 (8B, 70B) [17], and Qwen2.5 (7B, 14B, 32B, 72B) [42]. Our method consistently and significantly improves pruning performance across unstructured and semi-structured sparsity patterns, reducing perplexity and improving accuracy compared to baseline pruning methods.
Researcher Affiliation	Collaboration	Tianteng Gu (Shanghai Jiao Tong University); Bei Liu (HKUST); Bo Xiao (Meituan); Ke Zeng (Meituan); Jiacheng Liu (HKUST); Yanmin Qian (Shanghai Jiao Tong University)
Pseudocode	Yes	The following pseudocode outlines an example workflow of integrating DenoiseRotator (highlighted in blue) with layer-wise pruning methods: Algorithm 1: DenoiseRotator Integration Pipeline
Open Source Code	Yes	Codes are available at https://github.com/Axel-gu/DenoiseRotator.
Open Datasets	Yes	Our evaluation encompasses both language generation, measured by perplexity on the WikiText-2 dataset [32], and five widely-used zero-shot tasks: PIQA [5], WinoGrande [36], HellaSwag [43], ARC-e, and ARC-c [8]. Following established practices, we utilize the LM Evaluation Harness [15] with default settings for all evaluations. ... We use the same calibration dataset as in prior work, consisting of 128 sequences with a context length of 2048 tokens, sampled from the C4 training set [35].
Dataset Splits	Yes	Our evaluation encompasses both language generation, measured by perplexity on the WikiText-2 dataset [32], and five widely-used zero-shot tasks: PIQA [5], WinoGrande [36], HellaSwag [43], ARC-e, and ARC-c [8]. ... For consistency, we use the same calibration dataset as in prior work, consisting of 128 sequences with a context length of 2048 tokens, sampled from the C4 training set [35]. ... Dataset: 4096 samples of length 2048 from the WikiText2 train set
Hardware Specification	Yes	For instance, training on LLaMA 3 70B with SparseGPT took approximately 28 hours and utilized around 30 GB of GPU memory on a single NVIDIA A100 GPU. ... We evaluate the average inference time of a single Transformer layer in LLaMA3-8B on 32 sequences of length 2048 using an NVIDIA A100 GPU.
Software Dependencies	No	The paper mentions PyTorch several times (e.g., "torch.bfloat16 precision", "PyTorch [34] provides automatic differentiation support for QR factorization through torch.qr.", "torch.sparse.to_sparse_semi_structured function") but does not provide specific version numbers for PyTorch or other libraries.
Experiment Setup	Yes	We trained the orthogonal matrices of DenoiseRotator using the Adam optimizer with a learning rate of 0.01 over 2000 steps. All computations, except for QR decomposition, were performed in torch.bfloat16 precision to enhance efficiency. ... For consistency, we use the same calibration dataset as in prior work, consisting of 128 sequences with a context length of 2048 tokens, sampled from the C4 training set [35]. We apply a uniform sparsity level across all decoder layers and evaluate two types of sparsity: unstructured 50% sparsity and semi-structured 2:4 sparsity. ... Fine-tuning Configuration: Method: LoRA; Alpha: 32.0; Dropout: 0.1; Dataset: 4096 samples of length 2048 from the WikiText2 train set; Learning Rate: 2e-4; Weight Decay: 1e-2; Optimizer: Adam; Learning Rate Scheduler: Linear; Warm-up Steps: 400
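
Illustration (not from the paper): the two sparsity patterns named in the Experiment Setup row, unstructured 50% and semi-structured 2:4, can be contrasted with a small pure-Python sketch. It assumes plain weight magnitude as the importance score, which is a stand-in and not the criterion used by SparseGPT or by DenoiseRotator. The function names `prune_2_of_4` and `prune_unstructured` and the example weights are hypothetical.

```python
def prune_2_of_4(row):
    """Semi-structured 2:4 sparsity: in every group of four consecutive
    weights, keep the two with the largest magnitude and zero the rest.
    Magnitude is an assumed importance score, used only for illustration."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Positions (within the group) of the two largest-magnitude weights.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


def prune_unstructured(row, sparsity=0.5):
    """Unstructured sparsity: zero the smallest-magnitude fraction of the
    whole row, with no constraint on where the zeros fall."""
    n_drop = int(len(row) * sparsity)
    drop = set(sorted(range(len(row)), key=lambda i: abs(row[i]))[:n_drop])
    return [0.0 if i in drop else w for i, w in enumerate(row)]


# Hypothetical weights: the first group of four is uniformly large,
# the second group is uniformly small.
weights = [0.9, -0.8, 0.7, -0.6, 0.1, 0.2, -0.3, 0.05]
semi = prune_2_of_4(weights)
unstructured = prune_unstructured(weights)
```

Both results contain exactly four zeros, but the 2:4 pattern must discard two large weights from the first group and retain two small ones from the second, whereas the unstructured pattern is free to drop the entire second group. This extra constraint is one reason 2:4 pruning generally costs more accuracy than unstructured pruning at the same 50% sparsity.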