Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Compress Large Language Models via Collaboration Between Learning and Matrix Approximation
Authors: Yuesen Liao, Zhiwei Li, Binrui Wu, Zihao Cheng, Su Zhao, Shuai Chen, Weizhong Zhang
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on Phi-3 and the LLama-2/3 family demonstrate the effectiveness of our method. Notably, it maintains over 95% zero-shot accuracy under 50% sparsity and achieves up to 2× inference speedup. Section 5 is titled "Experiments" and details the experimental setup and results. |
| Researcher Affiliation | Collaboration | (1) Fudan University, (2) Meituan Inc., (3) Shanghai Key Laboratory of Intelligent Information Processing. Fudan University is an academic institution, while Meituan Inc. is an industry company, indicating a collaboration. |
| Pseudocode | Yes | Algorithm 1 Bilevel Optimization Framework. We present the RPCA algorithm based on QR decomposition in Appendix C.3 Algorithm 2. |
| Open Source Code | Yes | Yes, we will make our experimental code available in the supplementary materials. |
| Open Datasets | Yes | We use C4 [27] as the training dataset, ... The main benchmarks include: 1) WikiText2 [24] perplexity, 2) zero-shot tasks (including PIQA [3], HellaSwag [43], Winogrande [28], OpenBookQA [25], RTE [35], BoolQ [5], ARC-e and ARC-c [6]), and 3) few-shot tasks, like MMLU [14]. |
| Dataset Splits | No | The paper mentions "We use C4 [27] as the training dataset" and "the inner-level optimization employs the fixed 32 samples as the calibration dataset." However, it does not specify explicit training/validation/test splits (e.g., percentages or exact counts) for the main datasets used in the experiments. |
| Hardware Specification | Yes | The experiments are all completed with one single 80GB NVIDIA A100. In addition, we test the CPU inference speedup of the pruned model on Intel(R) Xeon(R) Platinum 8369B CPU @ 2.90GHz with 32 CPU cores. |
| Software Dependencies | No | The paper mentions using the "Adam optimizer [16]" and the "DeepSparse engine" but does not provide specific version numbers for these or any other key software libraries, frameworks (e.g., Python, PyTorch, CUDA), or environments. |
| Experiment Setup | Yes | We use C4 [27] as the training dataset, with batch size set to 32, and length set to 256. In addition, the inner-level optimization employs the fixed 32 samples as the calibration dataset. Gamma is set from 0.05 to 0.005. In training, we use the Adam optimizer [16] and set the learning rate to 1e-2. |
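
For readers who want the values from the Experiment Setup row in machine-readable form, the following is a minimal sketch that records them in a Python dataclass. The class name `ReportedSetup` and all field names are hypothetical and do not come from the paper or its code. Reading "length" as sequence length is an assumption, and the shape of the Gamma schedule between 0.05 and 0.005 is not stated, so only the two endpoints are stored.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ReportedSetup:
    """Values quoted in the Experiment Setup row; field names are hypothetical."""

    train_dataset: str = "C4"
    batch_size: int = 32
    sequence_length: int = 256     # "length set to 256"; assumed to mean sequence length
    calibration_samples: int = 32  # fixed samples used by the inner-level optimization
    gamma_start: float = 0.05      # "Gamma is set from 0.05 to 0.005"
    gamma_end: float = 0.005       # schedule shape between the endpoints is not stated
    optimizer: str = "Adam"
    learning_rate: float = 1e-2


setup = ReportedSetup()

# Derived quantity: positions per batch under the sequence-length assumption.
tokens_per_batch = setup.batch_size * setup.sequence_length

print(asdict(setup))
print(tokens_per_batch)
```

Anything the table marks as unreported, such as software versions and dataset splits, is deliberately left out of the sketch rather than guessed.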