Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Multi-Objective One-Shot Pruning for Large Language Models
Authors: Weiyu Chen, Hansi Yang, Yunhao Gou, Han Shi, Enliang Hu, Zhenguo Li, James Kwok
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on various LLMs and sparsity levels show that MOSP outperforms baselines in navigating multi-objective trade-offs and provides a superior set of pruned models. |
| Researcher Affiliation | Collaboration | Weiyu Chen¹, Hansi Yang¹, Yunhao Gou¹˒², Han Shi³, Enliang Hu⁴, Zhenguo Li³, James T. Kwok¹. ¹The Hong Kong University of Science and Technology; ²Southern University of Science and Technology; ³Huawei Noah's Ark Lab; ⁴Yunnan Normal University |
| Pseudocode | Yes | Algorithm 1 PCG with vectorization [30] for ith task. |
| Open Source Code | No | The code will be released later. |
| Open Datasets | Yes | We consider three representative datasets to evaluate model performance across different domains: (1) General Text: C4 [36]. (2) Mathematical Reasoning: GSM8K [9]. (3) Code Generation: Code [5]. Following [15, 30], we use a calibration set of 128 segments (up to 2048 tokens) and evaluate using perplexity (PPL) on the test sets (the lower the better). C4 (Colossal Clean Crawled Corpus) [36]: ODC-BY License. This is a large-scale, cleaned version of the Common Crawl dataset, containing primarily English text. Following common practice [37, 15, 30], we use a subset of C4 dataset. We use the same data split as [30]. GSM8K (Grade School Math 8K) [9]: MIT License. A dataset comprising high-quality grade school mathematics word problems designed to evaluate the multi-step reasoning capabilities of models. We follow the official data split. Python Code [5]: CC-BY-4.0 License. An instruction-following dataset tailored for Python code generation, styled after the Alpaca dataset. We follow the official data split. |
| Dataset Splits | Yes | Following [15, 30], we use a calibration set of 128 segments (up to 2048 tokens) and evaluate using perplexity (PPL) on the test sets (the lower the better). Following common practice [37, 15, 30], we use a subset of C4 dataset. We use the same data split as [30]. GSM8K (Grade School Math 8K) [9]: MIT License. A dataset comprising high-quality grade school mathematics word problems designed to evaluate the multi-step reasoning capabilities of models. We follow the official data split. Python Code [5]: CC-BY-4.0 License. An instruction-following dataset tailored for Python code generation, styled after the Alpaca dataset. We follow the official data split. |
| Hardware Specification | Yes | We perform all experiments on a GPU server equipped with 8 NVIDIA A6000 GPUs, each with 48GB of memory. However, for each individual experiment, we utilize only a single GPU. |
| Software Dependencies | Yes | We use PyTorch version 2.0.0 [35]. |
| Experiment Setup | Yes | We set $\alpha = 0.5$ and $p = 0.5$ without hyperparameter tuning. Additional experimental details are provided in Appendix D. Specifically, we set the initial penalty parameter $\rho_0 = 0.1$. We update $\rho$ every 3 iterations based on a step function that depends on the current value of $\rho_t$ and $s_t := \lvert \mathrm{Supp}(D^{(t)}) \,\Delta\, \mathrm{Supp}(D^{(t-3)}) \rvert$. The term $s_t$ represents the number of elements in the symmetric difference between the support of $D$ at iteration $t$ and iteration $t-3$. Specifically, the update rule is: $1.3\rho_t$ if $s_t \ge 0.1k$; $1.2\rho_t$ if $s_t \ge 0.005k$; $1.1\rho_t$ if $s_t \ge 1$ (Eq. 50), where $k$ is the total number of elements or parameters being considered for pruning. If $s_t = 0$, it indicates that $\rho$ is sufficiently large and the support has stabilized. For stage 2, the task-specific ADMM, we use a fixed $\rho = 0.5$. We set $\gamma$ to $0.01\,\mathrm{Tr}(X^\top X)$. We refine each model with 10 PCG iterations. |
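
The penalty schedule quoted in the Experiment Setup row can be sketched in a few lines of Python. This is an illustrative reading, not the authors' code (which has not been released): the function names `support_change` and `update_rho` are hypothetical, each threshold is read as a "greater than or equal to" comparison, and leaving ρ unchanged when the support has not moved is an assumption based on the statement that ρ is then "sufficiently large".

```python
def support(weights):
    """Indices of the nonzero entries of a flat weight sequence."""
    return {i for i, w in enumerate(weights) if w != 0}


def support_change(d_now, d_prev):
    """s_t: size of the symmetric difference between two supports."""
    return len(support(d_now) ^ support(d_prev))


def update_rho(rho, s_t, k):
    """Step-function update for the penalty parameter rho.

    rho : current penalty value (rho_t)
    s_t : support change since the previous check (3 iterations earlier)
    k   : total number of elements considered for pruning
    Thresholds are read as '>=' (assumption).
    """
    if s_t >= 0.1 * k:
        return 1.3 * rho
    if s_t >= 0.005 * k:
        return 1.2 * rho
    if s_t >= 1:
        return 1.1 * rho
    return rho  # s_t == 0: support has stabilized, keep rho (assumption)


# Usage: start from rho_0 = 0.1 and apply one update with made-up supports.
d_prev = [0.0, 1.2, 0.0, -0.7]
d_now = [0.3, 1.1, 0.0, 0.0]
rho = update_rho(0.1, support_change(d_now, d_prev), k=len(d_now))
```

The larger the change in support between checks, the faster ρ grows, so the penalty tightens quickly while the sparsity pattern is still moving and stops growing once it settles.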