Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Multi-Objective One-Shot Pruning for Large Language Models

Authors: Weiyu Chen, Hansi Yang, Yunhao Gou, Han Shi, Enliang Hu, Zhenguo Li, James Kwok

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments on various LLMs and sparsity levels show that MOSP outperforms baselines in navigating multi-objective trade-offs and provides a superior set of pruned models.
Researcher Affiliation	Collaboration	Weiyu Chen (1), Hansi Yang (1), Yunhao Gou (1, 2), Han Shi (3), Enliang Hu (4), Zhenguo Li (3), James T. Kwok (1). Affiliations: (1) The Hong Kong University of Science and Technology; (2) Southern University of Science and Technology; (3) Huawei Noah's Ark Lab; (4) Yunnan Normal University.
Pseudocode	Yes	Algorithm 1: PCG with vectorization [30] for the i-th task.
Open Source Code	No	The code will be released later.
Open Datasets	Yes	We consider three representative datasets to evaluate model performance across different domains: (1) General Text: C4 [36]. (2) Mathematical Reasoning: GSM8K [9]. (3) Code Generation: Code [5]. Following [15, 30], we use a calibration set of 128 segments (up to 2048 tokens) and evaluate using perplexity (PPL) on the test sets (the lower the better). C4 (Colossal Clean Crawled Corpus)[36]: ODC-BY License. This is a large-scale, cleaned version of the Common Crawl dataset, containing primarily English text. Following common practice [37, 15, 30], we use a subset of C4 dataset. We use the same data split as [30]. GSM8K (Grade School Math 8K)[9]: MIT License. A dataset comprising high-quality grade school mathematics word problems designed to evaluate the multi-step reasoning capabilities of models. We follow the official data split. Python Code[5]: CC-BY-4.0 License. An instruction-following dataset tailored for Python code generation, styled after the Alpaca dataset. We follow the official data split.
Dataset Splits	Yes	Following [15, 30], we use a calibration set of 128 segments (up to 2048 tokens) and evaluate using perplexity (PPL) on the test sets (the lower the better). Following common practice [37, 15, 30], we use a subset of C4 dataset. We use the same data split as [30]. GSM8K (Grade School Math 8K)[9]: MIT License. A dataset comprising high-quality grade school mathematics word problems designed to evaluate the multi-step reasoning capabilities of models. We follow the official data split. Python Code[5]: CC-BY-4.0 License. An instruction-following dataset tailored for Python code generation, styled after the Alpaca dataset. We follow the official data split.
Hardware Specification	Yes	We perform all experiments on a GPU server equipped with 8 NVIDIA A6000 GPUs, each with 48GB of memory. However, for each individual experiment, we utilize only a single GPU.
Software Dependencies	Yes	We use PyTorch version 2.0.0 [35].
Experiment Setup	Yes	We set $\alpha = 0.5$ and $p = 0.5$ without hyperparameter tuning. Additional experimental details are provided in Appendix D. Specifically, we set the initial penalty parameter $\rho_0 = 0.1$. We update $\rho$ every 3 iterations based on a step function that depends on the current value of $\rho_t$ and $s_t := |\mathrm{Supp}(D^{(t)}) \,\Delta\, \mathrm{Supp}(D^{(t-3)})|$. The term $s_t$ represents the number of elements in the symmetric difference between the support of $D$ at iteration $t$ and iteration $t-3$. Specifically, the update rule (50) is: $1.3\rho_t$ if $s_t \ge 0.1k$; $1.2\rho_t$ if $s_t \ge 0.005k$; $1.1\rho_t$ if $s_t \ge 1$, where $k$ is the total number of elements or parameters being considered for pruning. If $s_t = 0$, it indicates that $\rho$ is sufficiently large and the support has stabilized. For stage 2, the task-specific ADMM, we use a fixed $\rho = 0.5$. We set $\gamma$ to $0.01\,\mathrm{Tr}(X^\top X)$. We refine each model with 10 PCG iterations.
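
The penalty schedule quoted in the Experiment Setup row can be illustrated with a small sketch. This is not the authors' code: the function names `symmetric_difference_size` and `update_rho` are hypothetical, each threshold is assumed to be a "greater than or equal to" comparison checked from largest to smallest, and leaving the penalty unchanged when the support has not moved is an assumption based on the statement that a zero symmetric difference means the penalty is already large enough.

```python
def symmetric_difference_size(support_now, support_prev):
    """Count indices that appear in exactly one of the two supports."""
    return len(set(support_now) ^ set(support_prev))


def update_rho(rho, s_t, k):
    """Step-function penalty update (hypothetical helper, thresholds assumed >=).

    rho : current penalty parameter
    s_t : size of the symmetric difference between the current support
          and the support from 3 iterations earlier
    k   : total number of elements considered for pruning
    """
    if s_t >= 0.1 * k:
        return 1.3 * rho
    if s_t >= 0.005 * k:
        return 1.2 * rho
    if s_t >= 1:
        return 1.1 * rho
    # s_t == 0: support has stabilized, so the penalty is kept as-is (assumption).
    return rho


if __name__ == "__main__":
    rho = 0.1  # initial penalty parameter from the quoted setup
    k = 1000   # illustrative problem size, not from the paper
    s_t = symmetric_difference_size([0, 1, 2, 5], [1, 2, 3, 4])
    print(s_t, update_rho(rho, s_t, k))
```

In the quoted setup this update would be applied once every 3 iterations; the sketch leaves the surrounding ADMM loop out because the excerpt does not describe it.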