Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
DuoGPT: Training-free Dual Sparsity through Activation-aware Pruning in LLMs
Authors: Ruokai Yin, Yuhang Li, Donghyun Lee, Priyadarshini Panda
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluations on LLaMA-2 and LLaMA-3 show that DuoGPT outperforms state-of-the-art structured pruning methods by up to 9.17% accuracy at an iso-speedup of 1.39× compared to the baseline dense model. |
| Researcher Affiliation | Academia | Ruokai Yin, Yuhang Li, Donghyun Lee, Priyadarshini Panda (Yale University) |
| Pseudocode | Yes | Algorithm 1: The DuoGPT activation sparsity-aware pruning for one layer with target of p_w unstructured sparsity. |
| Open Source Code | Yes | Code is available at GitHub. |
| Open Datasets | Yes | Unless otherwise stated, the calibration dataset consists of 128 2048-token samples, randomly selected from the C4 training dataset (Raffel et al. 2020). To assess model performance, we evaluate the perplexity (PPL) of our DuoGPT-pruned models on the WikiText2 dataset (Merity et al. 2016). Furthermore, we complement our evaluation by conducting 0-shot task classifications using the LM Eval Harness (Gao et al. 2021) across widely recognized downstream benchmarks: PIQA (Bisk et al. 2020), HellaSwag (Zellers et al. 2019), WinoGrande (Sakaguchi et al. 2021), ARC-easy, ARC-challenge (Clark et al. 2018), OpenBookQA (OBQA) (Mihaylov et al. 2018), and BoolQ (Clark et al. 2019). |
| Dataset Splits | Yes | Unless otherwise stated, the calibration dataset consists of 128 2048-token samples, randomly selected from the C4 training dataset (Raffel et al. 2020). |
| Hardware Specification | Yes | All calibration procedures and experiments are performed on 80GB NVIDIA A100 GPUs with offloading (two GPUs are specifically employed for zero-shot evaluations of 70B models). |
| Software Dependencies | Yes | We implement DuoGPT using PyTorch (Paszke 2019) and the Hugging Face Transformer library (Wolf et al. 2019) for efficient model and dataset management. All calibration procedures and experiments are performed on 80GB NVIDIA A100 GPUs with offloading (two GPUs are specifically employed for zero-shot evaluations of 70B models). |
| Experiment Setup | Yes | Unless otherwise stated, the calibration dataset consists of 128 2048-token samples, randomly selected from the C4 training dataset (Raffel et al. 2020). To assess model performance, we evaluate the perplexity (PPL) of our DuoGPT-pruned models on the WikiText2 dataset (Merity et al. 2016). Furthermore, we complement our evaluation by conducting 0-shot task classifications using the LM Eval Harness (Gao et al. 2021) across widely recognized downstream benchmarks: PIQA (Bisk et al. 2020), HellaSwag (Zellers et al. 2019), WinoGrande (Sakaguchi et al. 2021), ARC-easy, ARC-challenge (Clark et al. 2018), OpenBookQA (OBQA) (Mihaylov et al. 2018), and BoolQ (Clark et al. 2019). More detailed setups can be found in the Appendix. |
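
The calibration setup quoted in the table (128 samples of 2048 tokens each, randomly selected from a training corpus) can be illustrated with a minimal sketch. This is not the authors' code: the function name `sample_calibration_windows`, the seed, the toy corpus, and the "pick a random document, then a random window inside it" strategy are all assumptions made for illustration. The report only states the sample count, the sample length, and that selection from C4 is random.

```python
import random


def sample_calibration_windows(documents, n_samples=128, seq_len=2048, seed=0):
    """Draw n_samples windows of seq_len tokens from pre-tokenized documents.

    Hypothetical helper: `documents` is a list of token-id lists standing in
    for a tokenized training corpus such as C4.
    """
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    # Only documents long enough to hold a full window are eligible.
    eligible = [doc for doc in documents if len(doc) >= seq_len]
    if not eligible:
        raise ValueError("no document has at least seq_len tokens")
    windows = []
    for _ in range(n_samples):
        doc = rng.choice(eligible)
        start = rng.randint(0, len(doc) - seq_len)  # randint is inclusive
        windows.append(doc[start:start + seq_len])
    return windows


# Toy stand-in corpus (assumption): 50 documents of 1000 to 5900 token ids.
toy_corpus = [
    [(i * 7 + j) % 32000 for j in range(1000 + 100 * i)]
    for i in range(50)
]

calib = sample_calibration_windows(toy_corpus)
print(len(calib), len(calib[0]))  # 128 2048
```

Fixing the seed makes the selected windows repeatable across runs, which matters when a calibration set is described only as "randomly selected".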