Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

DuoGPT: Training-free Dual Sparsity through Activation-aware Pruning in LLMs

Authors: Ruokai Yin, Yuhang Li, Donghyun Lee, Priyadarshini Panda

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Evaluations on LLaMA-2 and LLaMA-3 show that DuoGPT outperforms state-of-the-art structured pruning methods by up to 9.17% accuracy at an iso-speedup of 1.39× compared to the baseline dense model.
Researcher Affiliation | Academia | Ruokai Yin, Yuhang Li, Donghyun Lee, Priyadarshini Panda Yale University EMAIL
Pseudocode | Yes | Algorithm 1 The DuoGPT activation sparsity-aware pruning for one layer with target of p_w unstructured sparsity.
Open Source Code | Yes | Code is available at GitHub.
Open Datasets | Yes | Unless otherwise stated, the calibration dataset consists of 128 2048-token samples, randomly selected from the C4 training dataset (Raffel et al. 2020). To assess model performance, we evaluate the perplexity (PPL) of our DuoGPT-pruned models on the WikiText2 dataset (Merity et al. 2016). Furthermore, we complement our evaluation by conducting 0-shot task classifications using the LM Eval Harness (Gao et al. 2021) across widely recognized downstream benchmarks: PIQA (Bisk et al. 2020), HellaSwag (Zellers et al. 2019), WinoGrande (Sakaguchi et al. 2021), ARC-easy, ARC-challenge (Clark et al. 2018), OpenBookQA (OBQA) (Mihaylov et al. 2018), and BoolQ (Clark et al. 2019).
Dataset Splits | Yes | Unless otherwise stated, the calibration dataset consists of 128 2048-token samples, randomly selected from the C4 training dataset (Raffel et al. 2020).
Hardware Specification | Yes | All calibration procedures and experiments are performed on 80GB NVIDIA A100 GPUs with offloading (two GPUs are specifically employed for zero-shot evaluations of 70B models).
Software Dependencies | Yes | We implement DuoGPT using PyTorch (Paszke 2019) and the Hugging Face Transformer library (Wolf et al. 2019) for efficient model and dataset management. All calibration procedures and experiments are performed on 80GB NVIDIA A100 GPUs with offloading (two GPUs are specifically employed for zero-shot evaluations of 70B models).
Experiment Setup | Yes | Unless otherwise stated, the calibration dataset consists of 128 2048-token samples, randomly selected from the C4 training dataset (Raffel et al. 2020). To assess model performance, we evaluate the perplexity (PPL) of our DuoGPT-pruned models on the WikiText2 dataset (Merity et al. 2016). Furthermore, we complement our evaluation by conducting 0-shot task classifications using the LM Eval Harness (Gao et al. 2021) across widely recognized downstream benchmarks: PIQA (Bisk et al. 2020), HellaSwag (Zellers et al. 2019), WinoGrande (Sakaguchi et al. 2021), ARC-easy, ARC-challenge (Clark et al. 2018), OpenBookQA (OBQA) (Mihaylov et al. 2018), and BoolQ (Clark et al. 2019). More detailed setups can be found in the Appendix.
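
The experiment setup quoted above names two concrete procedures: drawing 128 calibration samples of 2048 tokens each, and scoring pruned models by perplexity. The sketch below illustrates both in plain Python. It is not the authors' code: the function names, the fixed seed, and the assumption that the corpus is already available as one flat list of token ids are all hypothetical, chosen only to make the arithmetic concrete.

```python
import math
import random
from typing import List, Sequence


def sample_calibration_windows(
    token_ids: Sequence[int],
    num_samples: int = 128,   # sample count quoted in the report
    seq_len: int = 2048,      # tokens per sample quoted in the report
    seed: int = 0,            # hypothetical: the report does not state a seed
) -> List[List[int]]:
    """Draw random contiguous windows of `seq_len` tokens from a token stream."""
    if len(token_ids) < seq_len:
        raise ValueError("token stream is shorter than one window")
    rng = random.Random(seed)
    last_start = len(token_ids) - seq_len
    windows = []
    for _ in range(num_samples):
        start = rng.randint(0, last_start)  # randint is inclusive on both ends
        windows.append(list(token_ids[start:start + seq_len]))
    return windows


def perplexity(token_nlls: Sequence[float]) -> float:
    """Perplexity as exp of the mean per-token negative log-likelihood (nats)."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))


if __name__ == "__main__":
    # Stand-in token stream; a real run would tokenize C4 training text instead.
    stream = list(range(300_000))
    calib = sample_calibration_windows(stream)
    print(len(calib), len(calib[0]))
    # A model that assigns probability 1/2 to every token has perplexity 2.
    print(perplexity([math.log(2)] * 10))
```

Windows may overlap in this sketch because each start position is drawn independently. In practice the per-token losses passed to `perplexity` would come from the pruned model's outputs on the WikiText2 evaluation text.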