Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

DuoGPT: Training-free Dual Sparsity through Activation-aware Pruning in LLMs

Authors: Ruokai Yin, Yuhang Li, Donghyun Lee, Priyadarshini Panda

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Evaluations on LLaMA-2 and LLaMA-3 show that DuoGPT outperforms state-of-the-art structured pruning methods by up to 9.17% accuracy at an iso-speedup of 1.39 compared to the baseline dense model.
Researcher Affiliation	Academia	Ruokai Yin, Yuhang Li, Donghyun Lee, Priyadarshini Panda Yale University EMAIL
Pseudocode	Yes	Algorithm 1: The DuoGPT activation sparsity-aware pruning for one layer with target of pw unstructured sparsity.
Open Source Code	Yes	Code is available at GitHub.
Open Datasets	Yes	Unless otherwise stated, the calibration dataset consists of 128 2048-token samples, randomly selected from the C4 training dataset (Raffel et al. 2020). To assess model performance, we evaluate the perplexity (PPL) of our DuoGPT-pruned models on the WikiText2 dataset (Merity et al. 2016). Furthermore, we complement our evaluation by conducting 0-shot task classifications using the LM Eval Harness (Gao et al. 2021) across widely recognized downstream benchmarks: PIQA (Bisk et al. 2020), HellaSwag (Zellers et al. 2019), WinoGrande (Sakaguchi et al. 2021), ARC-easy, ARC-challenge (Clark et al. 2018), OpenBookQA (OBQA) (Mihaylov et al. 2018), and BoolQ (Clark et al. 2019).
Dataset Splits	Yes	Unless otherwise stated, the calibration dataset consists of 128 2048-token samples, randomly selected from the C4 training dataset (Raffel et al. 2020).
Hardware Specification	Yes	All calibration procedures and experiments are performed on 80GB NVIDIA A100 GPUs with offloading (two GPUs are specifically employed for zero-shot evaluations of 70B models).
Software Dependencies	Yes	We implement DuoGPT using PyTorch (Paszke 2019) and the Hugging Face Transformer library (Wolf et al. 2019) for efficient model and dataset management. All calibration procedures and experiments are performed on 80GB NVIDIA A100 GPUs with offloading (two GPUs are specifically employed for zero-shot evaluations of 70B models).
Experiment Setup	Yes	Unless otherwise stated, the calibration dataset consists of 128 2048-token samples, randomly selected from the C4 training dataset (Raffel et al. 2020). To assess model performance, we evaluate the perplexity (PPL) of our DuoGPT-pruned models on the WikiText2 dataset (Merity et al. 2016). Furthermore, we complement our evaluation by conducting 0-shot task classifications using the LM Eval Harness (Gao et al. 2021) across widely recognized downstream benchmarks: PIQA (Bisk et al. 2020), HellaSwag (Zellers et al. 2019), WinoGrande (Sakaguchi et al. 2021), ARC-easy, ARC-challenge (Clark et al. 2018), OpenBookQA (OBQA) (Mihaylov et al. 2018), and BoolQ (Clark et al. 2019). More detailed setups can be found in the Appendix.
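
The calibration setup quoted in the table (128 samples of 2048 tokens each, drawn at random from a training corpus) can be illustrated with a minimal sketch. This is not the authors' code: the function name `sample_calibration_windows`, its parameters, the seed, and the toy token lists standing in for tokenized C4 documents are all assumptions made for illustration. It also assumes that each sample is a contiguous window taken from one sufficiently long document, which the quoted text does not specify.

```python
import random


def sample_calibration_windows(token_docs, n_samples=128, seq_len=2048, seed=0):
    """Draw n_samples random contiguous windows of seq_len tokens.

    token_docs is a list of token-id lists (hypothetical stand-in for a
    tokenized corpus). Windows are drawn with replacement across documents.
    """
    rng = random.Random(seed)
    # Only documents with at least seq_len tokens can supply a full window.
    eligible = [doc for doc in token_docs if len(doc) >= seq_len]
    if not eligible:
        raise ValueError("no document is long enough for one window")
    windows = []
    for _ in range(n_samples):
        doc = rng.choice(eligible)
        # randint is inclusive on both ends, so the last valid start is allowed.
        start = rng.randint(0, len(doc) - seq_len)
        windows.append(doc[start:start + seq_len])
    return windows


# Toy data (assumption): three "documents" of integer token ids.
toy_docs = [list(range(5000)), list(range(100)), list(range(3000))]
calib = sample_calibration_windows(toy_docs)
```

Fixing the seed makes the calibration set repeatable across runs, which matters when comparing pruning results; the actual seed and sampling details used in the paper would have to be taken from its released code.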