Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Týr-the-Pruner: Structural Pruning LLMs via Global Sparsity Distribution Optimization

Authors: Guanchen Li, Yixing Xu, Zeping Li, Ji Liu, Xuanwu Yin, Dong Li, Emad Barsoum

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experimental results show that Týr-the-Pruner achieves state-of-the-art structural pruning, retaining 97% of the dense model's performance while removing a challenging 50% of Llama-3.1-70B's parameters. 3 Experiments
Researcher Affiliation	Industry	Guanchen Li, Yixing Xu, Zeping Li, Ji Liu, Xuanwu Yin, Dong Li, Emad Barsoum Advanced Micro Devices, Inc. (AMD) EMAIL
Pseudocode	Yes	The algorithmic procedures for local pruning, supernet construction, evolutionary search, and the overall Týr-the-Pruner framework are detailed in Algorithms 1 to 4 of Section A.2.
Open Source Code	No	Answer: [No] Justification: At this stage, the experimental setup and details are sufficient to guarantee reproducibility. Further materials will be made available in future updates.
Open Datasets	Yes	For calibration, we consider FineWeb [31]... validated on the WikiText2 [27] test set. To evaluate the impact of compression across various downstream tasks, we report 0-shot accuracy on ARC [6], BoolQ [5], HellaSwag [51], OpenBookQA [28], RTE [43], and WinoGrande [33] tasks, as well as 5-shot accuracy on the MMLU [13] benchmark.
Dataset Splits	Yes	We use perplexity as one evaluation metric for language comprehension performance [9], validated on the WikiText2 [27] test set. To evaluate the impact of compression across various downstream tasks, we report 0-shot accuracy on ARC [6], BoolQ [5], HellaSwag [51], OpenBookQA [28], RTE [43], and WinoGrande [33] tasks, as well as 5-shot accuracy on the MMLU [13] benchmark.
Hardware Specification	Yes	All experiments for Týr-the-Pruner were conducted on 4 AMD Instinct MI250 (64GB) Accelerators, with models less than 13B parameters running on a single accelerator.
Software Dependencies	No	We implement Týr-the-Pruner with PyTorch [30] and leverage the Hugging Face Transformers and Datasets libraries [47] to manage models and datasets.
Experiment Setup	Yes	The prune-and-search process consists of 4 iterations, where the sparsity interval at the i-th iteration is set to $12.5\%/2^{i-1}$. In each iteration, we explore 50 generations with 128 offspring candidates per generation. The sparsity shifts of the attention or FFN layers are independent to ensure the consistency of the sparsity interval granularity. Candidate validation is performed using the distillation-inspired metric with vocabulary logits. We follow [36] to enhance validation efficiency: the 128 offspring are first validated on 2K tokens, and the top 16 are selected. These 16 survivors are then validated on 16K tokens, from which the top 4 are selected, and finally, the best one is validated and selected on 128K tokens.
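
The interval schedule and staged validation quoted in the Experiment Setup row can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `sparsity_intervals`, `staged_selection`, `STAGES`, and the `loss_fn` callback are hypothetical names, the interval is read as 12.5% divided by 2^(i-1), and "2K", "16K", and "128K" tokens are assumed to be multiples of 1024. The evolutionary mutation step and the distillation-inspired metric are not modeled; `loss_fn` stands in for any score where lower is better.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def sparsity_intervals(base: float = 12.5, iterations: int = 4) -> List[float]:
    """Sparsity interval (in percent) for iterations i = 1..iterations.

    Assumes the interval at iteration i is base / 2**(i - 1).
    """
    return [base / 2 ** (i - 1) for i in range(1, iterations + 1)]


# (token budget, survivors kept) per validation stage.
# Assumption: "K" is taken to mean 1024 tokens.
STAGES: Tuple[Tuple[int, int], ...] = (
    (2 * 1024, 16),    # all offspring scored on 2K tokens, keep top 16
    (16 * 1024, 4),    # survivors scored on 16K tokens, keep top 4
    (128 * 1024, 1),   # finalists scored on 128K tokens, keep the best
)


def staged_selection(
    candidates: Sequence[T],
    loss_fn: Callable[[T, int], float],
    stages: Sequence[Tuple[int, int]] = STAGES,
) -> T:
    """Return the best candidate after progressively costlier validation.

    loss_fn(candidate, n_tokens) is a hypothetical scoring callback;
    lower values are treated as better.
    """
    survivors = list(candidates)
    for n_tokens, keep in stages:
        # Score every remaining candidate once at this token budget,
        # then keep only the `keep` lowest-loss candidates.
        survivors = sorted(survivors, key=lambda c: loss_fn(c, n_tokens))[:keep]
    return survivors[0]


print(sparsity_intervals())  # [12.5, 6.25, 3.125, 1.5625]
```

With 128 offspring, this schedule costs 128 + 16 + 4 = 148 candidate evaluations per generation, and only 4 of them use the most expensive token budget.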