Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Zeroth-Order Adaptive Neuron Alignment Based Pruning without Re-Training

Authors: Elia Cunegatti, Leonardo Lucio Custode, Giovanni Iacca

TMLR 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We test our method over 300 test cases with four LLM families, three sparsity ratios, and ten language tasks (three language modeling and seven zero-shot datasets), showing how it consistently outperforms the latest state-of-the-art methods in terms of performance-runtime trade-off.
Researcher Affiliation	Collaboration	Elia Cunegatti EMAIL University of Trento, Italy Leonardo Lucio Custode EMAIL Independent Researcher Giovanni Iacca EMAIL University of Trento, Italy
Pseudocode	Yes	Figure 3: Left: Overall NeuronAl top-up pruning procedure. Right: GetBestNeuronAL sub-routine used in both block- and row-selection stages.
Open Source Code	Yes	The code is available at https://github.com/eliacunegatti/NeuroAL.
Open Datasets	Yes	Language Modeling Datasets: To measure the models perplexity on Language Modeling datasets, we use the following three datasets: (1) WikiText2 (Merity et al., 2017), (2) Colossal Clean Common Crawl (C4) (Raffel et al., 2020), and (3) Penn Treebank (PTB). Zero-Shot Tasks: To assess more thoroughly how the different pruning algorithms affect the models capabilities, we employ the following 7 datasets: (1) Recognizing Textual Entailment (RTE) (Dagan et al., 2006; Bar Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009), (2) WinoGrande (Sakaguchi et al., 2021), (3) BoolQ (Clark et al., 2019), (4) HellaSwag (Zellers et al., 2019), (5) ARC-e (Clark et al., 2018), (6) ARC-c (Clark et al., 2018), (7) OBQA (Mihaylov et al., 2018)
Dataset Splits	Yes	For all the pruning algorithms that use calibration data (i.e., multiflow, Wanda, and SparseGPT), we use 128 samples from the C4 dataset, as in (Frantar & Alistarh, 2023; Sun et al., 2023; Yin et al., 2024). ... For both C and Cλ, we use the same seed (0) for the calibration set, i.e., Cλ contains the first 8 elements of C.
Hardware Specification	Yes	All the experiments have been run on NVIDIA A100 GPUs, both with 40 and 80 GB. ... The evaluation consists of the end-to-end token generation and has been done over an Intel i9-10980XE CPU using 18 cores.
Software Dependencies	No	The paper mentions 'inference pipeline based on DeepSparse (Neural Magic, 2021) ONNXRuntime backends' but does not specify version numbers for these or other software libraries.
Experiment Setup	Yes	For OWL, we set the hyperparameters to the values that are used mostly in the original paper, hence M = 5 and λ = 0.08; we do the same for AlphaPruning, setting ϵ = 0.3. ... In the experiments, we set λset = [0.01, 0.02, 0.03, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.12, 0.15, 0.20, 0.25] for the block step, while for the row step, we also added 0.0 (in case of no performance improvement). ... For both C and Cλ, we use the same seed (0) for the calibration set
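
The Dataset Splits and Experiment Setup rows describe a seeded calibration set C of 128 samples, a subset Cλ made of its first 8 elements, and two λ grids. The following is a minimal Python sketch of how such a setup might be expressed; it is not the authors' code. The helper name `sample_calibration`, the `pool_size` value, and the use of Python's `random` module are assumptions made for illustration only. The λ values, the seed, and the sizes 128 and 8 are taken from the rows above.

```python
import random

# λ grid for the block step, as quoted in the Experiment Setup row.
LAMBDA_SET_BLOCK = [0.01, 0.02, 0.03, 0.05, 0.06, 0.07, 0.08, 0.09,
                    0.1, 0.12, 0.15, 0.20, 0.25]

# The row step also allows 0.0 (the "no performance improvement" case).
LAMBDA_SET_ROW = [0.0] + LAMBDA_SET_BLOCK


def sample_calibration(pool_size, n_samples=128, n_lambda=8, seed=0):
    """Hypothetical helper: pick sample indices for C and derive C_lambda.

    pool_size is an assumed number of candidate samples; the real pipeline
    draws from the C4 dataset, which is not loaded here.
    """
    rng = random.Random(seed)  # fixed seed makes the draw repeatable
    calib = rng.sample(range(pool_size), n_samples)  # indices forming C
    calib_lambda = calib[:n_lambda]  # C_lambda = first 8 elements of C
    return calib, calib_lambda


# Example draw from an assumed pool of 10,000 candidate samples.
calib, calib_lambda = sample_calibration(pool_size=10_000)
```

Because Cλ is a prefix of C and both come from the same seeded draw, rerunning the sketch with the same seed yields the same two index lists, which is the property the quoted passage relies on.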