Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
You Only Prune Once: Designing Calibration-Free Model Compression With Policy Learning
Authors: Ayan Sengupta, Siddhant Chaudhary, Tanmoy Chakraborty
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical analysis demonstrates that PruneNet can compress the LLaMA-2-7B model in just 15 minutes, achieving over 80% retention of its zero-shot performance with a 30% compression ratio, outperforming existing methods that retain only 75% performance. Furthermore, on complex multitask language understanding tasks, PruneNet demonstrates its robustness by preserving up to 80% performance of the original model... Table 1: A summary of the experimental results. ... Table 2 reports the zero-shot performance of LLaMA-2-7B and Phi-2 models after being compressed with PruneNet and SliceGPT (the best baseline) at different compression ratios. |
| Researcher Affiliation | Academia | Ayan Sengupta, Siddhant Chaudhary & Tanmoy Chakraborty, Department of Electrical Engineering, Indian Institute of Technology Delhi, India. EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1: Policy-Driven Model Compression Framework (PruneNet). Require: LLM with L layers, FFN1 weight matrices {W_l}_{l=1}^{L}, compression ratio r, policy learner parameters W_inter, W_proj, discount factor γ. Ensure: Compressed LLM with pruned FFN layers. 1: Initialize policy learner parameters 2: for each training step do ... |
| Open Source Code | Yes | The source code of PruneNet is made public at https://github.com/LCS2-IIITD/PruneNet. |
| Open Datasets | Yes | For the zero-shot performance evaluation, we use five commonsense reasoning tasks: PIQA (Bisk et al., 2020), WinoGrande (Sakaguchi et al., 2021), HellaSwag (Zellers et al., 2019), and ARC-e and ARC-c (Clark et al., 2018), using the LM Evaluation Harness suite (Gao et al., 2024), and the MMLU benchmark (Hendrycks et al., 2020). Recovery fine-tuning (RFT) is a common trick to regain the performance drop after compression. To understand the importance of RFT on the effectiveness of PruneNet, we report the zero-shot performance of compressed LLaMA and Phi-2 models after fine-tuning on the WikiText2 (Merity et al., 2016) dataset in Table 3. For fine-tuning, we use LoRA adapters (Hu et al., 2022) with rank 8. Interestingly, RFT has only a marginal impact of 1.5% on the compressed LLaMA model, which highlights the robustness of our method. Remarkably, the importance of RFT remains the same for a higher compression rate. On the other hand, with Phi-2, the performance drops after RFT in several cases. This result validates the robustness of PruneNet but also an appreciation for the pre-training objective of small language models such as Phi-2, which use specialized curated datasets for pre-training. |
| Dataset Splits | No | For the zero-shot performance evaluation, we use five commonsense reasoning tasks: PIQA (Bisk et al., 2020), WinoGrande (Sakaguchi et al., 2021), HellaSwag (Zellers et al., 2019), and ARC-e and ARC-c (Clark et al., 2018), using the LM Evaluation Harness suite (Gao et al., 2024), and the MMLU benchmark (Hendrycks et al., 2020). For fine-tuning, we use LoRA adapters (Hu et al., 2022) with rank 8. ... The WikiText (Merity et al., 2016) dataset... The Penn Treebank (PTB) (Marcus et al., 1993) dataset... The Alpaca (Taori et al., 2023) dataset... We use only up to 8000 samples from these datasets for recovery fine-tuning. The paper does not specify exact training/validation/test splits, percentages, or absolute sample counts for each split, but mentions using up to 8000 samples for fine-tuning. |
| Hardware Specification | Yes | All the experiments were performed on a single Nvidia A100-40GB GPU. |
| Software Dependencies | No | For the policy learner model, we consider the discount factor γ = 0.99 and use the AdamW (Loshchilov, 2017) optimizer with a learning rate of 5e-4 and a maximum of 20 episodes. ... For fine-tuning, we use LoRA adapters (Hu et al., 2022) with rank 8. The paper mentions the optimizer and adapters used, but does not provide specific version numbers for software libraries or frameworks (e.g., PyTorch, TensorFlow, Python version). |
| Experiment Setup | Yes | For the policy learner model, we consider the discount factor γ = 0.99 and use the AdamW (Loshchilov, 2017) optimizer with a learning rate of 5e-4 and a maximum of 20 episodes. |
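The pseudocode and setup rows above can be sketched as a toy policy-learning loop. This is a minimal, hypothetical illustration, not the authors' implementation: the per-row "FFN weights", the retained-energy reward, and the REINFORCE-style Bernoulli update are stand-ins; only the compression ratio (30%), discount factor (γ = 0.99), learning rate (5e-4), and episode budget (20) follow the reported setup.

```python
import math
import random

random.seed(0)

L_LAYERS = 4   # number of transformer layers (toy scale)
ROWS = 16      # rows per FFN weight matrix (toy scale)
RATIO = 0.30   # compression ratio r from the paper
GAMMA = 0.99   # discount factor from the paper
LR = 5e-4      # policy learning rate from the paper
EPISODES = 20  # maximum episodes from the paper

# Toy "FFN" weights: one value per row, standing in for W_l.
weights = [[random.gauss(0.0, 1.0) for _ in range(ROWS)] for _ in range(L_LAYERS)]
# Policy parameters: one keep-logit per row per layer (stand-in for W_inter, W_proj).
logits = [[0.0] * ROWS for _ in range(L_LAYERS)]

def keep_probs(layer_logits):
    """Sigmoid keep-probability for each FFN row."""
    return [1.0 / (1.0 + math.exp(-z)) for z in layer_logits]

def sample_mask(probs):
    """Keep the (1 - r) fraction of rows with the highest (noisy) scores."""
    n_keep = round((1.0 - RATIO) * len(probs))
    order = sorted(range(len(probs)),
                   key=lambda i: probs[i] + 0.1 * random.random(), reverse=True)
    mask = [0] * len(probs)
    for i in order[:n_keep]:
        mask[i] = 1
    return mask

def reward(layer_w, mask):
    """Fraction of row 'energy' retained: a toy calibration-free surrogate."""
    total = sum(w * w for w in layer_w)
    kept = sum(w * w for w, m in zip(layer_w, mask) if m)
    return kept / total

for _ in range(EPISODES):
    rewards, grads = [], []
    for l in range(L_LAYERS):
        probs = keep_probs(logits[l])
        mask = sample_mask(probs)
        rewards.append(reward(weights[l], mask))
        # Score-function gradient for Bernoulli keep decisions (REINFORCE-style).
        grads.append([m - p for m, p in zip(mask, probs)])
    # Discounted return accumulated from the last layer backwards.
    ret, returns = 0.0, [0.0] * L_LAYERS
    for l in reversed(range(L_LAYERS)):
        ret = rewards[l] + GAMMA * ret
        returns[l] = ret
    for l in range(L_LAYERS):
        for i in range(ROWS):
            logits[l][i] += LR * returns[l] * grads[l][i]

# Final pruning decision: one mask per layer, each keeping ~(1 - r) of the rows.
final_masks = [sample_mask(keep_probs(logits[l])) for l in range(L_LAYERS)]
print([sum(m) for m in final_masks])
```

Because the pruning decision is learned from the weights themselves rather than from activations on a calibration set, no calibration data appears anywhere in the loop, which mirrors the paper's calibration-free framing.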