Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
You Only Prune Once: Designing Calibration-Free Model Compression With Policy Learning
Authors: Ayan Sengupta, Siddhant Chaudhary, Tanmoy Chakraborty
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical analysis demonstrates that PruneNet can compress the LLaMA-2-7B model in just 15 minutes, achieving over 80% retention of its zero-shot performance with a 30% compression ratio, outperforming existing methods that retain only 75% performance. Furthermore, on complex multitask language understanding tasks, PruneNet demonstrates its robustness by preserving up to 80% performance of the original model... Table 1: A summary of the experimental results. ... Table 2 reports the zero-shot performance of LLaMA-2-7B and Phi-2 models after being compressed with PruneNet and SliceGPT (the best baseline) at different compression ratios. |
| Researcher Affiliation | Academia | Ayan Sengupta, Siddhant Chaudhary & Tanmoy Chakraborty, Department of Electrical Engineering, Indian Institute of Technology Delhi, India. EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1: Policy-Driven Model Compression Framework (PruneNet). Require: LLM with L layers, FFN1 weight matrices {W_l}_{l=1}^{L}, compression ratio r, policy learner parameters W_inter, W_proj, discount factor γ. Ensure: Compressed LLM with pruned FFN layers. 1: Initialize policy learner parameters 2: for each training step do ... |
| Open Source Code | Yes | The source code of PruneNet is made public at https://github.com/LCS2-IIITD/PruneNet. |
| Open Datasets | Yes | For the zero-shot performance evaluation, we use five commonsense reasoning tasks: PIQA (Bisk et al., 2020), WinoGrande (Sakaguchi et al., 2021), HellaSwag (Zellers et al., 2019), and ARC-e and ARC-c (Clark et al., 2018), using the LM Evaluation Harness suite (Gao et al., 2024), and the MMLU benchmark (Hendrycks et al., 2020). Recovery fine-tuning (RFT) is a common trick to regain the performance drop after compression. To understand the importance of RFT on the effectiveness of PruneNet, we report the zero-shot performance of compressed LLaMA and Phi-2 models after fine-tuning on the WikiText2 (Merity et al., 2016) dataset in Table 3. For fine-tuning, we use LoRA adapters (Hu et al., 2022) with rank 8. Interestingly, RFT has only a marginal impact of 1.5% on the compressed LLaMA model, which highlights the robustness of our method. Remarkably, the importance of RFT remains the same for a higher compression rate. On the other hand, with Phi-2, the performance drops after RFT in several cases. This result validates the robustness of PruneNet but also an appreciation for the pre-training objective of small language models such as Phi-2, which use specialized curated datasets for pre-training. |
| Dataset Splits | No | For the zero-shot performance evaluation, we use five commonsense reasoning tasks: PIQA (Bisk et al., 2020), WinoGrande (Sakaguchi et al., 2021), HellaSwag (Zellers et al., 2019), and ARC-e and ARC-c (Clark et al., 2018), using the LM Evaluation Harness suite (Gao et al., 2024), and the MMLU benchmark (Hendrycks et al., 2020). For fine-tuning, we use LoRA adapters (Hu et al., 2022) with rank 8. ... The WikiText (Merity et al., 2016) dataset... The Penn Treebank (PTB) (Marcus et al., 1993) dataset... The Alpaca (Taori et al., 2023) dataset... We use only up to 8000 samples from these datasets for recovery fine-tuning. The paper does not specify exact training/validation/test splits, percentages, or absolute sample counts for each split, but mentions using up to 8000 samples for fine-tuning. |
| Hardware Specification | Yes | All the experiments were performed on a single Nvidia A100-40GB GPU. |
| Software Dependencies | No | For the policy learner model, we consider the discount factor γ = 0.99 and use the AdamW (Loshchilov, 2017) optimizer with a learning rate of 5e-4 and a maximum of 20 episodes. ... For fine-tuning, we use LoRA adapters (Hu et al., 2022) with rank 8. The paper mentions the optimizer and adapters used, but does not provide specific version numbers for software libraries or frameworks (e.g., PyTorch, TensorFlow, Python version). |
| Experiment Setup | Yes | For the policy learner model, we consider the discount factor γ = 0.99 and use the AdamW (Loshchilov, 2017) optimizer with a learning rate of 5e-4 and a maximum of 20 episodes. |
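The pseudocode and setup rows above can be sketched as a toy policy-learning loop. This is a minimal, hypothetical illustration, not the authors' implementation: the per-row "FFN weights", the retained-energy reward, and the REINFORCE-style Bernoulli update are stand-ins; only the compression ratio (30%), discount factor (γ = 0.99), learning rate (5e-4), and episode budget (20) follow the reported setup.

```python
import math
import random

random.seed(0)

L_LAYERS = 4   # number of transformer layers (toy scale)
ROWS = 16      # rows per FFN weight matrix (toy scale)
RATIO = 0.30   # compression ratio r from the paper
GAMMA = 0.99   # discount factor from the paper
LR = 5e-4      # policy learning rate from the paper
EPISODES = 20  # maximum episodes from the paper

# Toy "FFN" weights: one value per row, standing in for W_l.
weights = [[random.gauss(0.0, 1.0) for _ in range(ROWS)] for _ in range(L_LAYERS)]
# Policy parameters: one keep-logit per row per layer (stand-in for W_inter, W_proj).
logits = [[0.0] * ROWS for _ in range(L_LAYERS)]

def keep_probs(layer_logits):
    """Sigmoid keep-probability for each FFN row."""
    return [1.0 / (1.0 + math.exp(-z)) for z in layer_logits]

def sample_mask(probs):
    """Keep the (1 - r) fraction of rows with the highest (noisy) scores."""
    n_keep = round((1.0 - RATIO) * len(probs))
    order = sorted(range(len(probs)),
                   key=lambda i: probs[i] + 0.1 * random.random(), reverse=True)
    mask = [0] * len(probs)
    for i in order[:n_keep]:
        mask[i] = 1
    return mask

def reward(layer_w, mask):
    """Fraction of row 'energy' retained: a toy calibration-free surrogate."""
    total = sum(w * w for w in layer_w)
    kept = sum(w * w for w, m in zip(layer_w, mask) if m)
    return kept / total

for _ in range(EPISODES):
    rewards, grads = [], []
    for l in range(L_LAYERS):
        probs = keep_probs(logits[l])
        mask = sample_mask(probs)
        rewards.append(reward(weights[l], mask))
        # Score-function gradient for Bernoulli keep decisions (REINFORCE-style).
        grads.append([m - p for m, p in zip(mask, probs)])
    # Discounted return accumulated from the last layer backwards.
    ret, returns = 0.0, [0.0] * L_LAYERS
    for l in reversed(range(L_LAYERS)):
        ret = rewards[l] + GAMMA * ret
        returns[l] = ret
    for l in range(L_LAYERS):
        for i in range(ROWS):
            logits[l][i] += LR * returns[l] * grads[l][i]

# Final pruning decision: one mask per layer, each keeping ~(1 - r) of the rows.
final_masks = [sample_mask(keep_probs(logits[l])) for l in range(L_LAYERS)]
print([sum(m) for m in final_masks])
```

Because the pruning decision is learned from the weights themselves rather than from activations on a calibration set, no calibration data appears anywhere in the loop, which mirrors the paper's calibration-free framing.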