In defense of parameter sharing for model-compression
Authors: Aditya Desai, Anshumali Shrivastava
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we comprehensively assess the trade-off between memory and accuracy across RPS, pruning techniques, and building smaller models. Our findings demonstrate that RPS, which is both data- and model-agnostic, consistently outperforms smaller models and all moderately informed pruning strategies, such as MAG, SNIP, SYNFLOW, and GRASP, across the entire compression range. This advantage becomes particularly pronounced in higher compression scenarios. Notably, even when compared to highly informed pruning techniques like Lottery Ticket Rewinding (LTR), RPS exhibits superior performance in high compression settings. This points to the inherent capacity advantage that RPS enjoys over sparse models. Theoretically, we establish RPS as a superior technique in terms of memory-efficient representation when compared to pruning for linear models. This paper argues in favor of a paradigm shift towards RPS-based models. During our rigorous evaluation of RPS, we identified issues in the state-of-the-art RPS technique ROAST, specifically regarding stability (ROAST's sensitivity to initialization hyperparameters, often leading to divergence) and Pareto-continuity (ROAST's inability to recover the accuracy of the original model at zero compression). We provably address both of these issues. We refer to the modified RPS, which incorporates our improvements, as STABLE-RPS. (A minimal code sketch of the parameter-sharing idea follows this table.) |
| Researcher Affiliation | Collaboration | Aditya Desai, Department of Computer Science, Rice University, Houston, TX 77054, apd10@cs.rice.edu; Anshumali Shrivastava, Department of Computer Science, Rice University & Third AI Corp., Houston, TX 77054, anshumali@cs.rice.edu |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states that "All of this tradeoff-data will be made public" but does not provide a concrete link or explicit statement about the release of the source code for the methodology described in the paper. |
| Open Datasets | Yes | CIFAR-10 (RESNET-20) ... CIFAR-100 (VGG-11) (Table 1) ... CIFAR-10, CIFAR-100, TINY-IMAGENET (Table 2) ... We ran experiments on the PPI dataset (Agrawal et al., 2018) and the OGB-ARXIV dataset (Hu et al., 2020) with a Graph Attention Network (Velickovic et al., 2017) model. ... We experiment with the Criteo-Kaggle dataset |
| Dataset Splits | No | The paper lists datasets and training hyperparameters but does not provide specific details about the training, validation, or test dataset splits (e.g., percentages, sample counts, or explicit statements about using standard splits with citations). |
| Hardware Specification | Yes | The results presented in this section summarize over 1300 experiments performed on V100 and Quadro RTX 8000 GPUs. |
| Software Dependencies | No | The paper mentions using the "DGL library" but does not provide specific version numbers for it or any other key software components or libraries used in the experiments. |
| Experiment Setup | Yes | Table 2: hyperparameters for both models RESNET20 and VGG11 ... Base learning rate 0.1 ... Milestones 80, 120 ... Learning-rate drop 0.1 ... Batch size 128 ... Total epochs 160 ... The standard deviation for the RPS array is set to 0.01 for the RESNET20 model and 0.05 for the VGG11 model. ... Pruning iterations: we use 100 iterations for SNIP and SYNFLOW. (A code sketch of this training schedule follows the table.) |
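To make the RPS idea quoted above concrete, below is a minimal, hypothetical sketch of hash-style parameter sharing: every virtual weight of a layer is mapped by a fixed random index (with a random sign) into one small shared array, so parameter memory scales with the array length rather than the model size. The class name, array size, and layer shapes are illustrative assumptions, not the authors' ROAST/STABLE-RPS implementation.

```python
import torch
import torch.nn as nn

class RPSLinear(nn.Module):
    """Linear layer whose virtual weight matrix is materialized from a small
    shared parameter array via fixed random indices. Illustrative sketch of
    random parameter sharing, not the paper's ROAST/STABLE-RPS code."""

    def __init__(self, in_features, out_features, shared_array, seed=0):
        super().__init__()
        self.shared = shared_array  # one 1-D nn.Parameter shared by all layers
        n = in_features * out_features
        g = torch.Generator().manual_seed(seed)
        # Fixed random slot for every virtual weight, plus a random sign so
        # weights that collide in a slot are not biased in the same direction.
        self.register_buffer("idx", torch.randint(len(shared_array), (n,), generator=g))
        self.register_buffer("sign", (torch.randint(2, (n,), generator=g) * 2 - 1).float())
        self.shape = (out_features, in_features)

    def forward(self, x):
        # Recover the virtual weights on the fly from the shared array.
        w = (self.shared[self.idx] * self.sign).view(self.shape)
        return nn.functional.linear(x, w)

# One shared array backs every layer, so parameter memory is the array size,
# independent of the virtual model's size; std 0.01 follows the Table 2 note.
shared = nn.Parameter(torch.randn(4096) * 0.01)
layer1 = RPSLinear(784, 256, shared)
layer2 = RPSLinear(256, 10, shared)
out = layer2(torch.relu(layer1(torch.randn(8, 784))))
```

Because every layer draws from the same array, the compression ratio is set by the array length alone, which is what makes the technique data- and model-agnostic in the sense quoted above.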
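The hyperparameters quoted in the Experiment Setup row map onto a standard step-decay schedule; a short sketch follows. The optimizer choice (SGD with momentum) is an assumption on our part, as the quoted table does not name one.

```python
import torch

model = torch.nn.Linear(10, 10)  # stand-in for RESNET20 / VGG11
# Base learning rate 0.1; SGD with momentum 0.9 is assumed, not quoted.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# Milestones 80, 120 with learning-rate drop 0.1 is a step-decay schedule.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[80, 120], gamma=0.1
)

for epoch in range(160):  # total epochs 160; loaders would use batch size 128
    ...  # one training pass over the data
    scheduler.step()
```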