Accurate Retraining-free Pruning for Pretrained Encoder-based Language Models
Authors: Seungcheol Park, Hojun Choi, U Kang
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform extensive experiments on GLUE and SQuAD benchmarks to demonstrate the performance of K-prune. |
| Researcher Affiliation | Academia | 1Seoul National University, Seoul, South Korea 2Kim Jaechul Graduate School of AI, KAIST, Seoul, South Korea |
| Pseudocode | Yes | Algorithm 1 Knowledge-Preserving Mask Search (KPMS) |
| Open Source Code | Yes | Our source code is available at https://github.com/snudm-starlab/K-prune |
| Open Datasets | Yes | We evaluate the performance of compressing the pretrained BERT (Devlin et al., 2019) and DistilBERT (Sanh et al., 2019) models on GLUE (Wang et al., 2019), SQuAD v1.1 (Rajpurkar et al., 2016), and v2 (Rajpurkar et al., 2018) under diverse compression rates. |
| Dataset Splits | No | The paper uses well-known benchmark datasets like GLUE and SQuAD, which have predefined splits, but does not explicitly state the specific train/validation/test splits used for the experiments for reproducibility. It mentions using '100K tokens from the training dataset as a sample dataset' for the K-prune process, but this is not presented as a general validation split. |
| Hardware Specification | Yes | We use NVIDIA 1080 Ti for all experiments. |
| Software Dependencies | No | The paper states 'We use PyTorch (Paszke et al., 2019), and the weights of the pretrained models in Transformers (Wolf et al., 2020)' and 'We use a linear solver2 in PyTorch (Paszke et al., 2019) to solve Equations (10) and (11)', but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | We use 100K tokens from the training dataset as a sample dataset, and exploit the pretrained tokenizers in Transformers (Wolf et al., 2020) for counting. The size of the sample dataset is small compared to the GLUE and SQuAD datasets, e.g., around 0.64% of the MNLI (Williams et al., 2018) dataset. We fix random seeds from 0 to 4 and report the average performance of the 5 runs. We use two combinations of hyperparameters (γ, λ, µ) ∈ {(2, 0, 64), (2, 0.00025, 64)} for all experiments of K-prune. |
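The reported protocol of fixing random seeds 0 through 4 and averaging over the 5 runs can be sketched as below. This is an illustrative outline only, not the authors' code: `run_experiment` is a hypothetical stand-in for one K-prune evaluation run, and the score it returns is a placeholder.

```python
import random

def run_experiment(seed: int) -> float:
    """Hypothetical stand-in for a single K-prune evaluation run.

    The paper fixes the random seed per run for reproducibility;
    the returned value here is a placeholder 'score', not a real metric.
    """
    random.seed(seed)
    return 80.0 + random.random()

# Seeds fixed from 0 to 4, as described in the experiment setup.
scores = [run_experiment(seed) for seed in range(5)]

# The paper reports the average performance of the 5 runs.
average = sum(scores) / len(scores)
```

Fixing the seed at the start of each run makes every run individually reproducible, so the 5-run average is deterministic as well.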