Accurate Retraining-free Pruning for Pretrained Encoder-based Language Models

Authors: Seungcheol Park, Hojun Choi, U Kang

ICLR 2024

Reproducibility Variable Result LLM Response
Research Type Experimental We perform extensive experiments on GLUE and SQuAD benchmarks to demonstrate the performance of K-prune.
Researcher Affiliation Academia Seoul National University, Seoul, South Korea; Kim Jaechul Graduate School of AI, KAIST, Seoul, South Korea
Pseudocode Yes Algorithm 1 Knowledge-Preserving Mask Search (KPMS)
Open Source Code Yes Our source code is available at https://github.com/snudm-starlab/K-prune
Open Datasets Yes We evaluate the performance of compressing the pretrained BERT (Devlin et al., 2019) and DistilBERT (Sanh et al., 2019) models on GLUE (Wang et al., 2019), SQuAD v1.1 (Rajpurkar et al., 2016), and v2 (Rajpurkar et al., 2018) under diverse compression rates.
Dataset Splits No The paper uses well-known benchmark datasets such as GLUE and SQuAD, which have predefined splits, but it does not explicitly state which train/validation/test splits were used in the experiments. It mentions using '100K tokens from the training dataset as a sample dataset' for the K-prune process, but this is not presented as a general validation split.
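The "100K tokens from the training dataset" detail can be made concrete with a minimal sketch. This is an assumption about how such a token-budget sample might be drawn; the whitespace tokenizer below is a placeholder, not the pretrained tokenizer the paper uses.

```python
# Hedged sketch: collect training sentences until a token budget
# (100K tokens in the paper) is reached. The whitespace split is a
# placeholder for the paper's pretrained tokenizer.
def sample_by_token_budget(train_sentences, budget=100_000):
    sample, n_tokens = [], 0
    for sentence in train_sentences:
        tokens = sentence.split()  # placeholder tokenizer
        if n_tokens + len(tokens) > budget:
            break  # stop before exceeding the token budget
        sample.append(sentence)
        n_tokens += len(tokens)
    return sample, n_tokens
```

Note that such a sample is drawn from the training split only, which is why it does not constitute a validation split.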
Hardware Specification Yes We use NVIDIA 1080 Ti for all experiments.
Software Dependencies No The paper states 'We use PyTorch (Paszke et al., 2019), and the weights of the pretrained models in Transformers (Wolf et al., 2020)' and 'We use a linear solver in PyTorch (Paszke et al., 2019) to solve Equations (10) and (11)', but does not provide specific version numbers for these software components.
Experiment Setup Yes We use 100K tokens from the training dataset as a sample dataset, and exploit the pretrained tokenizers in Transformers (Wolf et al., 2020) for counting. The size of the sample dataset is small compared to the GLUE and SQuAD datasets, e.g. around 0.64% of the MNLI (Williams et al., 2018) dataset. We fix random seeds from 0 to 4 and report the average performance of the 5 runs. We use two combinations of hyperparameters (γ, λ, µ) ∈ {(2, 0, 64), (2, 0.00025, 64)} for all experiments of K-prune.
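The reporting protocol quoted above (seeds 0 to 4, mean over 5 runs, two (γ, λ, µ) settings) can be sketched as follows. `run_experiment` is a hypothetical placeholder for one prune-and-evaluate run, not the actual K-prune pipeline.

```python
import random
import statistics

# Hypothetical stand-in for one prune-and-evaluate run; the real
# K-prune evaluation is not reproduced here.
def run_experiment(seed, gamma, lam, mu):
    random.seed(seed)              # fix the random seed, as in the paper
    return 80.0 + random.random()  # placeholder metric in [80, 81)

# The paper reports the average over seeds 0..4 for each of two
# hyperparameter combinations (γ, λ, µ).
HYPERPARAMS = [(2, 0.0, 64), (2, 0.00025, 64)]

for gamma, lam, mu in HYPERPARAMS:
    scores = [run_experiment(seed, gamma, lam, mu) for seed in range(5)]
    print(f"(γ, λ, µ)=({gamma}, {lam}, {mu}): "
          f"mean={statistics.mean(scores):.2f}")
```

Fixing the seed at the start of each run makes every run reproducible on its own, while the 5-run average smooths out seed-dependent variance.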