Compressing Transformers: Features Are Low-Rank, but Weights Are Not!

Authors: Hao Yu, Jianxin Wu

AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that our methods can compress both vanilla transformers and their variants in CV and NLP. All the experiments were conducted with PyTorch. Table 1 shows the results of compressing DeiT-B and Swin-B & Swin-L. We tested model accuracy on the ImageNet-1K validation dataset.
Researcher Affiliation | Academia | Hao Yu, Jianxin Wu*, State Key Laboratory for Novel Software Technology, Nanjing University, China. yuh@lamda.nju.edu.cn, wujx2001@nju.edu.cn
Pseudocode | Yes | Algorithm 1 Atomic Feature Mimicking. Input: the original model M with weights W and bias b in the i-th layer, the proxy dataset D, and a pre-set rank k. Output: two compressed FC layers with weights W1 and W2, and biases b1 and b2. 1: for each sample x in D do; 2: forward propagate M(x) to obtain the output feature y in the i-th layer and update E[y y^T] and E[y]; 3: end for; 4: calculate the eigenvectors U based on Eq. 6 and Eq. 7; 5: extract the first k columns of U into U_k, and obtain W1 = U_k^T W, b1 = U_k^T b, W2 = U_k, and b2 = E[y] - U_k U_k^T E[y]; 6: return (W1, b1), (W2, b2). (A hedged PyTorch sketch of this layer split is given after the table.)
Open Source Code | No | The paper does not provide any explicit statements about releasing source code, nor does it include links to a code repository for the described methodology.
Open Datasets | Yes | Classification: the ImageNet-1K (Deng et al. 2009) dataset consists of 1.28 million training and 50K validation images. Object Detection & Segmentation: we evaluate object detection & segmentation performance on the MS COCO2017 (Lin et al. 2014) dataset. Language Modeling: we also evaluate our approach on the WikiText-103 (Merity et al. 2017) dataset.
Dataset Splits | Yes | The ImageNet-1K (Deng et al. 2009) dataset consists of 1.28 million training and 50K validation images. MS COCO2017 contains 80 categories with 118K training and 5K validation images. The training data of WikiText-103 comprises about 100M tokens from 28K articles with a vocabulary of around 260K. The test data contains 245K tokens in 4358 sentences.
Hardware Specification | Yes | We only compressed the four FC layers in the blocks and used eight NVIDIA 3090 GPUs to calculate the sensitivity scores. We also list the throughput on a 3090 GPU with a fixed 512 mini-batch size.
Software Dependencies | No | The paper states that 'All the experiments were conducted with PyTorch' but does not specify a version number for PyTorch or any other software dependencies such as Python, CUDA, or specific libraries with their versions.
Experiment Setup | Yes | When fine-tuning DeiT-B, we initialized the learning rate as 8e-5 and used a mini-batch size of 512. When we fine-tuned Swin-B & Swin-L, we set the learning rate and mini-batch size as 3e-5 and 256, respectively. In the above experiments, we used the AdamW (Loshchilov and Hutter 2018) optimizer and the cosine decay schedule (Loshchilov and Hutter 2017). The sub-models were fine-tuned for 1000 epochs and the weight decay was 0.01. Random horizontal flipping, color jittering, Mixup (Zhang et al. 2018) and CutMix (Yun et al. 2019) were applied as data augmentations. (A minimal optimizer/scheduler sketch matching these settings appears after the table.)
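
The Algorithm 1 pseudocode quoted above maps directly onto a small amount of PyTorch. Since no official code is released, the following is only a minimal sketch: it assumes the output second moment E[y y^T] and mean E[y] of the target FC layer have already been accumulated over a proxy dataset (e.g. via a forward hook), and the function and argument names (atomic_feature_mimicking, Eyy, Ey) are illustrative, not from the paper.

```python
# Minimal sketch of Atomic Feature Mimicking (Algorithm 1) for one nn.Linear layer.
# Eyy = E[y y^T] and Ey = E[y] are output statistics gathered on a proxy dataset.
import torch
import torch.nn as nn


@torch.no_grad()
def atomic_feature_mimicking(layer: nn.Linear, Eyy: torch.Tensor, Ey: torch.Tensor, k: int) -> nn.Sequential:
    """Split one FC layer (W, b) into two rank-k FC layers (W1, b1) and (W2, b2)."""
    # Covariance of the output features; its eigendecomposition gives U (Eq. 6 / Eq. 7).
    cov = Eyy - torch.outer(Ey, Ey)
    eigvals, U = torch.linalg.eigh(cov)                     # eigenvalues in ascending order
    Uk = U[:, eigvals.argsort(descending=True)[:k]]         # first k eigenvectors U_k

    # First layer: W1 = U_k^T W, b1 = U_k^T b (projects outputs into the k-dim subspace).
    fc1 = nn.Linear(layer.in_features, k, bias=True)
    fc1.weight.copy_(Uk.T @ layer.weight)
    fc1.bias.copy_(Uk.T @ layer.bias)

    # Second layer: W2 = U_k, b2 = E[y] - U_k U_k^T E[y] (maps back and restores the mean).
    fc2 = nn.Linear(k, layer.out_features, bias=True)
    fc2.weight.copy_(Uk)
    fc2.bias.copy_(Ey - Uk @ (Uk.T @ Ey))
    return nn.Sequential(fc1, fc2)
```

The composed output is U_k U_k^T (Wx + b) + (E[y] - U_k U_k^T E[y]), i.e. the original feature projected onto the top-k eigenspace of its covariance with the mean preserved, which is what the algorithm's step 5 specifies.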
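
The reported fine-tuning recipe can be assembled from standard PyTorch components. The sketch below only covers the optimizer and learning-rate schedule under the stated DeiT-B settings (lr 8e-5, weight decay 0.01, AdamW, cosine decay); the mini-batch size, augmentations (random horizontal flipping, color jittering, Mixup, CutMix) and exact schedule wiring live in the data pipeline and training loop, which the paper does not spell out, so this is not the authors' training script.

```python
# Minimal sketch of the reported DeiT-B fine-tuning optimizer and schedule.
# The model and epoch count are placeholders supplied by the caller.
import torch


def build_optimizer_and_scheduler(model: torch.nn.Module, num_epochs: int):
    # AdamW with the reported DeiT-B settings: learning rate 8e-5, weight decay 0.01.
    optimizer = torch.optim.AdamW(model.parameters(), lr=8e-5, weight_decay=0.01)
    # Cosine decay of the learning rate over the whole fine-tuning run;
    # scheduler.step() would be called once per epoch.
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)
    return optimizer, scheduler
```

For Swin-B & Swin-L the same construction applies with lr=3e-5 and a mini-batch size of 256, as quoted in the setup above.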