Compressing Transformers: Features Are Low-Rank, but Weights Are Not!
Authors: Hao Yu, Jianxin Wu
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that our methods can compress both vanilla transformers and their variants in CV and NLP. All the experiments were conducted with PyTorch. Table 1 shows the results of compressing DeiT-B and Swin-B & Swin-L. We tested model accuracy on the ImageNet-1K validation dataset. |
| Researcher Affiliation | Academia | Hao Yu, Jianxin Wu* State Key Laboratory for Novel Software Technology, Nanjing University, China yuh@lamda.nju.edu.cn, wujx2001@nju.edu.cn |
| Pseudocode | Yes | Algorithm 1 Atomic Feature Mimicking. Input: the original model M with weights W and bias b in the i-th layer, the proxy dataset D, and a pre-set rank k. Output: two compressed FC layers with weights W1 and W2, and biases b1 and b2. 1: for each sample x in D do 2: forward propagate M(x) to obtain the output feature y in the i-th layer and update E[yy^T] and E[y]. 3: end for 4: calculate the eigenvectors U based on Eq. 6 and Eq. 7. 5: extract the first k columns of U into U_k, and obtain W1 = U_k^T W, b1 = U_k^T b, W2 = U_k, and b2 = E[y] - U_k U_k^T E[y]. 6: return (W1, b1), (W2, b2) |
| Open Source Code | No | The paper does not provide any explicit statements about releasing source code, nor does it include links to a code repository for the described methodology. |
| Open Datasets | Yes | Classification. The ImageNet-1K (Deng et al. 2009) dataset consists of 1.28 million training and 50K validation images. Object Detection & Segmentation. We evaluate object detection & segmentation performance on the MS COCO2017 (Lin et al. 2014) dataset. Language Modeling. We also evaluate our approach on the WikiText-103 (Merity et al. 2017) dataset. |
| Dataset Splits | Yes | The ImageNet-1K (Deng et al. 2009) dataset consists of 1.28 million training and 50K validation images. MS COCO2017 contains 80 categories with 118K training and 5K validation images, respectively. The training data of WikiText-103 comprises about 100M tokens with 28K articles and a vocabulary of around 260K. The test data contains 245K tokens with 4358 sentences. |
| Hardware Specification | Yes | We only compressed the four FC layers in the blocks and used eight NVIDIA 3090 GPUs to calculate the sensitivity scores. We also list the throughput on a 3090 GPU with a fixed 512 mini-batch size. |
| Software Dependencies | No | The paper states 'All the experiments were conducted with PyTorch.' but does not specify a version number for PyTorch or any other software dependencies such as Python, CUDA, or specific libraries with their versions. |
| Experiment Setup | Yes | When fine-tuning DeiT-B, we initialized the learning rate as 8e-5 and used a mini-batch size of 512. When we fine-tuned Swin-B & Swin-L, we set the learning rate and mini-batch size as 3e-5 and 256, respectively. In the above experiments, we used the AdamW (Loshchilov and Hutter 2018) optimizer and the cosine decay schedule (Loshchilov and Hutter 2017). The sub-models were fine-tuned for 1000 epochs and the weight decay was 0.01. Random horizontal flipping, color jittering, Mixup (Zhang et al. 2018) and CutMix (Yun et al. 2019) were applied as data augmentations. |
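The Atomic Feature Mimicking pseudocode quoted above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: it assumes the layer's second-moment statistics fit in memory and that the proxy outputs are collected as one array (`proxy_outputs`), and the function name `atomic_feature_mimicking` is our own.

```python
import numpy as np

def atomic_feature_mimicking(W, b, proxy_outputs, k):
    """Sketch of Algorithm 1 (Atomic Feature Mimicking).

    W: (d_out, d_in) weight and b: (d_out,) bias of the original FC layer.
    proxy_outputs: (n, d_out) outputs y of that layer on the proxy dataset D.
    k: pre-set target rank.
    Returns (W1, b1), (W2, b2) for the two replacement FC layers.
    """
    Y = proxy_outputs
    Ey = Y.mean(axis=0)                        # running estimate of E[y]
    Eyy = Y.T @ Y / Y.shape[0]                 # running estimate of E[y y^T]
    cov = Eyy - np.outer(Ey, Ey)               # feature covariance (Eq. 6/7 stand-in)
    # Eigendecomposition of the symmetric covariance; sort by descending eigenvalue
    vals, vecs = np.linalg.eigh(cov)
    Uk = vecs[:, np.argsort(vals)[::-1][:k]]   # first k eigenvectors as columns
    W1, b1 = Uk.T @ W, Uk.T @ b                # first FC layer: d_in -> k
    W2, b2 = Uk, Ey - Uk @ (Uk.T @ Ey)         # second FC layer: k -> d_out
    return (W1, b1), (W2, b2)
```

The composed layers compute U_k U_k^T (Wx + b - E[y]) + E[y], i.e. they project the centered features onto the top-k eigenspace; with k equal to the full feature dimension the original layer is recovered exactly.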
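The DeiT-B fine-tuning configuration reported above (AdamW, lr 8e-5, weight decay 0.01, cosine decay over 1000 epochs) maps directly onto standard PyTorch APIs. A minimal sketch, with a placeholder `model` and the training loop elided, since the paper does not publish code:

```python
import torch

# Placeholder model; the paper fine-tunes compressed DeiT-B / Swin sub-models.
model = torch.nn.Linear(768, 1000)

# AdamW with lr 8e-5 and weight decay 0.01, as reported for DeiT-B
optimizer = torch.optim.AdamW(model.parameters(), lr=8e-5, weight_decay=0.01)
# Cosine decay schedule over the reported 1000 fine-tuning epochs
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1000)

for epoch in range(1000):
    ...  # one epoch over the training data (mini-batch size 512 for DeiT-B)
    scheduler.step()
```

For Swin-B & Swin-L the same sketch applies with lr 3e-5 and mini-batch size 256; the augmentations (horizontal flip, color jitter, Mixup, CutMix) would be applied in the elided data pipeline.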