DRONE: Data-aware Low-rank Compression for Large NLP Models
Authors: Patrick Chen, Hsiang-Fu Yu, Inderjit Dhillon, Cho-Jui Hsieh
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that DRONE is able to improve both model size and inference speed with limited loss in accuracy. Specifically, DRONE alone achieves 1.92x speedup on the MRPC task with only 1.5% loss in accuracy, and when DRONE is combined with distillation, it further achieves over 12.3x speedup on various natural language inference tasks. |
| Researcher Affiliation | Collaboration | Patrick H. Chen, UCLA, Los Angeles, CA (patrickchen@g.ucla.edu); Hsiang-Fu Yu, Amazon, Palo Alto, CA (rofu.yu@gmail.com); Inderjit S. Dhillon, UT Austin & Amazon, Palo Alto, CA (inderjit@cs.utexas.edu); Cho-Jui Hsieh, UCLA & Amazon, Los Angeles, CA (chohsieh@cs.ucla.edu) |
| Pseudocode | Yes | Algorithm 1: Data-Aware Low-rank Compression of a feed-forward layer. Algorithm 2: Overall Low-rank Model Approximation Algorithm. (A minimal sketch of the data-aware step follows the table.) |
| Open Source Code | Yes | 3. (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] |
| Open Datasets | Yes | For LSTMs, we train a 2-layer LSTM-based language model from scratch with hidden size 1500 on the Penn Treebank (PTB) dataset. For BERT models, we evaluate the pre-trained BERT models on GLUE tasks. |
| Dataset Splits | Yes | Empirically, we found that randomly sub-sampling 10% of the training data suffices to provide good results. Using more data provides only a limited performance boost but comes at the cost of longer preprocessing time. Thus, we use a 10% random sample of the training data to perform the experiments. (A calibration-pass sketch follows the table.) |
| Hardware Specification | Yes | Thus, we measure the inference speed on both CPU (Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz) and GPU (GeForce GTX 1080 Ti) devices. |
| Software Dependencies | No | The paper mentions using 'Huggingface's transformers' and provides a URL to their examples but does not specify exact version numbers for any software libraries, frameworks, or dependencies used in their experiments. |
| Experiment Setup | Yes | For BERT models, we use the BERT-base model, which contains 12 layers of the same structure without shared parameters. Each layer contains an attention module with hidden size 768 and 12 channels, a small 768 × 768 feed-forward (FF) layer, followed by 2 larger FF layers (768 × 3072 and 3072 × 768)... We use a relatively small learning rate of 10^-7 and retrain for 1 epoch on the sub-sampled training data to complete the fine-tuning procedure. (A fine-tuning sketch follows the table.) |
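
The pseudocode row lists Algorithm 1, a data-aware low-rank compression of a feed-forward layer. As a reading aid, here is a minimal NumPy sketch of that idea: approximate a weight matrix W so that the error is measured on sampled activations X (i.e. minimize ||WX - W1 W2 X||_F) rather than on W alone. The closed form below, via SVDs of X and of W restricted to X's subspace, is a standard solution to this objective; the function and variable names are illustrative and not taken from the paper's code.

```python
import numpy as np

def data_aware_low_rank(W: np.ndarray, X: np.ndarray, k: int):
    """Return thin factors W1 (m x k), W2 (k x d) with W @ x ~= W1 @ (W2 @ x) on data like X.

    W : (m, d) weight matrix of a feed-forward layer
    X : (d, n) matrix whose columns are sampled input activations
    k : target rank
    """
    # Thin SVD of the calibration activations; drop near-zero directions
    # so the pseudo-inverse below stays well conditioned.
    Ux, sx, _ = np.linalg.svd(X, full_matrices=False)
    keep = sx > 1e-8 * sx[0]
    Ux, sx = Ux[:, keep], sx[keep]

    # B describes how W acts on the data subspace, weighted by its energy:
    # minimizing ||W X - M X||_F over rank-k M reduces to a best rank-k
    # approximation of B.
    B = (W @ Ux) * sx
    Ub, sb, Vbt = np.linalg.svd(B, full_matrices=False)
    Ub_k, sb_k, Vbt_k = Ub[:, :k], sb[:k], Vbt[:k, :]

    # Map the truncated factors back out of the data subspace.
    W1 = Ub_k                                    # (m, k)
    W2 = (sb_k[:, None] * Vbt_k) @ (Ux / sx).T   # (k, d)
    return W1, W2
```

Note that replacing one m × d product with two thin products only reduces work when k(m + d) < m·d, so the achievable speedup depends on how small a rank each layer tolerates.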
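
The dataset-splits row states that a 10% random subsample of the training data is enough for the decomposition. Below is a hedged PyTorch sketch of such a calibration pass under that assumption: run the frozen model over the subsample and record the inputs each linear layer sees, which serve as the data matrices X in the sketch above. The dict-style batches and the "compress every nn.Linear" filtering are assumptions, not details from the paper.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, Subset

def collect_activations(model: nn.Module, train_set, fraction: float = 0.1,
                        batch_size: int = 32, device: str = "cpu"):
    # Randomly keep ~10% of the training examples for calibration.
    n_keep = max(1, int(len(train_set) * fraction))
    idx = torch.randperm(len(train_set))[:n_keep]
    loader = DataLoader(Subset(train_set, idx.tolist()), batch_size=batch_size)

    captured = {name: [] for name, m in model.named_modules() if isinstance(m, nn.Linear)}
    hooks = []
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            # Store the flattened inputs seen by this linear layer.
            hooks.append(module.register_forward_hook(
                lambda mod, inp, out, name=name: captured[name].append(
                    inp[0].detach().reshape(-1, inp[0].shape[-1]).cpu())))

    model.eval().to(device)
    with torch.no_grad():
        for batch in loader:
            # Assumes dict-style batches of tensors (Huggingface-style).
            model(**{k: v.to(device) for k, v in batch.items()})

    for h in hooks:
        h.remove()
    # Each X has shape (d, n): columns are observed input activations.
    return {name: torch.cat(acts).T for name, acts in captured.items() if acts}
```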
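
Finally, the experiment-setup row mentions retraining for one epoch with learning rate 10^-7 after compression. Here is a minimal sketch of that fine-tuning step, assuming the Huggingface Trainer API (the paper references Huggingface's transformers without version numbers); the output directory and batch size are placeholders not stated in the excerpt.

```python
from transformers import Trainer, TrainingArguments

def finetune_compressed(model, calib_subset):
    args = TrainingArguments(
        output_dir="drone_finetune",      # placeholder path
        learning_rate=1e-7,               # small learning rate reported in the paper
        num_train_epochs=1,               # retrain for a single epoch
        per_device_train_batch_size=32,   # assumed; not stated in the excerpt
    )
    Trainer(model=model, args=args, train_dataset=calib_subset).train()
    return model
```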