DRONE: Data-aware Low-rank Compression for Large NLP Models

Authors: Patrick Chen, Hsiang-Fu Yu, Inderjit Dhillon, Cho-Jui Hsieh

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that DRONE is able to improve both model size and inference speed with limited loss in accuracy. Specifically, DRONE alone achieves 1.92x speedup on the MRPC task with only 1.5% loss in accuracy, and when DRONE is combined with distillation, it further achieves over 12.3x speedup on various natural language inference tasks.
Researcher Affiliation | Collaboration | Patrick H. Chen (UCLA, Los Angeles, CA, patrickchen@g.ucla.edu); Hsiang-Fu Yu (Amazon, Palo Alto, CA, rofu.yu@gmail.com); Inderjit S. Dhillon (UT Austin & Amazon, Palo Alto, CA, inderjit@cs.utexas.edu); Cho-Jui Hsieh (UCLA & Amazon, Los Angeles, CA, chohsieh@cs.ucla.edu)
Pseudocode | Yes | Algorithm 1: Data-Aware Low-rank Compression of a feed-forward layer. Algorithm 2: Overall Low-rank Model Approximation Algorithm. (A sketch of the data-aware low-rank step appears after this table.)
Open Source Code | Yes | 3. (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes]
Open Datasets | Yes | For LSTMs, we train a 2-layer LSTM-based language model from scratch with hidden size 1500 on the Penn Treebank (PTB) dataset. For BERT models, we evaluate the pre-trained BERT models on GLUE tasks.
Dataset Splits | Yes | Empirically, we found that randomly sub-sampling 10% of the training data suffices to provide good results. Using more data provides only a limited performance boost at the cost of longer preprocessing time. Thus, we use a 10% random sample of the training data to perform the experiments. (A calibration-sampling sketch appears after this table.)
Hardware Specification | Yes | Thus, we measure the inference speed on both CPU (Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz) and GPU (GeForce GTX 1080 Ti) devices.
Software Dependencies | No | The paper mentions using 'Huggingface's transformers' and provides a URL to their examples but does not specify exact version numbers for any software libraries, frameworks, or dependencies used in their experiments.
Experiment Setup | Yes | For BERT models, we use the BERT-base model, which contains 12 layers of the same structure without sharing parameters. Each layer contains an attention module with hidden size 768 and 12 channels, a small 768 × 768 feed-forward (FF) layer followed by 2 larger FF layers (768 × 3072 and 3072 × 768)... We use a relatively small learning rate of 10^-7 and retrain for 1 epoch on the sub-sampled training data to complete the fine-tuning procedure. (A fine-tuning sketch appears after this table.)
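
The sketches below are illustrative only. The first shows how the 10% calibration sample and the per-layer input activations needed by the data-aware compression might be gathered; the helper names (sample_calibration_inputs, collect_layer_inputs) and the use of forward hooks are assumptions for illustration, not the authors' released code, and the batch format assumes Huggingface-style dict inputs.

```python
import random
import torch

def sample_calibration_inputs(train_examples, fraction=0.1, seed=0):
    """Keep a ~10% random sample of the training set, as reported in the paper."""
    rng = random.Random(seed)
    return [ex for ex in train_examples if rng.random() < fraction]

def collect_layer_inputs(layer, model, batches, device="cpu"):
    """Record the inputs flowing into one linear layer; they form the columns of X."""
    captured = []
    hook = layer.register_forward_hook(
        lambda mod, inp, out: captured.append(
            inp[0].detach().reshape(-1, inp[0].shape[-1])  # (tokens, d_in)
        )
    )
    model.eval()
    with torch.no_grad():
        for batch in batches:
            model(**{k: v.to(device) for k, v in batch.items()})
    hook.remove()
    return torch.cat(captured, dim=0).T  # X: (d_in, n_tokens)
```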
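
For the data-aware low-rank step itself (Algorithm 1), here is a minimal NumPy sketch of the standard closed-form solution to minimizing ||W X − Ŵ X||_F over rank-k Ŵ, where X holds the collected layer inputs. This is the generic data-aware low-rank closed form under that objective; the paper's released implementation may differ in details such as rank selection and numerical thresholds.

```python
import numpy as np

def data_aware_low_rank(W, X, k, eps=1e-6):
    """Return A (d_out x k), B (k x d_in) minimizing ||W @ X - (A @ B) @ X||_F."""
    # Thin SVD of the calibration data; drop near-zero directions for stability.
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    keep = s > eps * s[0]
    U, s = U[:, keep], s[keep]

    # Project the weight onto the data subspace, weighted by the singular values.
    M = (W @ U) * s                                   # (d_out, r)

    # Best rank-k approximation of the projected product.
    Um, sm, Vmt = np.linalg.svd(M, full_matrices=False)
    A = Um[:, :k]                                     # (d_out, k)
    B = ((sm[:k, None] * Vmt[:k]) / s) @ U.T          # (k, d_in)
    return A, B

# Shape/usage check on random data (dimensions mimic a BERT-base FF layer).
rng = np.random.default_rng(0)
W = rng.standard_normal((3072, 768))
X = rng.standard_normal((768, 2000))
A, B = data_aware_low_rank(W, X, k=256)
rel_err = np.linalg.norm(W @ X - A @ (B @ X)) / np.linalg.norm(W @ X)
```

The compressed layer then computes x -> A(Bx), replacing one d_out × d_in matmul with two whose total size is k(d_out + d_in), which is where the reported speedup comes from when k is small.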
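
Finally, the reported retraining recipe (learning rate 10^-7, 1 epoch on the sub-sampled training data) could be wired up roughly as below. FactorizedLinear, the choice of AdamW, and the assumption that the model's forward returns a Huggingface-style output with a .loss field are all sketch-level assumptions, not details confirmed by the paper.

```python
import torch
import torch.nn as nn

class FactorizedLinear(nn.Module):
    """Drop-in replacement for an nn.Linear, initialized from the factors A, B."""
    def __init__(self, A, B, bias=None):
        super().__init__()
        k, d_in = B.shape
        d_out = A.shape[0]
        self.down = nn.Linear(d_in, k, bias=False)
        self.up = nn.Linear(k, d_out, bias=bias is not None)
        with torch.no_grad():
            self.down.weight.copy_(torch.as_tensor(B, dtype=torch.float32))
            self.up.weight.copy_(torch.as_tensor(A, dtype=torch.float32))
            if bias is not None:
                self.up.bias.copy_(torch.as_tensor(bias, dtype=torch.float32))

    def forward(self, x):
        return self.up(self.down(x))

def finetune_one_epoch(model, loader, lr=1e-7):
    """One retraining pass over the 10% sub-sampled training data."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)  # optimizer choice is an assumption
    model.train()
    for batch in loader:
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```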