DRONE: Data-aware Low-rank Compression for Large NLP Models

Authors: Patrick Chen, Hsiang-Fu Yu, Inderjit Dhillon, Cho-Jui Hsieh

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that DRONE is able to improve both model size and inference speed with limited loss in accuracy. Specifically, DRONE alone achieves 1.92x speedup on the MRPC task with only 1.5% loss in accuracy, and when DRONE is combined with distillation, it further achieves over 12.3x speedup on various natural language inference tasks.
Researcher Affiliation | Collaboration | Patrick H. Chen (UCLA, Los Angeles, CA, patrickchen@g.ucla.edu); Hsiang-Fu Yu (Amazon, Palo Alto, CA, rofu.yu@gmail.com); Inderjit S. Dhillon (UT Austin & Amazon, Palo Alto, CA, inderjit@cs.utexas.edu); Cho-Jui Hsieh (UCLA & Amazon, Los Angeles, CA, chohsieh@cs.ucla.edu)
Pseudocode | Yes | Algorithm 1: Data-Aware Low-rank Compression of a feed-forward layer. Algorithm 2: Overall Low-rank Model Approximation Algorithm. (A sketch of the data-aware low-rank step appears after this table.)
Open Source Code | Yes | 3. (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes]
Open Datasets | Yes | For LSTMs, we train a 2-layer LSTM-based language model from scratch with hidden size 1500 on the Penn Treebank (PTB) dataset. For BERT models, we evaluate the pre-trained BERT models on GLUE tasks.
Dataset Splits | Yes | Empirically, we found that randomly sub-sampling 10% of the training data suffices to provide good results. Using more data provides only a limited performance boost at the cost of longer preprocessing time. Thus, we use a 10% random sample of the training data to perform the experiments. (A calibration-sampling sketch appears after this table.)
Hardware Specification | Yes | Thus, we measure the inference speed on both CPU (Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz) and GPU (GeForce GTX 1080 Ti) devices.
Software Dependencies | No | The paper mentions using 'Huggingface's transformers' and provides a URL to their examples but does not specify exact version numbers for any software libraries, frameworks, or dependencies used in their experiments.
Experiment Setup | Yes | For BERT models, we use the BERT-base model, which contains 12 layers of the same structure without sharing parameters. Each layer contains an attention module with hidden size 768 and 12 channels, a small 768 × 768 feed-forward (FF) layer followed by 2 larger FF layers (768 × 3072 and 3072 × 768)... We use a relatively small learning rate of 10^-7 and retrain for 1 epoch on the sub-sampled training data to complete the fine-tuning procedure. (A fine-tuning sketch appears after this table.)
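
The sketches below are illustrative only. The first shows how the 10% calibration sample and the per-layer input activations needed by the data-aware compression might be gathered; the helper names (sample_calibration_inputs, collect_layer_inputs) and the use of forward hooks are assumptions for illustration, not the authors' released code, and the batch format assumes Huggingface-style dict inputs.

```python
import random
import torch

def sample_calibration_inputs(train_examples, fraction=0.1, seed=0):
    """Keep a ~10% random sample of the training set, as reported in the paper."""
    rng = random.Random(seed)
    return [ex for ex in train_examples if rng.random() < fraction]

def collect_layer_inputs(layer, model, batches, device="cpu"):
    """Record the inputs flowing into one linear layer; they form the columns of X."""
    captured = []
    hook = layer.register_forward_hook(
        lambda mod, inp, out: captured.append(
            inp[0].detach().reshape(-1, inp[0].shape[-1])  # (tokens, d_in)
        )
    )
    model.eval()
    with torch.no_grad():
        for batch in batches:
            model(**{k: v.to(device) for k, v in batch.items()})
    hook.remove()
    return torch.cat(captured, dim=0).T  # X: (d_in, n_tokens)
```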
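
For the data-aware low-rank step itself (Algorithm 1), here is a minimal NumPy sketch of the standard closed-form solution to minimizing ||W X − Ŵ X||_F over rank-k Ŵ, where X holds the collected layer inputs. This is the generic data-aware low-rank closed form under that objective; the paper's released implementation may differ in details such as rank selection and numerical thresholds.

```python
import numpy as np

def data_aware_low_rank(W, X, k, eps=1e-6):
    """Return A (d_out x k), B (k x d_in) minimizing ||W @ X - (A @ B) @ X||_F."""
    # Thin SVD of the calibration data; drop near-zero directions for stability.
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    keep = s > eps * s[0]
    U, s = U[:, keep], s[keep]

    # Project the weight onto the data subspace, weighted by the singular values.
    M = (W @ U) * s                                   # (d_out, r)

    # Best rank-k approximation of the projected product.
    Um, sm, Vmt = np.linalg.svd(M, full_matrices=False)
    A = Um[:, :k]                                     # (d_out, k)
    B = ((sm[:k, None] * Vmt[:k]) / s) @ U.T          # (k, d_in)
    return A, B

# Shape/usage check on random data (dimensions mimic a BERT-base FF layer).
rng = np.random.default_rng(0)
W = rng.standard_normal((3072, 768))
X = rng.standard_normal((768, 2000))
A, B = data_aware_low_rank(W, X, k=256)
rel_err = np.linalg.norm(W @ X - A @ (B @ X)) / np.linalg.norm(W @ X)
```

The compressed layer then computes x -> A(Bx), replacing one d_out × d_in matmul with two whose total size is k(d_out + d_in), which is where the reported speedup comes from when k is small.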
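
Finally, the reported retraining recipe (learning rate 10^-7, 1 epoch on the sub-sampled training data) could be wired up roughly as below. FactorizedLinear, the choice of AdamW, and the assumption that the model's forward returns a Huggingface-style output with a .loss field are all sketch-level assumptions, not details confirmed by the paper.

```python
import torch
import torch.nn as nn

class FactorizedLinear(nn.Module):
    """Drop-in replacement for an nn.Linear, initialized from the factors A, B."""
    def __init__(self, A, B, bias=None):
        super().__init__()
        k, d_in = B.shape
        d_out = A.shape[0]
        self.down = nn.Linear(d_in, k, bias=False)
        self.up = nn.Linear(k, d_out, bias=bias is not None)
        with torch.no_grad():
            self.down.weight.copy_(torch.as_tensor(B, dtype=torch.float32))
            self.up.weight.copy_(torch.as_tensor(A, dtype=torch.float32))
            if bias is not None:
                self.up.bias.copy_(torch.as_tensor(bias, dtype=torch.float32))

    def forward(self, x):
        return self.up(self.down(x))

def finetune_one_epoch(model, loader, lr=1e-7):
    """One retraining pass over the 10% sub-sampled training data."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)  # optimizer choice is an assumption
    model.train()
    for batch in loader:
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```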