LoRA: Low-Rank Adaptation of Large Language Models

Authors: Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the downstream task performance of LoRA on RoBERTa (Liu et al., 2019), DeBERTa (He et al., 2021), and GPT-2 (Radford et al., b), before scaling up to GPT-3 175B (Brown et al., 2020). Our experiments cover a wide range of tasks, from natural language understanding (NLU) to generation (NLG). Specifically, we evaluate on the GLUE (Wang et al., 2019) benchmark for RoBERTa and DeBERTa. We follow the setup of Li & Liang (2021) on GPT-2 for a direct comparison and add WikiSQL (Zhong et al., 2017) (NL to SQL queries) and SAMSum (Gliwa et al., 2019) (conversation summarization) for large-scale experiments on GPT-3. See Appendix D for more details on the datasets we use.
Researcher Affiliation | Collaboration | Edward Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen (Microsoft Corporation). Emails: edward.hu@mila.quebec; {yeshe, phwallis, zeyuana, swang, luw, wzchen}@microsoft.com; yuanzhil@andrew.cmu.edu
Pseudocode | No | The paper describes the LoRA mechanism mathematically (e.g., h = W0x + ΔWx = W0x + BAx) and explains its steps verbally, but it does not include a formally labeled 'Pseudocode' or 'Algorithm' block. (A minimal code rendering of this update follows the table.)
Open Source Code | Yes | We release a package that facilitates the integration of LoRA with PyTorch models and provide our implementations and model checkpoints for RoBERTa, DeBERTa, and GPT-2 at https://github.com/microsoft/LoRA. (A brief usage sketch follows the table.)
Open Datasets | Yes | We evaluate on the GLUE (Wang et al., 2019) benchmark for RoBERTa and DeBERTa. ... and add WikiSQL (Zhong et al., 2017) (NL to SQL queries) and SAMSum (Gliwa et al., 2019) (conversation summarization) for large-scale experiments on GPT-3.
Dataset Splits | No | The paper refers to "validation accuracy" in its tables (e.g., Table 4) and figures (Figure 2), but it does not explicitly provide specific details on the train/validation/test dataset splits (e.g., percentages, sample counts, or the methodology for creating these splits).
Hardware Specification | Yes | The inference latency introduced by adapter layers can be significant in an online, short-sequence-length scenario. See the full study in Appendix C. ... We use an NVIDIA Quadro RTX8000. ... We use NVIDIA Tesla V100 for all experiments.
Software Dependencies | No | The paper mentions software components like "PyTorch models" and the "Hugging Face Transformers library (Wolf et al., 2020)", but it does not specify concrete version numbers for these software dependencies, which would be necessary for full reproducibility.
Experiment Setup | Yes | We use a random Gaussian initialization for A and zero for B, so ΔW = BA is zero at the beginning of training. We then scale ΔWx by α/r, where α is a constant in r. When optimizing with Adam, tuning α is roughly the same as tuning the learning rate if we scale the initialization appropriately. As a result, we simply set α to the first r we try and do not tune it. This scaling helps to reduce the need to retune hyperparameters when we vary r (Yang & Hu, 2021). ... We follow the conventions set out by (Vaswani et al., 2017; Brown et al., 2020) and use Adam (Loshchilov & Hutter, 2019; Kingma & Ba, 2017) for model optimization and use a Transformer MLP feedforward dimension d_ffn = 4 × d_model. ... To ensure a fair comparison, we make two crucial changes to how we evaluate LoRA when comparing with adapters. First, we use the same batch size for all tasks and use a sequence length of 128 to match the adapter baselines. (See the initialization and scaling sketch after the table.)
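
The 'Pseudocode' and 'Experiment Setup' rows quote the paper's description of the LoRA update: the frozen forward pass W0x is augmented with a low-rank term BAx, where A is initialized from a random Gaussian, B is initialized to zero (so ΔW = BA starts at zero), and the added term is scaled by α/r. The sketch below is one way to render that description as a single PyTorch layer; the class and attribute names (LoRALinear, lora_A, lora_B) and the initialization constants are illustrative assumptions, not taken from the paper or the released package.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal sketch of h = W0 x + (alpha / r) * B A x for one linear layer."""
    def __init__(self, in_features, out_features, r=8, alpha=8):
        super().__init__()
        # Frozen pretrained weight W0; here a random stand-in for a checkpointed weight.
        self.weight = nn.Parameter(torch.empty(out_features, in_features), requires_grad=False)
        nn.init.normal_(self.weight, std=0.02)
        # Low-rank factors: A is Gaussian-initialized, B starts at zero,
        # so DeltaW = BA is zero at the beginning of training.
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        self.scaling = alpha / r  # the alpha/r scaling described in the paper

    def forward(self, x):
        frozen = x @ self.weight.T                    # W0 x (no gradient)
        update = (x @ self.lora_A.T) @ self.lora_B.T  # B A x, computed through the rank-r bottleneck
        return frozen + self.scaling * update

layer = LoRALinear(768, 768, r=4, alpha=4)
h = layer(torch.randn(2, 768))  # only lora_A and lora_B receive gradients

Because B starts at zero, the adapted layer reproduces the pretrained layer exactly at the start of training, and the α/r factor keeps the update magnitude roughly comparable when r changes, which matches the paper's note that α is simply set to the first r tried rather than tuned.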
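
The 'Open Source Code' row points to the released package at https://github.com/microsoft/LoRA. As a rough illustration of how such a package is wired into a PyTorch model, the sketch below swaps one nn.Linear for a LoRA-augmented layer and then freezes everything else; the module name loralib and the lora.Linear, lora.mark_only_lora_as_trainable, and lora.lora_state_dict calls follow that repository's README, but exact names and signatures should be verified against the released code.

import torch
import torch.nn as nn
import loralib as lora  # pip install loralib

# Hypothetical two-layer model; only the first projection carries a LoRA update.
model = nn.Sequential(
    lora.Linear(768, 768, r=8),  # adds rank-8 factors A and B next to the base weight W0
    nn.ReLU(),
    nn.Linear(768, 10),          # ordinary layer; frozen by the call below, no LoRA parameters
)

# Mark only the LoRA factors as trainable before fine-tuning.
lora.mark_only_lora_as_trainable(model)

# Checkpoints then only need the small LoRA weights.
torch.save(lora.lora_state_dict(model), "lora_checkpoint.pt")

In the paper's GPT-3 experiments the adapted matrices are attention projections (e.g., Wq and Wv) inside the Transformer rather than a standalone Sequential model; the snippet above only illustrates the general freeze-and-adapt workflow.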