Flora: Low-Rank Adapters Are Secretly Gradient Compressors
Authors: Yongchang Hao, Yanshuai Cao, Lili Mou
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments across different tasks and model architectures to verify the effectiveness of our approach. |
| Researcher Affiliation | Collaboration | Yongchang Hao 1 Yanshuai Cao 2 Lili Mou 1 3 Project done during Mitacs internship at Borealis AI. 1Department of Computing Science & Alberta Machine Intelligence Institute (Amii), University of Alberta 2Borealis AI 3Canada CIFAR AI Chair. Correspondence to: Yongchang Hao <yongcha1@ualberta.ca>, Yanshuai Cao <yanshuai.cao@borealisai.com>, Lili Mou <doublepower.mou@gmail.com>. |
| Pseudocode | Yes | Algorithm 1 Gradient accumulation with FLORA. Algorithm 2 Momentum with FLORA. (A hedged sketch of the compressed-accumulation idea appears after this table.) |
| Open Source Code | Yes | Please refer to our repository at https://github.com/BorealisAI/flora-opt. |
| Open Datasets | Yes | For the summarization task, we train T5 on the XSum dataset (Narayan et al., 2018). ... train GPT-2 on the IWSLT-2017 German-English dataset (Cettolo et al., 2017). ... C4 dataset (Raffel et al., 2020). ... CIFAR-100 (Krizhevsky et al., 2009). |
| Dataset Splits | Yes | We sweep the learning rate from 10^-5 to 10^-1 with the naive accumulation method on the validation loss. The best learning rate is applied to other methods... The results are reported on the test set based on the checkpoint with the lowest validation loss. |
| Hardware Specification | No | The paper mentions memory usage for large models like GPT-3 as a motivation but does not specify the hardware (e.g., GPU models, CPU, memory) used for its own experiments. |
| Software Dependencies | No | The paper mentions using 'the official Adafactor implementation in Optax (DeepMind et al., 2020)' but does not provide specific version numbers for Optax or other software dependencies like Python, PyTorch/JAX, or CUDA. |
| Experiment Setup | Yes | The physical batch size is set to 1. We sweep the learning rate from 10^-5 to 10^-1... The hyper-parameter κ (resampling interval) is set to 1000 for all runs of FLORA. (The tuning protocol is sketched after this table.) |
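
The "Pseudocode" row quotes Algorithms 1 and 2 of the paper, which accumulate gradients in a randomly down-projected space and periodically resample the projection. The following is a minimal NumPy sketch of that idea under stated assumptions: it covers only a single weight matrix, projects on the column side, and the names `make_projection`, `flora_accumulate`, and `grad_fn`, the default rank of 16, and the 1/sqrt(rank) scaling are illustrative choices rather than the authors' implementation (see the linked repository for the real code).

```python
import numpy as np

def make_projection(rank, n_cols, seed):
    """Sample a random down-projection; a fresh seed is drawn every kappa steps."""
    rng = np.random.default_rng(seed)
    # Scaled Gaussian so that E[P P^T] ≈ I, keeping gradient magnitudes roughly unchanged.
    return rng.standard_normal((rank, n_cols)) / np.sqrt(rank)

def flora_accumulate(grad_fn, num_steps, shape, rank=16, kappa=1000, seed=0):
    """Accumulate gradients of an (m, n) weight matrix in a compressed (m, rank) buffer.

    Sketch only: the paper's Algorithm 1 (gradient accumulation) and Algorithm 2
    (momentum) integrate this with the optimizer update; this function just returns
    the decompressed, approximately accumulated gradient.
    """
    m, n = shape
    proj = make_projection(rank, n, seed)
    acc = np.zeros((m, rank))           # compressed accumulator: m*rank floats instead of m*n
    for step in range(num_steps):
        if step > 0 and step % kappa == 0:
            # Resample the projection and transfer the accumulator to the new subspace.
            new_proj = make_projection(rank, n, seed + step)
            acc = (acc @ proj) @ new_proj.T
            proj = new_proj
        g = grad_fn(step)               # full gradient of shape (m, n)
        acc += g @ proj.T               # store only the down-projected gradient
    return acc @ proj                   # decompress to an (m, n) approximation
```

For example, `flora_accumulate(lambda t: np.ones((64, 256)), num_steps=8, shape=(64, 256), rank=8, kappa=4)` accumulates eight dummy gradients for a 64x256 matrix while only ever storing a 64x8 buffer.
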
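The tuning protocol quoted in the "Dataset Splits" and "Experiment Setup" rows (sweep the learning rate on the naive accumulation baseline by validation loss, then reuse the best value for the other methods) can be summarized in a few lines. This is a hypothetical sketch: `run_and_validate` stands in for a full training-plus-evaluation run, and the method names are placeholders rather than identifiers from the paper's code.

```python
import numpy as np

learning_rates = np.logspace(-5, -1, num=5)   # 1e-5, 1e-4, ..., 1e-1
kappa = 1000                                  # FLORA resampling interval (all runs)
physical_batch_size = 1

def run_and_validate(method, lr):
    # Placeholder: train `method` at learning rate `lr` and return its validation loss.
    return float(np.random.default_rng(hash((method, float(lr))) % 2**32).random())

# Pick the learning rate with the lowest validation loss for the naive baseline,
# then apply that same learning rate to the other methods.
best_lr = min(learning_rates, key=lambda lr: run_and_validate("naive_accumulation", lr))
val_losses = {m: run_and_validate(m, best_lr)
              for m in ("naive_accumulation", "lora", "flora")}
```
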