Flora: Low-Rank Adapters Are Secretly Gradient Compressors

Authors: Yongchang Hao, Yanshuai Cao, Lili Mou

ICML 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "We conduct experiments across different tasks and model architectures to verify the effectiveness of our approach." |
| Researcher Affiliation | Collaboration | Yongchang Hao (1), Yanshuai Cao (2), Lili Mou (1, 3). Project done during a Mitacs internship at Borealis AI. (1) Department of Computing Science & Alberta Machine Intelligence Institute (Amii), University of Alberta; (2) Borealis AI; (3) Canada CIFAR AI Chair. Correspondence to: Yongchang Hao <yongcha1@ualberta.ca>, Yanshuai Cao <yanshuai.cao@borealisai.com>, Lili Mou <doublepower.mou@gmail.com>. |
| Pseudocode | Yes | Algorithm 1: gradient accumulation with FLORA. Algorithm 2: momentum with FLORA. (Minimal sketches of both appear below this table.) |
| Open Source Code | Yes | "Please refer to our repository at https://github.com/BorealisAI/flora-opt." |
| Open Datasets | Yes | "For the summarization task, we train T5 on the XSum dataset (Narayan et al., 2018). ... train GPT-2 on the IWSLT-2017 German-English dataset (Cettolo et al., 2017). ... C4 dataset (Raffel et al., 2020). ... CIFAR-100 (Krizhevsky et al., 2009)." |
| Dataset Splits | Yes | "We sweep the learning rate from 10^-5 to 10^-1 with the naive accumulation method on the validation loss. The best learning rate is applied to other methods... The results are reported on the test set based on the checkpoint with the lowest validation loss." |
| Hardware Specification | No | The paper cites the memory usage of large models such as GPT-3 as motivation but does not specify the hardware (e.g., GPU models, CPU, memory) used for its own experiments. |
| Software Dependencies | No | The paper mentions using "the official Adafactor implementation in Optax (DeepMind et al., 2020)" but does not give version numbers for Optax or other software dependencies like Python, PyTorch/JAX, or CUDA. |
| Experiment Setup | Yes | The physical batch size is set to 1. We sweep the learning rate from 10^-5 to 10^-1... The hyper-parameter κ (resampling interval) is set to 1000 for all runs of FLORA (see the momentum sketch below for how κ is used). |
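
The two algorithms named in the Pseudocode row are not reproduced on this page, but the core idea is easy to sketch. Below is a minimal JAX sketch of what Algorithm 1 (gradient accumulation with FLORA) does, assuming a random Gaussian down-projection scaled so that E[AᵀA] ≈ I; the function name, rank, and shapes here are illustrative choices, not the authors' flora-opt API.

```python
import jax
import jax.numpy as jnp

def flora_accumulate(grads, key, rank=8):
    """Minimal sketch of compressed gradient accumulation (cf. Algorithm 1).

    grads: list of per-micro-batch gradients, each of shape (m, n).
    Only a (rank, n) accumulator is kept instead of a full (m, n) one.
    """
    m, n = grads[0].shape
    # Random down-projection; the 1/sqrt(rank) scaling keeps E[A^T A] = I.
    a = jax.random.normal(key, (rank, m)) / jnp.sqrt(rank)
    acc = jnp.zeros((rank, n))
    for g in grads:
        acc = acc + a @ g  # accumulate in the compressed (rank, n) space
    # Decompress the averaged gradient with the transpose (up-projection).
    return a.T @ acc / len(grads)
```

This is where the memory saving comes from: at rank 8, the accumulator for a 4096×4096 gradient shrinks from roughly 16.8M floats to about 33K.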
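Algorithm 2 (momentum with FLORA) keeps the momentum state in the same compressed space, and the κ = 1000 from the Experiment Setup row is the interval at which the random projection is resampled. The sketch below assumes the projection is re-derived from a seed rather than stored, and that on resampling the old state is carried into the new basis via A_new(A_oldᵀ M); again, this is an illustration under those assumptions, not the repository's implementation.

```python
import jax
import jax.numpy as jnp

def flora_momentum_step(grad, cm, key, step, kappa=1000, rank=8, beta=0.9):
    """Minimal sketch of compressed momentum (cf. Algorithm 2).

    grad: full gradient of shape (m, n); cm: compressed momentum (rank, n).
    The projection is re-derived from (key, step // kappa), so only a seed
    is stored, and the projection changes every `kappa` steps.
    """
    m, _ = grad.shape
    period = step // kappa
    a = jax.random.normal(jax.random.fold_in(key, period),
                          (rank, m)) / jnp.sqrt(rank)

    if step % kappa == 0 and step > 0:
        # Resampling: move the old compressed state into the new basis,
        # cm <- A_new (A_old^T cm).
        a_old = jax.random.normal(jax.random.fold_in(key, period - 1),
                                  (rank, m)) / jnp.sqrt(rank)
        cm = a @ (a_old.T @ cm)

    cm = beta * cm + a @ grad  # momentum update in the compressed space
    update = a.T @ cm          # decompressed step for the weight update
    return update, cm
```

Deriving the projection from a seed means the (rank, m) matrix never has to persist between steps, so the only extra state beyond the weights is the (rank, n) momentum buffer.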