Flora: Low-Rank Adapters Are Secretly Gradient Compressors

Authors: Yongchang Hao, Yanshuai Cao, Lili Mou

ICML 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "We conduct experiments across different tasks and model architectures to verify the effectiveness of our approach." |
| Researcher Affiliation | Collaboration | Yongchang Hao (1), Yanshuai Cao (2), Lili Mou (1, 3). Project done during a Mitacs internship at Borealis AI. (1) Department of Computing Science & Alberta Machine Intelligence Institute (Amii), University of Alberta; (2) Borealis AI; (3) Canada CIFAR AI Chair. Correspondence to: Yongchang Hao <yongcha1@ualberta.ca>, Yanshuai Cao <yanshuai.cao@borealisai.com>, Lili Mou <doublepower.mou@gmail.com>. |
| Pseudocode | Yes | Algorithm 1: gradient accumulation with FLORA. Algorithm 2: momentum with FLORA. (Minimal sketches of both appear below this table.) |
| Open Source Code | Yes | "Please refer to our repository at https://github.com/BorealisAI/flora-opt." |
| Open Datasets | Yes | "For the summarization task, we train T5 on the XSum dataset (Narayan et al., 2018). ... train GPT-2 on the IWSLT-2017 German-English dataset (Cettolo et al., 2017). ... C4 dataset (Raffel et al., 2020). ... CIFAR-100 (Krizhevsky et al., 2009)." |
| Dataset Splits | Yes | "We sweep the learning rate from 10^-5 to 10^-1 with the naive accumulation method on the validation loss. The best learning rate is applied to other methods... The results are reported on the test set based on the checkpoint with the lowest validation loss." |
| Hardware Specification | No | The paper cites the memory usage of large models such as GPT-3 as motivation but does not specify the hardware (e.g., GPU models, CPU, memory) used for its own experiments. |
| Software Dependencies | No | The paper mentions using "the official Adafactor implementation in Optax (DeepMind et al., 2020)" but does not give version numbers for Optax or other software dependencies like Python, PyTorch/JAX, or CUDA. |
| Experiment Setup | Yes | The physical batch size is set to 1. We sweep the learning rate from 10^-5 to 10^-1... The hyper-parameter κ (resampling interval) is set to 1000 for all runs of FLORA (see the momentum sketch below for how κ is used). |
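
The two algorithms named in the Pseudocode row are not reproduced on this page, but the core idea is easy to sketch. Below is a minimal JAX sketch of what Algorithm 1 (gradient accumulation with FLORA) does, assuming a random Gaussian down-projection scaled so that E[AᵀA] ≈ I; the function name, rank, and shapes here are illustrative choices, not the authors' flora-opt API.

```python
import jax
import jax.numpy as jnp

def flora_accumulate(grads, key, rank=8):
    """Minimal sketch of compressed gradient accumulation (cf. Algorithm 1).

    grads: list of per-micro-batch gradients, each of shape (m, n).
    Only a (rank, n) accumulator is kept instead of a full (m, n) one.
    """
    m, n = grads[0].shape
    # Random down-projection; the 1/sqrt(rank) scaling keeps E[A^T A] = I.
    a = jax.random.normal(key, (rank, m)) / jnp.sqrt(rank)
    acc = jnp.zeros((rank, n))
    for g in grads:
        acc = acc + a @ g  # accumulate in the compressed (rank, n) space
    # Decompress the averaged gradient with the transpose (up-projection).
    return a.T @ acc / len(grads)
```

This is where the memory saving comes from: at rank 8, the accumulator for a 4096×4096 gradient shrinks from roughly 16.8M floats to about 33K.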
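Algorithm 2 (momentum with FLORA) keeps the momentum state in the same compressed space, and the κ = 1000 from the Experiment Setup row is the interval at which the random projection is resampled. The sketch below assumes the projection is re-derived from a seed rather than stored, and that on resampling the old state is carried into the new basis via A_new(A_oldᵀ M); again, this is an illustration under those assumptions, not the repository's implementation.

```python
import jax
import jax.numpy as jnp

def flora_momentum_step(grad, cm, key, step, kappa=1000, rank=8, beta=0.9):
    """Minimal sketch of compressed momentum (cf. Algorithm 2).

    grad: full gradient of shape (m, n); cm: compressed momentum (rank, n).
    The projection is re-derived from (key, step // kappa), so only a seed
    is stored, and the projection changes every `kappa` steps.
    """
    m, _ = grad.shape
    period = step // kappa
    a = jax.random.normal(jax.random.fold_in(key, period),
                          (rank, m)) / jnp.sqrt(rank)

    if step % kappa == 0 and step > 0:
        # Resampling: move the old compressed state into the new basis,
        # cm <- A_new (A_old^T cm).
        a_old = jax.random.normal(jax.random.fold_in(key, period - 1),
                                  (rank, m)) / jnp.sqrt(rank)
        cm = a @ (a_old.T @ cm)

    cm = beta * cm + a @ grad  # momentum update in the compressed space
    update = a.T @ cm          # decompressed step for the weight update
    return update, cm
```

Deriving the projection from a seed means the (rank, m) matrix never has to persist between steps, so the only extra state beyond the weights is the (rank, n) momentum buffer.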