Thinking Forward: Memory-Efficient Federated Finetuning of Language Models

Authors: Kunjal Panchal, Nisarg Parikh, Sunav Choudhary, Lijun Zhang, Yuriy Brun, Hui Guan

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, SPRY reduces the memory footprint during training by 1.4–7.1× in contrast to backpropagation, while reaching comparable accuracy, across a wide range of language tasks, models, and FL settings. [...] We empirically evaluate SPRY's memory efficiency, accuracy, computation efficiency, and communication efficiency through experiments on a wide range of language tasks, models, and FL settings.
Researcher Affiliation | Collaboration | Kunjal Panchal, University of Massachusetts Amherst, MA 01003-9264, kpanchal@umass.edu; Nisarg Parikh, University of Massachusetts Amherst, MA 01003-9264, nkparikh@umass.edu; Sunav Choudhary, Adobe Research, Bangalore, India 560103, schoudha@adobe.com; Lijun Zhang, University of Massachusetts Amherst, MA 01003-9264, lijunzhang@cs.umass.edu; Yuriy Brun, University of Massachusetts Amherst, MA 01003-9264, brun@cs.umass.edu; Hui Guan, University of Massachusetts Amherst, MA 01003-9264, huiguan@cs.umass.edu
Pseudocode | Yes | Algorithm 1 shows the workflow of SPRY.
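The row above only points to the paper's Algorithm 1, but the primitive SPRY builds on, estimating gradients with a forward-mode Jacobian-vector product instead of backpropagation, can be sketched independently. The snippet below is a minimal illustration using PyTorch's torch.func API; the function name, the single random tangent, and the plain SGD-style update are our assumptions, and it omits the paper's federated aggregation and per-client handling of LoRA weights.

```python
import torch
from torch.func import functional_call, jvp

def forward_gradient_step(model, params, inputs, targets, loss_fn, lr=1e-4):
    """Illustrative forward-gradient step (not the paper's Algorithm 1).

    `params` is a dict of the trainable tensors, e.g.
    {n: p for n, p in model.named_parameters() if p.requires_grad}.
    """
    # One random tangent (perturbation direction) per trainable tensor.
    tangents = {name: torch.randn_like(p) for name, p in params.items()}

    def loss_of(p):
        logits = functional_call(model, p, (inputs,))
        return loss_fn(logits, targets)

    # A single forward pass returns the loss and its directional derivative
    # along `tangents`; no activations are kept for a backward pass.
    loss, dir_deriv = jvp(loss_of, (params,), (tangents,))

    # Forward-gradient estimate: scale each tangent by the directional derivative.
    with torch.no_grad():
        for name, p in params.items():
            p.sub_(lr * dir_deriv * tangents[name])
    return loss.detach()
```

Because only forward passes are needed, the activation memory that backpropagation would retain is avoided, which is the source of the memory savings the paper reports relative to backpropagation.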
Open Source Code | Yes | Our source code is available for replication at https://github.com/Astuary/Spry.
Open Datasets | Yes | Our evaluation uses 8 datasets: AG News [31] (4-class classification), SST2 [32] (2-class classification), Yelp [31] (2-class classification), Yahoo [31] (10-class classification), SNLI [33] (3-class classification), MNLI [34] (3-class classification), SQuADv2 [35] (Closed-book question answering), and MultiRC [36] (2-class classification). [...] Available at https://huggingface.co/datasets/ag_news, https://huggingface.co/datasets/yelp_polarity, https://huggingface.co/datasets/yahoo_answers_topics. Accessed on 15 May, 2024.
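For reference, several of the listed datasets can be pulled directly from the Hugging Face Hub. The sketch below uses the identifiers from the URLs cited above, plus the standard glue/sst2 location for SST2, which is our assumption rather than something the paper states.

```python
from datasets import load_dataset

# Dataset identifiers taken from the Hub URLs cited above; split and column
# names vary per dataset, so inspect an example before preprocessing.
ag_news = load_dataset("ag_news")               # 4-class topic classification
yelp    = load_dataset("yelp_polarity")         # 2-class sentiment
yahoo   = load_dataset("yahoo_answers_topics")  # 10-class topic classification
sst2    = load_dataset("glue", "sst2")          # 2-class sentiment (assumed Hub id)

print(ag_news["train"][0])  # e.g. {'text': '...', 'label': 3}
```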
Dataset Splits | Yes | Each dataset has two versions: (i) Dirichlet α = 1.0 (Homogeneous split), and (ii) Dirichlet α = 0.1 (Heterogeneous split). The default dataset split is across 1,000 clients, except the smallest datasets SST2 and MultiRC, where there are 100 clients. SQuADv2 has 500 total clients. [...] SST2... This dataset contains 67,349 training samples, 872 validation samples, and 1,821 testing samples.
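Label-skewed splits of this kind are commonly produced by drawing per-class client proportions from Dir(α). The sketch below shows that generic recipe; it is not the paper's partitioning code, and the seeding and handling of small classes are our assumptions.

```python
import numpy as np

def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Split sample indices across clients with label proportions ~ Dir(alpha).

    alpha = 1.0 gives a roughly homogeneous split; alpha = 0.1 gives a
    heterogeneous (label-skewed) split, matching the two versions above.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        # Fraction of this class assigned to each client.
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client_id, chunk in enumerate(np.split(idx, cuts)):
            client_indices[client_id].extend(chunk.tolist())
    return client_indices

# Example: partition 10,000 synthetic 4-class labels across 1,000 clients.
splits = dirichlet_partition(np.random.randint(0, 4, 10_000), num_clients=1000, alpha=0.1)
```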
Hardware Specification | Yes | We utilized two Nvidia 1080ti to conduct all experiments of sub-billion sized models and billion-sized models for SPRY and its zero-order methods. We used two RTX8000s and two A100s for Llama2-7B and OPT models on backpropagation-based methods respectively.
Software Dependencies | No | The paper mentions "SPRY is implemented in Flower [43] library. Quantization is done using AutoGPTQ [44]." However, it does not provide specific version numbers for these software libraries, which is required for reproducibility.
Experiment Setup | Yes | Unless otherwise mentioned in dataset-specific paragraphs, the default hyperparameters for each method and for all datasets are stated here. [...] The learning rate for backpropagation-based (FedAvg, FedYogi, and FedSGD), zero-order-based (FwdLLM, BAFFLE, and FedMeZO), and first-order-based SPRY is {1e-3, 5e-4, 1e-4, 1e-5}. The batch size is set to 8. The max sequence length is 128. [...] Default LoRA r and α are 1 and 1, respectively. All methods use AdamW as the client-side optimizer. Besides FedAvg, all methods use FedYogi as the server-side optimizer.
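For convenience, the quoted defaults can be gathered into a single configuration block. The dictionary below merely restates the values above; the key names are ours rather than the paper's.

```python
# Stated defaults from the experiment setup above; key names are illustrative.
DEFAULTS = {
    "learning_rate_grid": [1e-3, 5e-4, 1e-4, 1e-5],  # searched for every method
    "batch_size": 8,
    "max_sequence_length": 128,
    "lora_r": 1,
    "lora_alpha": 1,
    "client_optimizer": "AdamW",
    "server_optimizer": "FedYogi",  # FedAvg is the stated exception
}
```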