Thinking Forward: Memory-Efficient Federated Finetuning of Language Models
Authors: Kunjal Panchal, Nisarg Parikh, Sunav Choudhary, Lijun Zhang, Yuriy Brun, Hui Guan
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, SPRY reduces the memory footprint during training by 1.4-7.1× in contrast to backpropagation, while reaching comparable accuracy, across a wide range of language tasks, models, and FL settings. [...] We empirically evaluate SPRY's memory efficiency, accuracy, computation efficiency, and communication efficiency through experiments on a wide range of language tasks, models, and FL settings. (A minimal forward-gradient sketch illustrating the memory-saving mechanism appears below the table.) |
| Researcher Affiliation | Collaboration | Kunjal Panchal University of Massachusetts Amherst, MA 01003-9264 kpanchal@umass.edu Nisarg Parikh University of Massachusetts Amherst, MA 01003-9264 nkparikh@umass.edu Sunav Choudhary Adobe Research Bangalore, India 560103 schoudha@adobe.com Lijun Zhang University of Massachusetts Amherst, MA 01003-9264 lijunzhang@cs.umass.edu Yuriy Brun University of Massachusetts Amherst, MA 01003-9264 brun@cs.umass.edu Hui Guan University of Massachusetts Amherst, MA 01003-9264 huiguan@cs.umass.edu |
| Pseudocode | Yes | Algorithm 1 shows the workflow of SPRY. |
| Open Source Code | Yes | Our source code is available for replication at https://github.com/Astuary/Spry. |
| Open Datasets | Yes | Our evaluation uses 8 datasets: AG News [31] (4-class classification), SST2 [32] (2-class classification), Yelp [31] (2-class classification), Yahoo [31] (10-class classification), SNLI [33] (3-class classification), MNLI [34] (3-class classification), SQuADv2 [35] (closed-book question answering), and MultiRC [36] (2-class classification). [...] Available at https://huggingface.co/datasets/ag_news, https://huggingface.co/datasets/yelp_polarity, https://huggingface.co/datasets/yahoo_answers_topics, accessed on 15 May 2024. |
| Dataset Splits | Yes | Each dataset has two versions: (i) Dirichlet α = 1.0 (homogeneous split), and (ii) Dirichlet α = 0.1 (heterogeneous split). The default dataset split is across 1,000 clients, except for the smallest datasets SST2 and MultiRC, where there are 100 clients; SQuADv2 has 500 total clients. [...] SST2... This dataset contains 67,349 training samples, 872 validation samples, and 1,821 testing samples. (A sketch of per-class Dirichlet partitioning appears below the table.) |
| Hardware Specification | Yes | We utilized two Nvidia 1080ti to conduct all experiments of sub-billion sized models and billion-sized models for SPRY and its zero-order methods. We used two RTX8000s and two A100s for Llama2-7B and OPT models on backpropagation-based methods respectively. |
| Software Dependencies | No | The paper mentions "SPRY is implemented in Flower [43] library. Quantization is done using AutoGPTQ [44]." However, it does not provide specific version numbers for these software libraries, which is required for reproducibility. |
| Experiment Setup | Yes | Unless otherwise mentioned in dataset-specific paragraphs, the default hyperparameters for each method and for all datasets are stated here. [...] The learning rate for backpropagation-based (FEDAVG, FEDYOGI, and FEDSGD), zero-order-based (FWDLLM, BAFFLE, and FEDMEZO), and first-order-based SPRY methods is chosen from {1e-3, 5e-4, 1e-4, 1e-5}. The batch size is set to 8. The max sequence length is 128. [...] Default LoRA r and α are 1 and 1, respectively. All methods use ADAMW as the client-side optimizer. Besides FEDAVG, all methods use FEDYOGI as the server-side optimizer. (A hedged configuration sketch combining these defaults appears below the table.) |
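The memory-footprint reduction quoted in the Research Type row comes from replacing backpropagation with forward-mode automatic differentiation. The snippet below is a minimal sketch of the generic forward-gradient estimator that SPRY builds on, not the paper's Algorithm 1: a single `torch.func.jvp` call returns the loss and its directional derivative along a random perturbation, so no activations need to be stored for a backward pass. The tiny linear model, tensor shapes, and variable names are illustrative assumptions.

```python
import torch
from torch.func import functional_call, jvp

# Placeholder model and batch standing in for a finetuned adapter and its data.
model = torch.nn.Linear(16, 4)
x, y = torch.randn(8, 16), torch.randint(0, 4, (8,))
params = {k: p.detach() for k, p in model.named_parameters()}

def loss_fn(p):
    logits = functional_call(model, p, (x,))
    return torch.nn.functional.cross_entropy(logits, y)

# Random perturbation direction: one tangent tensor per trainable tensor.
v = {k: torch.randn_like(p) for k, p in params.items()}

# Forward-mode AD: one forward pass yields the loss and the scalar <grad, v>.
loss, directional_derivative = jvp(loss_fn, (params,), (v,))

# Forward-gradient estimate: scale the perturbation by the directional derivative.
forward_grad = {k: directional_derivative * t for k, t in v.items()}
```

The estimate is unbiased in expectation over the random direction; averaging several perturbations per batch reduces its variance at the cost of extra forward passes.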
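The Dirichlet-based client partitioning described in the Dataset Splits row can be approximated with a per-class Dirichlet allocation like the sketch below. The function name, seeding, and rounding strategy are illustrative assumptions, not the authors' exact splitting code; a smaller α (0.1) yields more heterogeneous label distributions across clients than α = 1.0.

```python
import numpy as np

def dirichlet_split(labels, num_clients, alpha, seed=0):
    """Assign sample indices to clients using a per-class Dirichlet prior."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        # Fraction of class-c samples that each client receives.
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        cut_points = (np.cumsum(proportions) * len(idx)).astype(int)[:-1]
        for client_id, shard in enumerate(np.split(idx, cut_points)):
            client_indices[client_id].extend(shard.tolist())
    return client_indices

# Example: a homogeneous AG News split across the 1,000 default clients.
# splits = dirichlet_split(train_labels, num_clients=1000, alpha=1.0)
```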
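Below is a hedged illustration of how the quoted defaults (LoRA r = 1, α = 1, batch size 8, max sequence length 128, ADAMW on the client) might be wired together for a single client using the Hugging Face `peft` library. The base model `roberta-large`, the `target_modules` choice, and the selected learning rate are assumptions made for the sketch; server-side FEDYOGI/FEDAVG aggregation is handled by the FL framework (e.g., Flower) and is not shown.

```python
from peft import LoraConfig, get_peft_model
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Defaults quoted in the Experiment Setup row; the dict name is illustrative.
DEFAULTS = {"lr_grid": [1e-3, 5e-4, 1e-4, 1e-5], "batch_size": 8,
            "max_seq_length": 128, "lora_r": 1, "lora_alpha": 1}

# Assumed base model for a 4-class task such as AG News.
base = AutoModelForSequenceClassification.from_pretrained("roberta-large", num_labels=4)
tokenizer = AutoTokenizer.from_pretrained("roberta-large",
                                          model_max_length=DEFAULTS["max_seq_length"])

# Rank-1 LoRA adapter; only the adapter weights are trained and communicated.
lora_cfg = LoraConfig(r=DEFAULTS["lora_r"], lora_alpha=DEFAULTS["lora_alpha"],
                      target_modules=["query", "value"], task_type="SEQ_CLS")
client_model = get_peft_model(base, lora_cfg)

# Client-side optimizer; the learning rate would be picked from the quoted grid.
optimizer = AdamW(client_model.parameters(), lr=DEFAULTS["lr_grid"][2])
```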